Complete Machine Learning & Data Science Bootcamp — Udemy(S2-S3)

Section 2: Machine Learning 101 / Section 3: Machine Learning and Data Science Framework

Section 2: Machine Learning 101

What is Machine Learning?

  • Using an algorithm to learn patterns in data, then using that algorithm and what it has learned to make predictions about the future on similar data. Machine learning algorithms are also called models; they find patterns in collected data so we can use those patterns for future problems.
  • By Stanford University's definition, machine learning is the science of getting computers to act without being explicitly programmed; that is, getting machines to do things without us specifically saying "do this, then do that".
Normal algorithm vs. Machine learning algorithm

AI/Machine Learning/ Data Science

  • Narrow AI — Machines can only do one thing really well; each AI is only good at one task.
  • General AI — Like a human, can do many things.
Venn diagram

Types of Machine Learning

Section 3: Machine Learning and Data Science Framework

6 Step Machine Learning Framework

Types of Machine Learning Problems — What problems are we trying to solve?

  • Supervised Learning — You have data and labels. The algorithm tries to use the data to predict a label; if it guesses the label wrong, it corrects itself and tries again. This act of correction is why it's called supervised.
Regression: try to predict numbers
  • Unsupervised learning: has data but no labels. You provide labels that weren't there to begin with; also known as clustering.
  • Transfer Learning: one machine learning model builds on what another model has learned. e.g. use a car-identifying model as the base and transfer it to a dog-identifying model (because training a new model from scratch is expensive).
  • Reinforcement learning: give the computer a punishment or reward after every step. e.g. AlphaGo.
Punishment or reward
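As a minimal sketch of the supervised case, assuming scikit-learn and a synthetic dataset in place of real data, the data-plus-labels loop looks like this:

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import train_test_split

# Synthetic data: X holds the samples, y holds the labels (the "supervision").
X, y = make_classification(n_samples=200, n_features=4, random_state=42)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42
)

# Fit a model on the labelled data, then score it on data it hasn't seen.
model = RandomForestClassifier(random_state=42)
model.fit(X_train, y_train)
accuracy = model.score(X_test, y_test)
print(round(accuracy, 2))
```

During `fit`, the model repeatedly compares its guesses against the known labels and corrects itself — the "supervised" part of supervised learning.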

Types of Data — What kinds of data we have?

  • Structured Data: all the samples are typically in a similar format.
  • Unstructured data: e.g. images, natural language; comes in varying formats.
  • Static: doesn't change over time, e.g. a CSV file.
  • Streaming: changes all the time, e.g. stock prices.
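A quick sketch of structured, static data, assuming pandas (part of the course's toolkit) and a made-up CSV snippet: every row shares the same columns, and the file doesn't change once saved.

```python
import io

import pandas as pd

# Hypothetical structured, static data: a CSV where every sample
# (row) has the same fields (columns).
csv_text = "id,age,heart_rate\n1,52,78\n2,41,65\n3,60,82\n"
df = pd.read_csv(io.StringIO(csv_text))
print(df.shape)  # (3, 3)
```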

Types of Evaluations — What defines success for us?


Features in Data — What do we already know about the data?

  • Feature is another word for the different forms of data.
  • Numerical features / Categorical features
  • Derived: when someone looks at the data and creates a new feature using the existing ones.
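A small sketch of a derived feature, using hypothetical patient records (the field names and the BMI formula here are illustrative, not from the course):

```python
# Hypothetical records with numerical features (height, weight)
# and a categorical feature (sex).
patients = [
    {"height_m": 1.70, "weight_kg": 70, "sex": "F"},
    {"height_m": 1.82, "weight_kg": 95, "sex": "M"},
]

# Derive a new feature (BMI) from the existing ones.
for p in patients:
    p["bmi"] = round(p["weight_kg"] / p["height_m"] ** 2, 1)

print(patients[0]["bmi"])  # 24.2
```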

Modelling — Based on our problem and data, what model should we use?

Splitting Data

  • Validation split: once your model has trained, you can check its results on the validation set and see if you can improve them.
  • These three sets are kept separate.
The proportion usually like this.
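The usual proportions (roughly 80/10/10) can be sketched in plain Python, shuffling first so each set is representative:

```python
import random

data = list(range(100))  # stand-in for 100 labelled samples
random.seed(0)
random.shuffle(data)     # shuffle before splitting

n = len(data)
train = data[: int(n * 0.8)]                  # ~80% for training
val = data[int(n * 0.8): int(n * 0.9)]        # ~10% for validation/tuning
test = data[int(n * 0.9):]                    # ~10% held back for the final test

print(len(train), len(val), len(test))  # 80 10 10
```

Because the three slices don't overlap, the three sets stay separate, which is the whole point of the split.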

Modelling — Picking the Model

  • Which kind of machine learning algorithm to use with which kind of problem.
  • After choosing, the next step is training: lining up the inputs and outputs.

Modelling — Tuning(Validation set)

  • Tuning usually happens on the validation set; however, if you don't have access to a validation set, it can also happen on the training data.
  • e.g. just as a car can be tuned for different styles of driving, a model can be tuned for different types of data.
  • Which model you're using determines which hyperparameters you can tune. e.g. a random forest allows you to change the number of trees.
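The random forest example above can be sketched with scikit-learn (assumed here, with synthetic data): try a few values of one hyperparameter, the number of trees, and keep whichever scores best on the validation set.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import train_test_split

X, y = make_classification(n_samples=300, n_features=6, random_state=0)
# Split off a validation set purely for tuning.
X_train, X_val, y_train, y_val = train_test_split(
    X, y, test_size=0.2, random_state=0
)

# Try several values for the number of trees (n_estimators) and
# keep whichever performs best on the validation set.
best_n, best_score = None, 0.0
for n_trees in (10, 50, 100):
    model = RandomForestClassifier(n_estimators=n_trees, random_state=0)
    model.fit(X_train, y_train)
    score = model.score(X_val, y_val)
    if score > best_score:
        best_n, best_score = n_trees, score

print(best_n, round(best_score, 2))
```

The test set is deliberately not touched here; it stays reserved for the final comparison after tuning is done.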

Modelling Comparison — How will our model perform in the real world?

  • Generalising — how well the model adapts to data it hasn't seen before. e.g. how a heart disease prediction model would perform at classifying whether a new patient has heart disease or not.
  • Over/underfitting — the model hasn't generalised well.
  • Common causes of overfitting/underfitting: data leakage and data mismatch.
  • Data leakage: some of your test data leaks into your training data. The testing data should always stay the same and unseen until the end.
  • Data mismatch: when the data you're testing on is different from the data you're training on, such as having different features in the training data than in the test data.
  • Training data set — use this set for model training; 70–80% of your data is the standard.
  • Validation/development data set — use this set for model hyperparameter tuning and experiment evaluation; 10–15% of your data is the standard.
  • Test data set — use this set for model testing and comparison; 10–15% of your data is the standard.
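Overfitting shows up as a gap between training and test performance. A sketch of that gap, assuming scikit-learn and deliberately noisy synthetic labels (`flip_y` flips a fraction of them):

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

# flip_y=0.2 corrupts 20% of labels, so a perfect test score is impossible.
X, y = make_classification(n_samples=200, n_features=5, flip_y=0.2, random_state=1)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.3, random_state=1
)

# A fully grown decision tree memorises its training data ...
tree = DecisionTreeClassifier(random_state=1)
tree.fit(X_train, y_train)

train_score = tree.score(X_train, y_train)  # perfect on data it has seen
test_score = tree.score(X_test, y_test)     # noticeably lower on unseen data
print(round(train_score, 2), round(test_score, 2))
```

A large train-vs-test gap like this is the classic overfitting signature: the model has learned the training set, not the underlying pattern.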

Experimentation — How could we improve/what can we try next?

Tools we’ll use



Joe Chao