Complete Machine Learning & Data Science Bootcamp — Udemy (S2-S3)

Section 2: Machine Learning 101 / Section 3: Machine Learning and Data Science Framework

Section 2: Machine Learning 101

What is Machine Learning?

  • Using an algorithm to learn patterns in data, then using what it has learned to make predictions about the future from similar data. Machine learning algorithms are also called models. They find patterns in collected data so we can use those patterns on future problems.
  • Stanford University’s definition: machine learning is the science of getting computers to act without being explicitly programmed, that is, getting machines to do things without us specifically saying “do this, then do that”.
Normal algorithm vs. Machine learning algorithm
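
To make the contrast between a normal algorithm and a machine learning algorithm concrete, here is a minimal sketch in plain Python. The temperature data, the function names and the "hot day" task are all made up for illustration; they are not from the course. In the first function a person writes the rule, while in the second the rule (a threshold) is picked from labeled examples.

```python
# Normal algorithm: a human writes the rule directly.
def rule_based_is_hot(temp_c):
    return temp_c >= 30  # threshold chosen by the programmer


# Machine learning flavour: the threshold is chosen from labeled examples.
def learn_threshold(temps, labels):
    """Return the candidate threshold that classifies the most examples correctly."""
    best_threshold, best_correct = None, -1
    for candidate in sorted(temps):
        correct = sum((t >= candidate) == label for t, label in zip(temps, labels))
        if correct > best_correct:
            best_threshold, best_correct = candidate, correct
    return best_threshold


# Hypothetical labeled data: temperature in Celsius and whether people called it hot.
temps = [12, 18, 24, 27, 31, 35]
labels = [False, False, False, True, True, True]

threshold = learn_threshold(temps, labels)


def learned_is_hot(temp_c):
    return temp_c >= threshold
```

With this toy data the two functions disagree on 29 degrees: the hand-written rule says it is not hot, while the learned rule follows the examples it was given.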

AI/Machine Learning/ Data Science

  • Narrow AI — Machines can only do one thing really well, each AI is only good at one task.
  • General AI — Like a human, it can do many things.

Data science overlaps with machine learning. It means looking at a set of data and gaining an understanding of it by comparing different examples and features and by making visualizations such as graphs. It also means running experiments on a set of data in the hope of finding actionable insights within it. One of these experiments may be to build a machine learning model.

Venn diagram

Types of Machine Learning

Supervised Learning — The data we receive already has CATEGORIES, like a CSV file with labeled rows and columns. We have labeled training data and labeled test data, so we know whether our function is right or wrong.

Unsupervised Learning — Data without categories, like a CSV file without labeled column names.

Reinforcement Learning — Teaching a machine through trial and error, using rewards and punishments.

Section 3: Machine Learning and Data Science Framework

6 Step Machine Learning Framework

Types of Machine Learning Problems — What problems are we trying to solve?

  • Supervised Learning — You have data and labels. The algorithm uses the data to predict a label; if it guesses the label wrong, it corrects itself and tries again. The act of correction is why it’s called supervised.
Regression: try to predict a number
  • Unsupervised learning: you have data but no labels. You provide the labels, because they weren’t there to begin with. A common example is clustering.
  • Transfer learning: one machine learning model reuses what another machine learning model has learned, e.g. using a car-identifying model as the base for a dog-identifying model (because training a new model from scratch is expensive).
  • Reinforcement learning: give the computer a reward or punishment after every step, e.g. AlphaGo.
Punishment or reward
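
The "guess, get corrected, try again" loop described for supervised learning can be sketched with a perceptron, one of the simplest learning algorithms. This is only an illustration in plain Python, not the method used in the course; the tiny dataset (the logical AND of two inputs) and all names are assumptions.

```python
def predict(weights, bias, x):
    """Guess label 1 if the weighted sum is positive, otherwise 0."""
    total = sum(w * xi for w, xi in zip(weights, x)) + bias
    return 1 if total > 0 else 0


def train_perceptron(samples, labels, epochs=20, lr=1.0):
    weights = [0.0] * len(samples[0])
    bias = 0.0
    for _ in range(epochs):
        mistakes = 0
        for x, y in zip(samples, labels):
            error = y - predict(weights, bias, x)  # 0 when the guess was right
            if error != 0:
                # Wrong guess: nudge the weights toward the correct label.
                mistakes += 1
                weights = [w + lr * error * xi for w, xi in zip(weights, x)]
                bias += lr * error
        if mistakes == 0:  # every training example is now predicted correctly
            break
    return weights, bias


# Hypothetical labeled data: the label is 1 only when both inputs are 1.
samples = [(0, 0), (0, 1), (1, 0), (1, 1)]
labels = [0, 0, 0, 1]

weights, bias = train_perceptron(samples, labels)
predictions = [predict(weights, bias, x) for x in samples]
```

The labels act as the supervisor here: the model only changes when its guess disagrees with the known answer.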

Types of Data — What kinds of data we have?

  • Unstructured data: e.g. images, natural language. It comes in varying formats.
  • Static: doesn’t change over time. E.g. CSV
  • Streaming: changing all the time. e.g. stock price

Types of Evaluations — What defines success for us?

example

Features in Data — What do we already know about the data?

  • Numerical features / Categorical features
  • Derived: when someone looks at the data and creates a new feature using the existing ones.
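
As a small illustration of a derived feature, the sketch below builds a new numerical feature (body mass index) from two existing ones. The patient records and field names are hypothetical and only serve as an example.

```python
# Hypothetical records with numerical features (weight, height)
# and a categorical feature (sex).
patients = [
    {"weight_kg": 70.0, "height_m": 1.75, "sex": "F"},
    {"weight_kg": 90.0, "height_m": 1.80, "sex": "M"},
]


def add_bmi(record):
    """Return a copy of the record with a derived 'bmi' feature added."""
    enriched = dict(record)  # copy so the original record is left untouched
    enriched["bmi"] = round(record["weight_kg"] / record["height_m"] ** 2, 1)
    return enriched


enriched_patients = [add_bmi(p) for p in patients]
```

The model never sees anything that was not already in the data; the derived column just presents the existing information in a form that may be easier to learn from.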

Modelling — Based on our problem and data, what model should we use?

Splitting Data

  • Validation split: once your model has trained, you can check its results on the validation set and see if you can improve them.
  • These three sets are kept separate.
The proportions usually look like this.
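
A three-way split can be sketched with the standard library alone. The function name, the fixed seed and the 70/15/15 proportions are assumptions chosen for this example; in practice a library helper is normally used instead.

```python
import random


def split_dataset(rows, train_frac=0.7, val_frac=0.15, seed=42):
    """Shuffle the rows, then cut them into training, validation and test sets."""
    shuffled = list(rows)                      # copy so the input is not modified
    random.Random(seed).shuffle(shuffled)      # fixed seed makes the split repeatable
    n_train = round(len(shuffled) * train_frac)
    n_val = round(len(shuffled) * val_frac)
    train = shuffled[:n_train]
    val = shuffled[n_train:n_train + n_val]
    test = shuffled[n_train + n_val:]          # whatever remains becomes the test set
    return train, val, test


rows = list(range(100))  # stand-in for 100 data samples
train, val, test = split_dataset(rows)
```

Because each row lands in exactly one of the three slices, the sets cannot overlap.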

Modelling — Picking the Model

  • After choosing a model, the next step is training, which lines up the inputs with the outputs.

Modelling — Tuning (Validation set)

  • e.g. just as a car can be tuned for different styles of driving, a model can be tuned for different types of data.
  • The hyperparameters you can tune depend on the kind of model you’re using, e.g. a random forest allows you to change the number of trees.
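
Tuning on the validation set can be sketched as "try a few hyperparameter values, keep the one that scores best on validation data". To stay self-contained, this example swaps the random forest for a tiny k-nearest-neighbours classifier written in plain Python, where the hyperparameter is k (the number of neighbours). The data points and names are made up for illustration.

```python
def knn_predict(train, x, k):
    """Majority vote among the k training points closest to x."""
    neighbours = sorted(train, key=lambda pair: abs(pair[0] - x))[:k]
    votes = sum(label for _, label in neighbours)
    return 1 if votes * 2 > k else 0


def accuracy(train, data, k):
    """Fraction of (x, label) pairs in data that are predicted correctly."""
    return sum(knn_predict(train, x, k) == y for x, y in data) / len(data)


# Hypothetical one-feature dataset; (2.6, 1) is a deliberately noisy label.
train = [(1.0, 0), (2.0, 0), (2.6, 1), (3.0, 0), (4.0, 0),
         (6.0, 1), (7.0, 1), (8.0, 1)]
val = [(2.5, 0), (3.4, 0), (6.5, 1), (7.5, 1)]

# Try each candidate value of k and keep the one with the best validation accuracy.
candidates = [1, 3, 5]
best_k = max(candidates, key=lambda k: accuracy(train, val, k))
```

With k=1 the model copies the noisy point and gets one validation example wrong; larger values of k smooth it out. The test set is not touched during this search.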

Modelling Comparison — How will our model perform in the real world?

  • Overfitting / underfitting: the model hasn’t generalised well.
  • Possible causes of overfitting / underfitting include data leakage and data mismatch.
  • Data leakage: some of your test data leaks into your training data. The test data should always stay the same.
  • Data mismatch: the data you’re testing on is different from the data you’re training on, such as having different features in the training data than in the test data.
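
Both problems can be checked for with a few lines of Python. The helper names and the sample rows below are hypothetical; this is a rough sanity check under the assumption that rows are hashable tuples, not a complete leakage audit.

```python
def find_leaked_rows(train_rows, test_rows):
    """Return the test rows that also appear in the training data."""
    train_set = set(train_rows)
    return [row for row in test_rows if row in train_set]


def find_feature_mismatch(train_features, test_features):
    """Return the feature names present in only one of the two datasets."""
    return sorted(set(train_features) ^ set(test_features))


# Hypothetical rows: (age, weight_kg, label).
train_rows = [(34, 70, 0), (51, 88, 1), (29, 62, 0)]
test_rows = [(51, 88, 1), (45, 79, 1)]   # the first row is also in the training set

leaked = find_leaked_rows(train_rows, test_rows)
mismatch = find_feature_mismatch(["age", "weight_kg", "sex"], ["age", "weight_kg"])
```

An empty result from both helpers does not prove the split is clean, but a non-empty one is a clear signal to redo it.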

All experiments should be conducted on different portions of your data.

  • Training data set — Use this set for model training, 70–80% of your data is the standard.
  • Validation/development data set — Use this set for model hyperparameter tuning and experimentation evaluation, 10–15% of your data is the standard.
  • Test data set — Use this set for model testing and comparison, 10–15% of your data is the standard.

(Copied from the course)

Poor performance on training data means the model hasn’t learned properly and is underfitting. Try a different model, improve the existing one through hyperparameter tuning, or collect more data.

Great performance on the training data but poor performance on test data means your model doesn’t generalize well. Your model may be overfitting the training data. Try using a simpler model or making sure your test data is of the same style as the data your model is training on.

Another form of overfitting can come in the form of better performance on test data than training data. This may mean your testing data is leaking into your training data (incorrect data splits) or you’ve spent too much time optimizing your model for the test set data. Ensure your training and test datasets are kept separate at all times and avoid optimizing a model’s performance on the test set (use the training and validation sets for model improvement).

Poor performance once deployed (in the real world) means there’s a difference in what you trained and tested your model on and what is actually happening. Ensure the data you’re using during experimentation matches up with the data you’re using in production.
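
The rules of thumb above can be collected into a small helper that turns a pair of scores into a first guess about what went wrong. The function name and the two cut-off values (`good` and `gap`) are arbitrary assumptions for illustration; sensible values depend on the problem and the metric.

```python
def diagnose(train_score, test_score, good=0.8, gap=0.1):
    """Give a rough first diagnosis from training and test scores (higher is better)."""
    if train_score < good:
        # Poor performance even on the data the model has seen.
        return "underfitting"
    if test_score > train_score:
        # Test data doing better than training data is suspicious.
        return "possible data leakage"
    if train_score - test_score > gap:
        # Great on training data, noticeably worse on test data.
        return "overfitting"
    return "looks reasonable"


print(diagnose(0.60, 0.58))
print(diagnose(0.98, 0.70))
print(diagnose(0.90, 0.95))
print(diagnose(0.90, 0.88))
```

A helper like this only points to where to look next; it cannot detect the deployment mismatch described above, because that needs data from production.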

Experimentation — How could we improve/what can we try next?

Tools we’ll use

…I found the note by the instructor on Medium, here is the link.

An accountant whose soul is woven from science and art; loves drama and photography, but also enjoys data science.