Complete Machine Learning & Data Science Bootcamp — Udemy(S2-S3)

Section 2: Machine Learning 101 / Section 3: Machine Learning and Data Science Framework

Joe Chao

8 min read · Apr 21, 2021

Section 2: Machine Learning 101

To sum up, in my view machine learning means training a computer to act like a human.

What is Machine Learning?

Definition:

Machine learning uses an algorithm to learn patterns in data, and then uses what the algorithm has learned to make predictions about similar data in the future. Trained machine learning algorithms are also called models. They find patterns in collected data so that we can apply those patterns to future problems.
Stanford University’s definition: machine learning is the science of getting computers to act without being explicitly programmed, that is, getting machines to do things without us specifically saying "do this, then do that".

Normal algorithm vs. Machine learning algorithm
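
One way to read this contrast: in a normal algorithm a person writes the rule, while a machine learning algorithm works the rule out from example inputs and their known outputs. The sketch below is only an illustration of that idea, not course material. The study-hours data, the hand-picked threshold of 4, and the function names are all made up for the example.

```python
# Hypothetical toy data: (hours_studied, passed_exam) pairs, invented for illustration.
examples = [(1, False), (2, False), (3, False), (5, True), (6, True), (8, True)]


def normal_algorithm(hours):
    """Normal algorithm: a human writes the rule (the threshold 4 is hand-picked)."""
    return hours >= 4


def learn_threshold(data):
    """Machine learning flavour: derive the rule from inputs and known outputs.

    Tries every midpoint between neighbouring inputs and keeps the threshold
    that classifies the most examples correctly.
    """
    xs = sorted(x for x, _ in data)
    candidates = [(a + b) / 2 for a, b in zip(xs, xs[1:])]

    def n_correct(t):
        return sum((x >= t) == y for x, y in data)

    return max(candidates, key=n_correct)


threshold = learn_threshold(examples)


def learned_model(hours):
    """The 'model' is just the rule that was learned from the examples."""
    return hours >= threshold


print(threshold, learned_model(7), learned_model(2))
```

Both functions end up applying a threshold, but only the second one got it from the data.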

AI/Machine Learning/ Data Science

Artificial Intelligence: human intelligence exhibited by machines. An AI is a machine that acts like a human.

Narrow AI: a machine that does one thing really well. Each AI is good at only one task.
General AI: a machine that, like a human, can do many things.

Data science overlaps with machine learning. It means looking at a set of data and gaining an understanding of it by comparing different examples and features, and by making visualizations such as graphs. It also means running experiments on a set of data in the hope of finding actionable insights within it. One of these experiments may be to build a machine learning model.

Types of Machine Learning

Machine Learning: simply put, predicting results based on incoming data.

Supervised Learning: the data we receive already has CATEGORIES (labels), like a CSV file with labeled rows and columns. Both the training data and the test data are labeled, so we know whether our function is right or wrong.

Unsupervised Learning: data without categories, like a CSV file whose columns have no names.

Reinforcement Learning: teaching a machine through trial and error, using rewards and punishments.
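
To make the unsupervised case more concrete, here is a minimal sketch of clustering, one common unsupervised technique. It is a tiny one-dimensional k-means written for illustration only. The spend values, the starting centers, and the function name `k_means_1d` are assumptions, not something from the course.

```python
# Hypothetical unlabelled data: customer spend values, invented for illustration.
points = [1.0, 1.5, 2.0, 9.0, 10.0, 11.0]


def k_means_1d(values, centers, n_iter=10):
    """Very small 1-D k-means.

    Repeatedly assigns each value to its nearest center, then moves each
    center to the mean of the values assigned to it.
    """
    groups = [[] for _ in centers]
    for _ in range(n_iter):
        groups = [[] for _ in centers]
        for v in values:
            nearest = min(range(len(centers)), key=lambda i: abs(v - centers[i]))
            groups[nearest].append(v)
        # Keep the old center if a group happens to be empty.
        centers = [sum(g) / len(g) if g else c for g, c in zip(groups, centers)]
    return centers, groups


centers, groups = k_means_1d(points, centers=[0.0, 5.0])
print(centers)
print(groups)
```

The algorithm only finds the groups. Naming them (for example "low spenders" and "high spenders") is still up to a person.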

Section 3: Machine Learning and Data Science Framework

*Come back to this section after finishing the whole course*

6 Step Machine Learning Framework

Types of Machine Learning Problems — What problems are we trying to solve?

Supervised Learning: you have data and labels. The algorithm uses the data to predict a label. If it guesses the label wrong, it corrects itself and tries again. This act of correction is why it’s called supervised.

Unsupervised Learning: you have data but no labels. The algorithm groups similar samples and you provide the labels afterwards, because they weren’t there to begin with. Clustering is an example.

Transfer Learning: what one machine learning model has learned is reused in another model, e.g. using a car-identifying model as the base for a dog-identifying model (because training a new model from scratch is expensive).

Reinforcement Learning: the computer receives a punishment or a reward after every step, e.g. AlphaGo.
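
The "guess, check against the label, correct, try again" loop described for supervised learning can be sketched with a perceptron-style update. This is one simple technique chosen for illustration, not necessarily what the course uses. The four labelled points and the variable names are hypothetical.

```python
# Hypothetical labelled data: (feature_1, feature_2) -> label in {0, 1}.
data = [((2.0, 1.0), 1), ((3.0, 2.5), 1), ((-1.0, -2.0), 0), ((-2.5, -1.0), 0)]

weights = [0.0, 0.0]
bias = 0.0


def predict(x):
    """Guess a label from the current weights."""
    score = weights[0] * x[0] + weights[1] * x[1] + bias
    return 1 if score > 0 else 0


for epoch in range(10):
    mistakes = 0
    for x, label in data:
        error = label - predict(x)  # 0 when the guess is right, +1 or -1 when wrong
        if error != 0:
            # The correction step: nudge the weights towards the right answer.
            mistakes += 1
            weights[0] += error * x[0]
            weights[1] += error * x[1]
            bias += error
    if mistakes == 0:
        break  # every guess matched its label, so there is nothing left to correct

print(weights, bias, [predict(x) for x, _ in data])
```

The labels are what make the correction possible: without them the loop would have no way of knowing a guess was wrong.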

Types of Data: What kind of data do we have?

Structured data: all the samples are typically in a similar format.
Unstructured data: e.g. images and language. The formatting varies.
Static: doesn’t change over time, e.g. a CSV file.
Streaming: changes all the time, e.g. stock prices.

Types of Evaluations — What defines success for us?

E.g. if your problem is to use patient medical records to classify whether someone has heart disease or not, you need a highly accurate model.
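
As a rough sketch of how "success" could be measured for the heart disease example: accuracy is the metric named above, while precision and recall are added here as an assumption, since they are commonly reported alongside accuracy for classification. The labels and predictions are invented, and `classification_metrics` is a hypothetical helper.

```python
def classification_metrics(y_true, y_pred):
    """Compute accuracy, precision and recall for binary labels (1 = has disease)."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(1 for t, p in pairs if t == 1 and p == 1)  # sick, predicted sick
    tn = sum(1 for t, p in pairs if t == 0 and p == 0)  # healthy, predicted healthy
    fp = sum(1 for t, p in pairs if t == 0 and p == 1)  # healthy, predicted sick
    fn = sum(1 for t, p in pairs if t == 1 and p == 0)  # sick, predicted healthy
    return {
        "accuracy": (tp + tn) / len(pairs),
        "precision": tp / (tp + fp) if tp + fp else 0.0,
        "recall": tp / (tp + fn) if tp + fn else 0.0,
    }


# Hypothetical true labels and model predictions for eight patients.
y_true = [1, 0, 1, 1, 0, 0, 1, 0]
y_pred = [1, 0, 0, 1, 0, 1, 1, 0]

metrics = classification_metrics(y_true, y_pred)
print(metrics)
```

Which number counts as "good enough" depends on the problem. For a medical classifier, a missed patient (a false negative) may matter more than overall accuracy.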

Features in Data — What do we already know about the data?

Features is another word for the different forms of data.
Numerical features / Categorical features

Derived features: when someone looks at the data and creates a new feature using the existing ones.
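
A small sketch of the three kinds of features, using made-up patient records. The column names and the choice of body mass index (BMI) as the derived feature are assumptions for illustration only.

```python
# Hypothetical patient records, invented for illustration.
patients = [
    {"weight_kg": 70.0, "height_m": 1.75, "sex": "F"},
    {"weight_kg": 90.0, "height_m": 1.80, "sex": "M"},
]

numerical_features = ["weight_kg", "height_m"]  # numbers
categorical_features = ["sex"]                  # one of a fixed set of values

# Derived feature: BMI is created from two existing numerical features.
for p in patients:
    p["bmi"] = round(p["weight_kg"] / p["height_m"] ** 2, 1)

print([p["bmi"] for p in patients])
```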

Modelling — Based on our problem and data, what model should we use?

Splitting Data

Validation split: once your model has trained, you can check its results on the validation set and see whether you can improve them.
The three sets (training, validation and test) are kept separate.
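
A minimal sketch of a three-way split using only the standard library. The 70/15/15 proportions, the fixed seed, and the function name `train_val_test_split` are assumptions for the example, and real projects often use a library helper instead.

```python
import random


def train_val_test_split(samples, train_frac=0.7, val_frac=0.15, seed=42):
    """Shuffle the samples and cut them into three non-overlapping sets."""
    shuffled = list(samples)
    random.Random(seed).shuffle(shuffled)  # fixed seed so the split is repeatable
    n_train = round(len(shuffled) * train_frac)
    n_val = round(len(shuffled) * val_frac)
    train = shuffled[:n_train]
    val = shuffled[n_train:n_train + n_val]
    test = shuffled[n_train + n_val:]      # whatever is left over
    return train, val, test


samples = list(range(100))  # stand-in for 100 rows of data
train, val, test = train_val_test_split(samples)
print(len(train), len(val), len(test))
```

Because each sample lands in exactly one slice, no test sample can end up in the training set.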

Modelling — Picking the Model

What kind of machine learning algorithm to use with what kind of problem.

After choosing a model, the next step is training, where the model learns to line up the inputs with the outputs.

Modelling — Tuning(Validation set)

Tuning usually happens on the validation set. However, if you don’t have access to a validation set, it can also happen on the training data.
E.g. just as a car can be tuned for different styles of driving, a model can be tuned for different types of data.
The kind of model you’re using determines which hyperparameters you can tune, e.g. a random forest lets you change the number of trees.
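
Tuning on a validation set can be sketched as: try several hyperparameter values, score each on the validation data, keep the best. To stay dependency-free, this example swaps the random forest for a tiny k-nearest-neighbours classifier, where `k` plays the role that the number of trees plays for a random forest. The data and function names are hypothetical.

```python
from collections import Counter

# Hypothetical 1-D data: (feature, label) pairs, invented for illustration.
train = [(1.0, 0), (1.5, 0), (2.0, 0), (2.5, 1), (3.0, 1), (3.5, 1), (4.0, 1)]
validation = [(1.2, 0), (1.8, 0), (2.9, 1), (3.8, 1)]


def knn_predict(x, k):
    """Predict by majority vote among the k nearest training samples."""
    neighbours = sorted(train, key=lambda sample: abs(sample[0] - x))[:k]
    votes = Counter(label for _, label in neighbours)
    return votes.most_common(1)[0][0]


def accuracy(data, k):
    """Fraction of samples in `data` that are predicted correctly."""
    return sum(knn_predict(x, k) == y for x, y in data) / len(data)


# Score each candidate value of the hyperparameter on the validation set only.
scores = {k: accuracy(validation, k) for k in (1, 3, 5)}
best_k = max(scores, key=scores.get)
print(scores, best_k)
```

The test set is deliberately not touched here. It is saved for the final comparison.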

Modelling Comparison — How will our model perform in the real world?

Generalising: the model adapts to data it hasn’t seen before, e.g. how well a heart disease prediction model would perform at classifying whether a new patient has heart disease or not.
Over / underfitting: the model hasn’t generalised well.

Common causes of overfitting / underfitting: data leakage and data mismatch
Data leakage: some of your test data leaks into your training data. The test data should always stay the same and be kept separate from the training data.
Data mismatch: when the data you’re testing on is different from the data you’re training on, such as having different features in the training data than in the test data.
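
Both problems can be checked for with simple set operations. This is a rough sketch under the assumption that rows can be compared directly and that features are identified by column name. The rows and column names are invented.

```python
# Hypothetical rows: (patient_id, age, has_heart_disease).
train_rows = [("p1", 63, 1), ("p2", 45, 0), ("p3", 58, 1)]
test_rows = [("p3", 58, 1), ("p4", 39, 0)]

# Data leakage check: any row that appears in both sets has leaked.
leaked_rows = set(train_rows) & set(test_rows)

# Data mismatch check: compare the feature names of the two sets.
train_columns = ["age", "cholesterol", "max_heart_rate"]
test_columns = ["age", "cholesterol"]
only_in_train = sorted(set(train_columns) - set(test_columns))
only_in_test = sorted(set(test_columns) - set(train_columns))

print(leaked_rows)
print(only_in_train, only_in_test)
```

An empty result for all three checks is what you want to see before training.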

All experiments should be conducted on different portions of your data.

Training data set — Use this set for model training, 70–80% of your data is the standard.
Validation/development data set — Use this set for model hyperparameter tuning and experimentation evaluation, 10–15% of your data is the standard
Test data set — Use this set for model testing and comparison, 10–15% of your data is the standard.

(Copied from the course)

Poor performance on training data means the model hasn’t learned properly and is underfitting. Try a different model, improve the existing one through hyperparameter tuning, or collect more data.

Great performance on the training data but poor performance on test data means your model doesn’t generalize well. Your model may be overfitting the training data. Try using a simpler model or making sure the test data is of the same style as the data your model is training on.

Another form of overfitting can come in the form of better performance on test data than training data. This may mean your testing data is leaking into your training data (incorrect data splits) or you’ve spent too much time optimizing your model for the test set data. Ensure your training and test datasets are kept separate at all times and avoid optimizing a model’s performance on the test set (use the training and validation sets for model improvement).

Poor performance once deployed (in the real world) means there’s a difference in what you trained and tested your model on and what is actually happening. Ensure the data you’re using during experimentation matches up with the data you’re using in production.
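
The first three rules of thumb above can be written down as a small helper that compares a training score with a test score. This is only a sketch: the thresholds `good` and `gap` are hypothetical values picked for illustration, and real diagnoses need more context than two numbers.

```python
def diagnose(train_score, test_score, good=0.8, gap=0.1):
    """Rough diagnosis from two accuracy-like scores between 0 and 1.

    `good` (minimum acceptable training score) and `gap` (largest acceptable
    difference between the two scores) are hypothetical thresholds.
    """
    if train_score < good:
        return "underfitting"           # poor performance on training data
    if train_score - test_score > gap:
        return "overfitting"            # great on training, poor on test
    if test_score - train_score > gap:
        return "possible data leakage"  # test clearly better than training
    return "looks reasonable"


print(diagnose(0.60, 0.55))
print(diagnose(0.98, 0.70))
print(diagnose(0.82, 0.97))
print(diagnose(0.90, 0.88))
```

The fourth rule (poor performance once deployed) cannot be checked this way, because it depends on how production data compares with the data used during experimentation.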