Udacity Data Scientist Nanodegree : Prerequisite — Practical Statistics(L1, L2)

Lesson 1 / 2: Descriptive Statistics

What is data?

Data Types

  • Continuous data can be split into smaller and smaller units, and a still smaller unit always exists. An example is age.
  • Discrete data take on only countable values. An example is the number of dogs.

Categorical data are used to label a group or set of items (like dog breeds: Collies, Labs, Poodles, etc.). We can divide categorical data further into two types: Ordinal and Nominal.

  • Categorical Ordinal data take on a ranked ordering (like rating an interaction with the dogs on a scale from Very Poor to Very Good).
  • Categorical Nominal data do not have an order or ranking (like the breeds of the dog).
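
The difference between nominal and ordinal data can be sketched in a few lines of Python. The survey values below are invented for illustration, and the middle label "Neutral" is an assumption; the notes only name the two ends of the scale.

```python
from collections import Counter

# Hypothetical survey data, invented for illustration.
breeds = ["Collie", "Lab", "Poodle", "Lab", "Lab"]            # nominal: no order
ratings = ["Good", "Very Poor", "Very Good", "Poor", "Good"]  # ordinal: ranked

# Nominal data can be counted, but sorting the labels has no real meaning.
breed_counts = Counter(breeds)

# Ordinal data carry a ranking, so the order is spelled out explicitly.
# "Neutral" is an assumed middle label, not taken from the notes.
rating_order = ["Very Poor", "Poor", "Neutral", "Good", "Very Good"]
rank = {label: position for position, label in enumerate(rating_order)}
sorted_ratings = sorted(ratings, key=rank.get)

print(breed_counts.most_common(1))  # the most frequent breed
print(sorted_ratings)               # ratings from worst to best
```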

Quantitative variables

Measures of Centre (3M’s: Mean, Median, Mode)

Mean

  • Not suitable when there are outliers, and the result may be a decimal even when every data value is a whole number.
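
A minimal sketch of both points, using made-up counts of dogs seen per day: the mean of whole numbers need not be a whole number, and a single extreme value drags it far from the typical value.

```python
from statistics import mean

# Hypothetical counts of dogs seen per day (all whole numbers).
dogs_per_day = [1, 2, 2, 3, 4]
typical_mean = mean(dogs_per_day)
print(typical_mean)  # 2.4, not a whole number

# Add one extreme day and the mean is pulled above every ordinary value.
with_outlier = dogs_per_day + [50]
outlier_mean = mean(with_outlier)
print(outlier_mean)
```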

Median

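  • Median with an odd number of values (the direct middle): n = 7, so (7 + 1) / 2 = 4, and the fourth value is the median.
  • Median with an even number of values (the average of the two middle values): n = 8, so 8 / 2 = 4, and the median is the average of the fourth and fifth values.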
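
The two position rules above can be sketched as a small function. The name median_by_position and the sample values are illustrative; Python's statistics.median does the same job.

```python
def median_by_position(values):
    """Return the median by sorting and picking the middle position(s)."""
    ordered = sorted(values)
    n = len(ordered)
    if n == 0:
        raise ValueError("median of empty data")
    if n % 2 == 1:
        # Odd n: the value at 1-based position (n + 1) / 2 is the median.
        return ordered[(n + 1) // 2 - 1]
    # Even n: average the values at 1-based positions n / 2 and n / 2 + 1.
    return (ordered[n // 2 - 1] + ordered[n // 2]) / 2


odd = [5, 1, 9, 3, 7, 11, 13]        # n = 7
even = [5, 1, 9, 3, 7, 11, 13, 15]   # n = 8
print(median_by_position(odd))   # 7, the 4th value after sorting
print(median_by_position(even))  # 8.0, the average of 7 and 9
```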

Mode

  • There may be no mode, or there may be several.
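
A small sketch of how a dataset can have one mode, several modes, or none. The helper name modes is hypothetical, and the "no mode when every value appears once" rule is a common textbook convention assumed here, not something the notes state.

```python
from collections import Counter


def modes(values):
    """Return every value tied for the highest count (possibly none)."""
    counts = Counter(values)
    if not counts:
        return []
    highest = max(counts.values())
    # Assumed convention: if several distinct values each appear once,
    # report no mode at all.
    if highest == 1 and len(counts) > 1:
        return []
    return sorted(value for value, count in counts.items() if count == highest)


print(modes([1, 2, 2, 3]))     # [2]
print(modes([1, 2, 2, 3, 3]))  # [2, 3]
print(modes([1, 2, 3]))        # []
```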

What is Notation?

  • Think of notation as a universal language used by academic and industry professionals to convey mathematical ideas.
  • Very important. Notation simplifies what you want to convey and lets you read many algorithm-related documents, among other things.

Random Variables

  • Def: A placeholder for the possible values of some process. Notation: X.
  • X is an entire set of possible values.

Capital vs. Lower

Random variables are written with capital letters (X); observed values are written with lowercase letters (x).

Summation
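
As a sketch of the notation, assume x_1, ..., x_n are the observed values of a random variable X (the index i and the count n are symbols chosen here for illustration). The summation sign adds the observed values, and the mean is that sum divided by n:

```latex
\sum_{i=1}^{n} x_i = x_1 + x_2 + \cdots + x_n,
\qquad
\bar{x} = \frac{1}{n} \sum_{i=1}^{n} x_i
```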

Measures of Spread — How far are points from one another

  • Interquartile Range (IQR)
  • Standard Deviation
  • Variance

Histogram — The most common visual for quantitative data
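
A histogram groups quantitative values into bins and counts how many fall in each. The text-only sketch below assumes equal-width bins and uses invented ages; a plotting library would normally draw the bars.

```python
from collections import Counter


def histogram_counts(values, bin_width):
    """Count how many values fall into each bin [start, start + bin_width)."""
    counts = Counter((value // bin_width) * bin_width for value in values)
    return dict(sorted(counts.items()))


# Hypothetical ages, invented for illustration.
ages = [3, 5, 7, 12, 14, 15, 18, 21, 22, 35]
bin_width = 10
age_bins = histogram_counts(ages, bin_width)

for start, count in age_bins.items():
    print(f"[{start:>2}, {start + bin_width:>2}) {'#' * count}")
```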

5 Number Summary

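  • How do you find it? Sort the values first; then you can read off the maximum, the minimum, and the median (i.e. Q2).
  • Q1 and Q3 are the medians of the values on either side of Q2.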
  • Box plot: useful for quickly comparing the spread of two data sets. For datasets that are not symmetric, the five number summary and a corresponding box plot are a great way to get started with understanding the spread of your data. Although I still prefer a histogram in most cases, box plots can make it easier to compare two or more groups.
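
The steps above can be sketched as follows. The function name and data are illustrative, and it assumes the convention that the middle value is left out of both halves when n is odd; other quartile conventions exist and can give slightly different Q1 and Q3.

```python
from statistics import median


def five_number_summary(values):
    """Return min, Q1, Q2, Q3 and max using the median-of-halves rule."""
    ordered = sorted(values)
    n = len(ordered)
    if n < 2:
        raise ValueError("need at least two values")
    half = n // 2
    lower = ordered[:half]          # values to the left of Q2
    upper = ordered[half + n % 2:]  # values to the right of Q2
    return {
        "min": ordered[0],
        "q1": median(lower),
        "q2": median(ordered),
        "q3": median(upper),
        "max": ordered[-1],
    }


data = [5, 8, 3, 2, 1, 3, 10]  # hypothetical values
summary = five_number_summary(data)
iqr = summary["q3"] - summary["q1"]  # interquartile range
print(summary)
print(iqr)
```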

Standard Deviation and Variance

The standard deviation of a population.
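
For reference, the standard textbook definitions can be written as below, assuming a population of N values x_1, ..., x_N with mean mu (the symbols are chosen here for illustration). The variance is the average squared distance from the mean, and the standard deviation is its square root:

```latex
\mu = \frac{1}{N} \sum_{i=1}^{N} x_i,
\qquad
\sigma^2 = \frac{1}{N} \sum_{i=1}^{N} (x_i - \mu)^2,
\qquad
\sigma = \sqrt{\sigma^2}
```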

Important Final Points

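  1. When comparing the spread of two datasets, the units must be the same.
  2. When data relate to money or economics, a high variance or standard deviation is usually associated with high risk.
  3. In practice, the standard deviation is used more often than the variance because it is in the original units (the variance is in squared units).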
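
A minimal sketch of points 1 and 3, using invented prices: converting dollars to cents multiplies the standard deviation by 100 but the variance by 10,000, which is why spreads are only comparable in the same units and why the standard deviation is easier to read.

```python
from statistics import pstdev, pvariance

# Hypothetical prices, treated as a whole population.
prices_dollars = [10, 14, 10, 6]
prices_cents = [price * 100 for price in prices_dollars]

var_dollars = pvariance(prices_dollars)  # in dollars squared
sd_dollars = pstdev(prices_dollars)      # in dollars
var_cents = pvariance(prices_cents)      # in cents squared
sd_cents = pstdev(prices_cents)          # in cents

print(var_dollars, sd_dollars)
print(var_cents, sd_cents)
```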

Shape of Distribution

In the most common symmetric distribution, mean = median = mode.
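
A quick check of that equality on a small, invented symmetric dataset. The second dataset stretches the right tail to show that the equality breaks once the shape is no longer symmetric.

```python
from statistics import mean, median, mode

# Hypothetical data: symmetric around 3.
symmetric = [1, 2, 2, 3, 3, 3, 4, 4, 5]
# Same data with the largest value pulled far to the right.
right_skewed = [1, 2, 2, 3, 3, 3, 4, 4, 50]

print(mean(symmetric), median(symmetric), mode(symmetric))
print(mean(right_skewed), median(right_skewed), mode(right_skewed))
```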

Outliers

Common Techniques

1. Noting they exist and the impact on summary statistics.

2. If an outlier is a typo, remove or fix it.

3. Understanding why they exist, and the impact on questions we are trying to answer about our data.

4. Reporting the 5 number summary values is often a better indication than measures like the mean and standard deviation when we have outliers.

5. Be careful in reporting. Know how to ask the right questions.
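
To illustrate points 1 and 4, the sketch below compares summary statistics with and without a single extreme value. The salary figures are invented; the pattern to notice is that the mean and standard deviation move a lot while the median barely changes.

```python
from statistics import mean, median, pstdev

# Hypothetical salaries in thousands.
salaries = [42, 45, 47, 50, 52, 55, 58]
with_outlier = salaries + [400]

for label, data in (("without outlier", salaries), ("with outlier", with_outlier)):
    print(
        label,
        "mean:", round(mean(data), 1),
        "sd:", round(pstdev(data), 1),
        "median:", median(data),
    )
```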

Outliers Advice

2. Handle outliers accordingly via the methods above.

3. If no outliers and your data follow a normal distribution — use the mean and standard deviation to describe your dataset, and report that the data are normally distributed.

Side note

4. If you have skewed data or outliers, use the five number summary to summarize your data and report the outliers.

Descriptive Statistics and Inferential Statistics

Descriptive Statistics

Inferential Statistics

We looked at specific examples that allowed us to identify the following:

  1. Population: our entire group of interest.
  2. Parameter: numeric summary about a population.
  3. Sample: subset of the population.
  4. Statistic: numeric summary about a sample.
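
The four terms can be sketched with simulated data. Everything below is invented for illustration: the population is random integers, the parameter and statistic are both means, and the sample size of 50 is arbitrary.

```python
import random
from statistics import mean

random.seed(0)  # fixed seed so the sketch is repeatable

# Population: the entire (hypothetical) group of interest.
population = [random.randint(1, 100) for _ in range(1000)]
# Parameter: a numeric summary of the population.
parameter = mean(population)

# Sample: a subset of the population, drawn without replacement.
sample = random.sample(population, 50)
# Statistic: a numeric summary of the sample, used to estimate the parameter.
statistic = mean(sample)

print(parameter, statistic)
```

In inferential statistics the parameter is usually unknown, and the statistic is what we use to draw conclusions about it.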
