Udacity Data Scientist Nanodegree : Prerequisite — SQL(L3, L4, L5)
Lesson 3: Aggregating / Lesson 4: Subquery, temporary tables / Lesson 5: Data Cleaning
NULL — No data
WHERE variable IS NULL
- When identifying NULLs in a WHERE clause, we write IS NULL or IS NOT NULL. We don’t use
=, because NULL isn't considered a value in SQL. Rather, it is a property of the data.
- 僅回傳 Non-Null data
- 不能用 SUM(*)
- Unlike COUNT, you can only use SUM on numeric columns. However, SUM will ignore NULL values.
- SUM treats NULL as 0
An important thing to remember: aggregators only aggregate vertically — the values of a column. If you want to perform a calculation across rows, you would do this with simple arithmetic.
MIN / MAX
- 也無視 NULL values
- Depending on the column type, MIN will return the lowest number, earliest date, or non-numerical value as early in the alphabet as possible. As you might suspect, MAX does the opposite — it returns the highest number, the latest date, or the non-numerical value closest alphabetically to “Z.”
AVG — Average
- This aggregate function again ignores the NULL values in both the numerator and the denominator.
- If you want to count NULLs as zero, you will need to use SUM and COUNT.
Tips — Median
- 但是 SQL 計算中位數很困難，難到會被當作是面試題的程度
FROM (SELECT total_amt_usd
ORDER BY total_amt_usd
LIMIT 3457) AS Table1
ORDER BY total_amt_usd DESC
- Allow creating segments that will aggregate independent from one another
- 由上圖，SQL 不清楚到底要怎麼處理 account_id，要分組呢？還是算總合呢？這也是用 GROUP BY 的時間。如果沒有經過 aggregate 且沒有分組的，會回傳錯誤。
- 夾在 WHERE 跟 ORDER BY 中間
GROUP BY — Expert Tip （看不懂）
Before we dive deeper into aggregations using GROUP BY statements, it is worth noting that SQL evaluates the aggregations before the LIMIT clause. If you don’t group by any columns, you’ll get a 1-row result — no problem there. If you group by a column with enough unique values that it exceeds the LIMIT number, the aggregates will be calculated, and then some rows will simply be omitted from the results.
This is actually a nice way to do things because you know you’re going to get the correct aggregates. If SQL cuts the table down to 100 rows, then performed the aggregations, your results would be substantially different. The above query’s results exceed 100 rows, so it’s a perfect example. In the next concept, use the SQL environment to try removing the LIMIT and running it again to see what changes.
DISTINCT is always used in SELECT statements, and it provides the unique rows for all columns written in the SELECT statement. Therefore, you only use DISTINCT once in any particular SELECT statement.
You could write:
SELECT DISTINCT column1, column2, column3
which would return the unique (or DISTINCT) rows across all three columns.
You would not write:
SELECT DISTINCT column1, DISTINCT column2, DISTINCT column3
You can think of DISTINCT the same way you might think of the statement “unique”.
DISTINCT — Expert Tip
It’s worth noting that using DISTINCT, particularly in aggregations, can slow your queries down quite a bit.
- WHERE 不允許你用 aggregating function 去當條件，這時我們就需要 HAVING
HAVING — Expert Tip
HAVING is the “clean” way to filter a query that has been aggregated, but this is also commonly done using a subquery. Essentially, any time you want to perform a WHERE on an element of your query that was created by an aggregate, you need to use HAVING instead.
- GROUPing BY a date column is not usually very useful in SQL, as these columns tend to have transaction data down to a second
DATE_TRUNC allows you to truncate your date to a particular part of your date-time column. Common truncations are
year. Here is a great blog post by Mode Analytics on the power of this function.
DATE_PART can be useful for pulling a specific portion of a date, but notice pulling
month or day of the week (
dow) means that you are no longer keeping the years in order. Rather you are grouping for certain components regardless of which year they belonged in.
Lesson 4: Subquery
WITH — CTE
- The WITH statement is often called a Common Table Expression or CTE.
Lesson 5: Data cleaning
LEFT & RIGHT
Here we looked at three new functions:
LEFT pulls a specified number of characters for each row in a specified column starting at the beginning (or from the left). As you saw here, you can pull the first three digits of a phone number using LEFT(phone_number, 3).
RIGHT pulls a specified number of characters for each row in a specified column starting at the end (or from the right). As you saw here, you can pull the last eight digits of a phone number using RIGHT(phone_number, 8).
LENGTH provides the number of characters for each row of a specified column. Here, you saw that we could use this to get the length of each phone number as LENGTH(phone_number).
POSITION, STRPOS, & SUBSTR
In this lesson, you learned about:
POSITION takes a character and a column, and provides the index where that character is for each row. The index of the first position is 1 in SQL. If you come from another programming language, many begin indexing at 0. Here, you saw that you can pull the index of a comma as POSITION(‘,’ IN city_state).
STRPOS provides the same result as POSITION, but the syntax for achieving those results is a bit different as shown here: STRPOS(city_state, ‘,’).
Note, both POSITION and STRPOS are case sensitive, so looking for A is different than looking for a.
Therefore, if you want to pull an index regardless of the case of a letter, you might want to use LOWER or UPPER to make all of the characters lower or uppercase.
In this lesson you learned about:
Each of these will allow you to combine columns together across rows. In this video, you saw how first and last names stored in separate columns could be combined together to create a full name: CONCAT(first_name, ‘ ‘, last_name) or with piping as first_name || ‘ ‘ || last_name.
In this video, you saw additional functionality for working with dates including:
- Casting with
DATE_PART(‘month’, TO_DATE(month, ‘month’)) here changed a month name into the number associated with that particular month.
Then you can change a string to a date using CAST. CAST is actually useful to change lots of column types. Commonly you might be doing as you saw here, where you change a
string to a
date using CAST(date_column AS DATE). However, you might want to make other changes to your columns in terms of their data types. You can see other examples here.
In this example, you also saw that instead of CAST(date_column AS DATE), you can use date_column::DATE.
COALESCE = IFNULL
In general, COALESCE returns the first non-NULL value passed for each row.