Udacity Data Scientist Nanodegree : Prerequisite — SQL(L3, L4, L5)

Lesson 3: Aggregating / Lesson 4: Subquery, temporary tables / Lesson 5: Data Cleaning

Joe Chao
6 min readApr 17, 2021

NULL — No data

WHERE variable IS NULL    
  • When identifying NULLs in a WHERE clause, we write IS NULL or IS NOT NULL. We don’t use =, because NULL isn't considered a value in SQL. Rather, it is a property of the data.

COUNT

  • 僅回傳 Non-Null data
  • 算總共有幾個
SELECT COUNT(*)
FROM accounts;

SUM

  • 不能用 SUM(*)
  • Unlike COUNT, you can only use SUM on numeric columns. However, SUM will ignore NULL values.
  • SUM treats NULL as 0

Aggregation Reminder

An important thing to remember: aggregators only aggregate vertically — the values of a column. If you want to perform a calculation across rows, you would do this with simple arithmetic.

MIN / MAX

  • 也無視 NULL values
  • Depending on the column type, MIN will return the lowest number, earliest date, or non-numerical value as early in the alphabet as possible. As you might suspect, MAX does the opposite — it returns the highest number, the latest date, or the non-numerical value closest alphabetically to “Z.”

AVG — Average

  • This aggregate function again ignores the NULL values in both the numerator and the denominator.
  • If you want to count NULLs as zero, you will need to use SUM and COUNT.

Tips — Median

  • 通常是拿來計算中央趨勢的最好參數
  • 但是 SQL 計算中位數很困難,難到會被當作是面試題的程度
SELECT *
FROM (SELECT total_amt_usd
FROM orders
ORDER BY total_amt_usd
LIMIT 3457) AS Table1
ORDER BY total_amt_usd DESC
LIMIT 2;

GROUP BY

  • Allow creating segments that will aggregate independent from one another
  • 當你在聚合(aggregate)的時候,其實是在有效的折疊(collapse)返回的行數
所以這個程式碼會有錯,因為只有後三個被折疊到
  • 由上圖,SQL 不清楚到底要怎麼處理 account_id,要分組呢?還是算總合呢?這也是用 GROUP BY 的時間。如果沒有經過 aggregate 且沒有分組的,會回傳錯誤。
  • 夾在 WHERE 跟 ORDER BY 中間

GROUP BY — Expert Tip (看不懂)

Before we dive deeper into aggregations using GROUP BY statements, it is worth noting that SQL evaluates the aggregations before the LIMIT clause. If you don’t group by any columns, you’ll get a 1-row result — no problem there. If you group by a column with enough unique values that it exceeds the LIMIT number, the aggregates will be calculated, and then some rows will simply be omitted from the results.

This is actually a nice way to do things because you know you’re going to get the correct aggregates. If SQL cuts the table down to 100 rows, then performed the aggregations, your results would be substantially different. The above query’s results exceed 100 rows, so it’s a perfect example. In the next concept, use the SQL environment to try removing the LIMIT and running it again to see what changes.

DISTINCT

DISTINCT is always used in SELECT statements, and it provides the unique rows for all columns written in the SELECT statement. Therefore, you only use DISTINCT once in any particular SELECT statement.

You could write:

SELECT DISTINCT column1, column2, column3
FROM table1;

which would return the unique (or DISTINCT) rows across all three columns.

You would not write:

SELECT DISTINCT column1, DISTINCT column2, DISTINCT column3
FROM table1;

You can think of DISTINCT the same way you might think of the statement “unique”.

DISTINCT — Expert Tip

It’s worth noting that using DISTINCT, particularly in aggregations, can slow your queries down quite a bit.

HAVING

  • WHERE 不允許你用 aggregating function 去當條件,這時我們就需要 HAVING

HAVING — Expert Tip

HAVING is the “clean” way to filter a query that has been aggregated, but this is also commonly done using a subquery. Essentially, any time you want to perform a WHERE on an element of your query that was created by an aggregate, you need to use HAVING instead.

DATE function

  • GROUPing BY a date column is not usually very useful in SQL, as these columns tend to have transaction data down to a second

DATE_TRUNC allows you to truncate your date to a particular part of your date-time column. Common truncations are day, month, and year. Here is a great blog post by Mode Analytics on the power of this function.

DATE_PART can be useful for pulling a specific portion of a date, but notice pulling month or day of the week (dow) means that you are no longer keeping the years in order. Rather you are grouping for certain components regardless of which year they belonged in.

Lesson 4: Subquery

Subquery

WITH — CTE

  • The WITH statement is often called a Common Table Expression or CTE.
跟上面那張圖一樣,更快

Lesson 5: Data cleaning

LEFT & RIGHT

Here we looked at three new functions:

  1. LEFT
  2. RIGHT
  3. LENGTH

LEFT pulls a specified number of characters for each row in a specified column starting at the beginning (or from the left). As you saw here, you can pull the first three digits of a phone number using LEFT(phone_number, 3).

RIGHT pulls a specified number of characters for each row in a specified column starting at the end (or from the right). As you saw here, you can pull the last eight digits of a phone number using RIGHT(phone_number, 8).

LENGTH provides the number of characters for each row of a specified column. Here, you saw that we could use this to get the length of each phone number as LENGTH(phone_number).

POSITION, STRPOS, & SUBSTR

In this lesson, you learned about:

  1. POSITION
  2. STRPOS
  3. LOWER
  4. UPPER

POSITION takes a character and a column, and provides the index where that character is for each row. The index of the first position is 1 in SQL. If you come from another programming language, many begin indexing at 0. Here, you saw that you can pull the index of a comma as POSITION(‘,’ IN city_state).

STRPOS provides the same result as POSITION, but the syntax for achieving those results is a bit different as shown here: STRPOS(city_state, ‘,’).

Note, both POSITION and STRPOS are case sensitive, so looking for A is different than looking for a.

Therefore, if you want to pull an index regardless of the case of a letter, you might want to use LOWER or UPPER to make all of the characters lower or uppercase.

CONCAT

In this lesson you learned about:

  1. CONCAT
  2. Piping ||

Each of these will allow you to combine columns together across rows. In this video, you saw how first and last names stored in separate columns could be combined together to create a full name: CONCAT(first_name, ‘ ‘, last_name) or with piping as first_name || ‘ ‘ || last_name.

CAST

In this video, you saw additional functionality for working with dates including:

  1. TO_DATE
  2. CAST
  3. Casting with ::

DATE_PART(‘month’, TO_DATE(month, ‘month’)) here changed a month name into the number associated with that particular month.

Then you can change a string to a date using CAST. CAST is actually useful to change lots of column types. Commonly you might be doing as you saw here, where you change a string to a date using CAST(date_column AS DATE). However, you might want to make other changes to your columns in terms of their data types. You can see other examples here.

In this example, you also saw that instead of CAST(date_column AS DATE), you can use date_column::DATE.

COALESCE = IFNULL

In general, COALESCE returns the first non-NULL value passed for each row.

--

--

Joe Chao

會計背景,但目前在管顧實習。喜歡很多事情,於是都選擇把它一一記錄下來。