Udacity Data Scientist Nanodegree : Prerequisite — SQL(L1, L2)

Lesson 1: Basic SQL / Lesson 2: SQL joins

Entity Relationship Diagrams

  1. The names of the tables.
  2. The columns in each table.
  3. The way the tables work together.

You can think of each of the boxes below as a spreadsheet.

What to Notice

  1. web_events
  2. accounts
  3. orders
  4. sales_reps
  5. region

You can think of each of these tables as an individual spreadsheet. Then the columns in each spreadsheet are listed below the table name. For example, the region table has two columns: id and name. Alternatively the web_events table has four columns.

Why Do Data Analysts Use SQL?

  • Traditional databases allow us to access data directly.
  • Traditional databases allow us to audit and replicate our data.
  • SQL is a great tool for analyzing multiple tables at once.
  • SQL allows you to analyze more complex questions than dashboard tools like Google Analytics.

Video: How Databases Store Data

  1. All the data in the same column must match in terms of data type.
  2. Consistent column types are one of the main reasons working with databases is fast.
    Often databases hold a LOT of data. So, knowing that the columns are all of the same type of data means that obtaining data from a database can still be fast.

Small Differences

Types of Statements

  1. CREATE TABLE is a statement that creates a new table in a database.
  2. DROP TABLE is a statement that removes a table in a database.
  3. SELECT allows you to read data and display it. This is called a query.

SELECT & FROM

  • What data do you want to pull from?
  • Which elements from the database do you want to pull?

“From” — tells the query which table to use(What data to use?), where you put the tables from which you would like to pull data

“SELECT”, “FROM” these words are called clauses.

  1. SELECT indicates which column(s) you want to be given the data for.
  2. FROM specifies from which table(s) you want to select the columns. Notice the columns need to exist in this table.

If you want to be provided with the data from all columns in the table, you use “*”, like so:

  • SELECT * FROM orders

Note that using SELECT does not create a new table with these columns in the database, it just provides the data to you as the results, or output, of this command.

SELECT, FROM 基本上是必須的。

It is common and best practice to capitalize all SQL commands, like SELECT and FROM, and keep everything else in your query lower case.

Semicolons

LIMIT

  • The LIMIT command is always the very last part of a query. An example of showing just the first 10 rows of the orders table with all of the columns might look like the following:
SELECT *
FROM orders
LIMIT 10;

ORDER BY

In other words, when you use ORDER BY in a SQL query, your output will be sorted that way, but then the next query you run will encounter the unsorted data again.

The ORDER BY statement always comes in a query after the SELECT and FROM statements, but before the LIMIT statement. If you are using the LIMIT statement, it will always appear last. As you learn additional commands, the order of these statements will matter more.

Pro Tip

SELECT account_id, total_amt_usd
FROM orders
ORDER BY total_amt_usd DESC, account_id;

Solutions to previous ORDER BY Questions

SELECT id, account_id, total_amt_usd
FROM orders
ORDER BY account_id, total_amt_usd DESC;
  • Now write a query that again displays order ID, account ID, and total dollar amount for each order, but this time sorted first by total dollar amount (in descending order), and then by account ID (in ascending order).
SELECT id, account_id, total_amt_usd
FROM orders
ORDER BY total_amt_usd DESC, account_id;

Compare the results of these two queries above. How are the results different when you switch the column you sort on first?
In query #1, all of the orders for each account ID are grouped together, and then within each of those groupings, the orders appear from the greatest order amount to the least.

In query #2, since you sorted by the total dollar amount first, the orders appear from greatest to least regardless of which account ID they were from. Then they are sorted by account ID next. (The secondary sorting by account ID is difficult to see here, since only if there were two orders with equal total dollar amounts would there need to be any sorting by account ID.)

WHERE

Using the WHERE statement, we can display subsets of tables based on conditions that must be met. You can also think of the WHERE command as filtering the data.

Common symbols used in WHERE statements include:

  1. > (greater than), < (less than)
  2. >= (greater than or equal to), <= (less than or equal to)
  3. = (equal to), != (not equal to)
SELECT *
FROM orders
WHERE account_id = 4251
ORDER BY occurred_at
LIMIT 1000;

The WHERE statement can also be used with non-numeric data. We can use the = and != operators here. You need to be sure to use single quotes (just be careful if you have quotes in the original text) with the text data, not double quotes.

Commonly when we are using WHERE with non-numeric data fields, we use the LIKE, NOT, or IN operators.

Arithmetic Operators

Derived Columns

This derived column, and its alias, are generally only temporary, existing just for the duration of your query. The next time you run a query and access this table, the new column will not be there.

If you are deriving the new column from existing columns using a mathematical expression, then these familiar mathematical operators will be useful:

  1. * (Multiplication)
  2. + (Addition)
  3. - (Subtraction)
  4. / (Division)

Consider this example:

SELECT id, (standard_amt_usd / total_amt_usd) * 100 AS std_percent
FROM orders
LIMIT 10;

Order of Operations

  1. Standard_qty / standard_qty + gloss_qty + poster_qty
  2. standard_qty / (standard_qty + gloss_qty + poster_qty)

Introduction to Logical Operators

  1. LIKE This allows you to perform operations similar to using WHERE and =, but for cases when you might not know exactly what you are looking for.
  2. IN This allows you to perform operations similar to using WHERE and =, but for more than one condition.
  3. NOT This is used with IN and LIKE to select all of the rows NOT LIKE or NOT IN a certain condition.
  4. AND & BETWEEN These allow you to combine operations where all combined conditions must be true.
  5. OR This allows you to combine operations where at least one of the combined conditions must be true.

LIKE

SELECT *
FROM orders
WHERE referrer_url LIKE '%google%' > 會找到「夾雜」著 google 的資料

IN

The IN operator is useful for working with both numeric and text columns. This operator allows you to use an =, but for more than one item of that particular column. We can check one, two or many column values for which we want to pull data, but all within the same query.

Expert Tip

SELECT name, primary_poc, sales_rep_id
FROM accounts
WHERE name IN ('Walmart', 'Target', 'Nordstrom');

NOT

AND

BETWEEN Operator

The BETWEEN operator in SQL is inclusive

Instead of writing :

WHERE column >= 6 AND column <= 10

we can instead write, equivalently:

WHERE column BETWEEN 6 AND 10

Example

SELECT name
FROM accounts
WHERE name NOT LIKE 'C%' AND name LIKE '%s';

OR

Lesson 2: SQL joins

Why Would We Want to Split Data Into Separate Tables?

  • This allows queriers to execute more quickly

Database Normalization

There are essentially three ideas that are aimed at database normalization:

  1. Are the tables storing logical groupings of the data?
  2. Can I make changes in a single location, rather than in many tables for the same information?
  3. Can I access and manipulate data quickly and efficiently?

However, most analysts are working with a database that was already set up with the necessary properties in place. As analysts of data, you don’t really need to think too much about data normalization. You just need to be able to pull the data from the database, so you can start making insights.

Introduction to JOINs(Inner JOIN)

  • 可以把它想成第二條 FROM 子句。FROM 子句拉的是第一張我們想要拉資料的 table,JOIN 則是第二張 Table。
  • It identifies the table where the data that we want to join lives.
  • 再來需要 ON:指出兩個表格之間的關係。The ON clause specifies the column on which you’d like to merge the two tables together.

The whole purpose of JOIN statements is to allow us to pull data from more than one table at a time. JOINs are useful for allowing us to pull data from multiple tables.

We use ON clause to specify a JOIN condition which is a logical statement to combine the table in FROM and JOIN statements.

SELECT table_column1, table_column2...
FROM table_name1
INNER JOIN table_name2
ON table_name1.column_name=table_name2.column_name;

例子

這是一個客戶資料表(customers)
而這是產品訂單的資料表 orders
此為查詢結果

ERD Reminder(entity relationship diagram)

後面課程會用到
  • PK: Primary key. A primary key exists in every table, and it is a column that has a unique value for every row.
  • If you look at the first few rows of any of the tables in our database, you will notice that this first, PK, column is always unique. For this database it is always called id, but that is not true of all databases.

Primary and Foreign Keys

Primary Key (PK)

Foreign Key (FK)

  1. region_id
  2. account_id
  3. sales_rep_id

Each of these is linked to the primary key of another table. An example is shown in the image below:

Primary — Foreign Key Link

  1. The region_id is the foreign key.
  2. The region_id is linked to id — this is the primary-foreign key link that connects these two tables.
  3. The crow’s foot shows that the FK can actually appear in many rows in the sales_reps table.
  4. While the single line is telling us that the PK shows that id appears only once per row in this table.

JOIN Revisited

SELECT orders.*
FROM orders
JOIN accounts
ON orders.account_id = accounts.id;

Here is the ERD for these two tables:

Notice

The way we join any two tables is in this way: linking the PK and FK (generally in an ON statement).

JOIN More than Two Tables

The Code

SELECT *
FROM web_events
JOIN accounts
ON web_events.account_id = accounts.id
JOIN orders
ON accounts.id = orders.account_id

Alternatively, we can create a SELECT statement that could pull specific columns from any of the three tables. Again, our JOIN holds a table, and ON is a link for our PK to equal the FK.

To pull specific columns, the SELECT statement will need to specify the table that you are wishing to pull the column from, as well as the column name. We could pull only three columns in the above by changing the select statement to the below, but maintaining the rest of the JOIN information:

SELECT web_events.channel, accounts.name, orders.total

We could continue this same process to link all of the tables if we wanted. For efficiency reasons, we probably don’t want to do this unless we actually need information from all of the tables.

Alias

SELECT o.*
a.*
FROM orders o # alias 空白鍵加縮寫
JOIN accounts a
ON o.account_id = a.id

When we JOIN tables together, it is nice to give each table an alias. Frequently an alias is just the first letter of the table name. You actually saw something similar for column names in the Arithmetic Operators concept.

Example:

FROM tablename AS t1
JOIN tablename2 AS t2

Before, you saw something like:

SELECT col1 + col2 AS total, col3

Frequently, you might also see these statements without the AS statement. Each of the above could be written in the following way instead, and they would still produce the exact same results:

FROM tablename t1
JOIN tablename2 t2

and

SELECT col1 + col2 total, col3

Aliases for Columns in Resulting Table

Example:

Select t1.column1 aliasname, t2.column2 aliasname2
FROM tablename AS t1
JOIN tablename2 AS t2

The alias name fields will be what shows up in the returned table instead of t1.column1 and t2.column2

Some code

SELECT r.name region, s.name rep, a.name account  # 把r.name這欄名字改成region
FROM sales_reps s
JOIN region r
ON s.region_id = r.id
JOIN accounts a
ON a.sales_rep_id = s.id
ORDER BY a.name;

LEFT and RIGHT JOINs

Inner join: only return rows that appear in both tables

If there is not matching information in the JOINed table, then you will have columns with empty cells. These empty cells introduce a new data type called NULL. You will learn about NULLs in detail in the next lesson, but for now you have a quick introduction as you can consider any cell without data as NULL.

如果出現尚未下訂單的客戶,inner join 就幫不上忙了
inner join 是中間的交集
left join
right join
right join result
left join 跟 right join 可以交換

JOIN Check In

INNER JOINs

Our new JOINs allow us to pull rows that might only exist in one of the two tables. This will introduce a new data type called NULL.

OUTER JOINS

Again this returns rows that do not match one another from the two tables. The use cases for a full outer join are very rare.

Similar to the above, you might see the language FULL OUTER JOIN, which is the same as OUTER JOIN.

JOINs and Filtering

  • Logic in the WHERE clause occurs after the join occurs

A simple rule to remember this is that, when the database executes this query, it executes the join and everything in the ON clause first. Think of this as building the new result set. That result set is then filtered using the WHERE clause.

The fact that this example is a left join is important. Because inner joins only return the rows for which the two tables match, moving this filter to the ON clause of an inner join will produce the same result as keeping it in the WHERE clause.

SELECT orders.*, account.*
FROM orders
LEFT JOIN accounts
ON orders.account_id = accounts.id
AND accounts.sales_rep_id = 321500

理科與藝術交織成靈魂的會計人,喜愛戲劇與攝影,但也喜歡資料科學。