Udacity Data Scientist Nanodegree : Prerequisite — SQL(L1, L2)
Lesson 1: Basic SQL / Lesson 2: SQL joins
Entity Relationship Diagrams
An entity relationship diagram (ERD) is a common way to view data in a database. Below is the ERD for the database we will use from Parch & Posey. These diagrams help you visualize the data you are analyzing including:
- The names of the tables.
- The columns in each table.
- The way the tables work together.
You can think of each of the boxes below as a spreadsheet.
What to Notice
In the Parch & Posey database there are five tables (essentially 5 spreadsheets):
You can think of each of these tables as an individual spreadsheet. Then the columns in each spreadsheet are listed below the table name. For example, the region table has two columns:
name. Alternatively the web_events table has four columns.
Why Do Data Analysts Use SQL?
- SQL is easy to understand.
- Traditional databases allow us to access data directly.
- Traditional databases allow us to audit and replicate our data.
- SQL is a great tool for analyzing multiple tables at once.
- SQL allows you to analyze more complex questions than dashboard tools like Google Analytics.
Video: How Databases Store Data
- Data in databases is stored in tables that can be thought of just like Excel spreadsheets.
- All the data in the same column must match in terms of data type.
- Consistent column types are one of the main reasons working with databases is fast.
Often databases hold a LOT of data. So, knowing that the columns are all of the same type of data means that obtaining data from a database can still be fast.
Each of these SQL databases may have subtle differences in syntax and available functions — for example, MySQL doesn’t have some of the functions for modifying dates as Postgres. Most of what you see with Postgres will be directly applicable to using SQL in other frameworks and database environments. For the differences that do exist, you should check the documentation. Most SQL environments have great documentation online that you can easily access with a quick Google search.
Types of Statements
The key to SQL is understanding statements. Statement tell the database what you’d like to do with the data. A few statements include:
- CREATE TABLE is a statement that creates a new table in a database.
- DROP TABLE is a statement that removes a table in a database.
- SELECT allows you to read data and display it. This is called a query.
SELECT & FROM
“SELECT” — 想要讀進什麼樣的資料？（where you put the columns for which you would like to show the data）
- What data do you want to pull from?
- Which elements from the database do you want to pull?
“From” — tells the query which table to use(What data to use?), where you put the tables from which you would like to pull data
“SELECT”, “FROM” these words are called clauses.
- SELECT indicates which column(s) you want to be given the data for.
- FROM specifies from which table(s) you want to select the columns. Notice the columns need to exist in this table.
If you want to be provided with the data from all columns in the table, you use “*”, like so:
- SELECT * FROM orders
Note that using SELECT does not create a new table with these columns in the database, it just provides the data to you as the results, or output, of this command.
SELECT, FROM 基本上是必須的。
It is common and best practice to capitalize all SQL commands, like SELECT and FROM, and keep everything else in your query lower case.
Depending on your SQL environment, your query may need a semicolon at the end to execute. Other environments are more flexible in terms of this being a “requirement.” It is considered best practice to put a semicolon at the end of each statement, which also allows you to run multiple queries at once if your environment allows this.
- The LIMIT statement is useful when you want to see just the first few rows of a table. This can be much faster for loading than if we load the entire dataset.
- The LIMIT command is always the very last part of a query. An example of showing just the first 10 rows of the orders table with all of the columns might look like the following:
The ORDER BY statement allows us to sort our results using the data in any column. Using ORDER BY in a SQL query only has temporary effects, for the results of that query, unlike sorting a sheet by column in Excel or Sheets.
In other words, when you use ORDER BY in a SQL query, your output will be sorted that way, but then the next query you run will encounter the unsorted data again.
The ORDER BY statement always comes in a query after the SELECT and FROM statements, but before the LIMIT statement. If you are using the LIMIT statement, it will always appear last. As you learn additional commands, the order of these statements will matter more.
DESC can be added after the column in your ORDER BY statement to sort in descending order, as the default is to sort in ascending order.(largest value first)
SELECT account_id, total_amt_usd
ORDER BY total_amt_usd DESC, account_id;
Solutions to previous ORDER BY Questions
- Write a query that displays the order ID, account ID, and total dollar amount for all the orders, sorted first by the account ID (in ascending order), and then by the total dollar amount (in descending order).
SELECT id, account_id, total_amt_usd
ORDER BY account_id, total_amt_usd DESC;
- Now write a query that again displays order ID, account ID, and total dollar amount for each order, but this time sorted first by total dollar amount (in descending order), and then by account ID (in ascending order).
SELECT id, account_id, total_amt_usd
ORDER BY total_amt_usd DESC, account_id;
Compare the results of these two queries above. How are the results different when you switch the column you sort on first?
In query #1, all of the orders for each account ID are grouped together, and then within each of those groupings, the orders appear from the greatest order amount to the least.
In query #2, since you sorted by the total dollar amount first, the orders appear from greatest to least regardless of which account ID they were from. Then they are sorted by account ID next. (The secondary sorting by account ID is difficult to see here, since only if there were two orders with equal total dollar amounts would there need to be any sorting by account ID.)
- allows you to filter a set of results based on specific criteria
Using the WHERE statement, we can display subsets of tables based on conditions that must be met. You can also think of the WHERE command as filtering the data.
Common symbols used in WHERE statements include:
>=(greater than or equal to),
<=(less than or equal to)
!=(not equal to)
WHERE account_id = 4251
ORDER BY occurred_at
The WHERE statement can also be used with non-numeric data. We can use the
!= operators here. You need to be sure to use single quotes (just be careful if you have quotes in the original text) with the text data, not double quotes.
Commonly when we are using WHERE with non-numeric data fields, we use the LIKE, NOT, or IN operators.
Creating a new column that is a combination of existing columns is known as a derived column (or “calculated” or “computed” column). Usually you want to give a name, or “alias,” to your new column using the AS keyword.
This derived column, and its alias, are generally only temporary, existing just for the duration of your query. The next time you run a query and access this table, the new column will not be there.
If you are deriving the new column from existing columns using a mathematical expression, then these familiar mathematical operators will be useful:
Consider this example:
SELECT id, (standard_amt_usd / total_amt_usd) * 100 AS std_percent
Order of Operations
The following two statements have very different end results:
- Standard_qty / standard_qty + gloss_qty + poster_qty
- standard_qty / (standard_qty + gloss_qty + poster_qty)
Introduction to Logical Operators
In the next concepts, you will be learning about Logical Operators. Logical Operators include:
- LIKE This allows you to perform operations similar to using WHERE and
=, but for cases when you might not know exactly what you are looking for.
- IN This allows you to perform operations similar to using WHERE and
=, but for more than one condition.
- NOT This is used with IN and LIKE to select all of the rows NOT LIKE or NOT IN a certain condition.
- AND & BETWEEN These allow you to combine operations where all combined conditions must be true.
- OR This allows you to combine operations where at least one of the combined conditions must be true.
The LIKE operator is extremely useful for working with text. You will use LIKE within a WHERE clause. The LIKE operator is frequently used with
% tells us that we might want any number of characters leading up to a particular set of characters or following a certain set of characters, as we saw with the google syntax above. Remember you will need to use single quotes for the text you pass to the LIKE operator, because of this lower and uppercase letters are not the same within the string. Searching for 'T' is not the same as searching for 't'. In other SQL environments (outside the classroom), you can use either single or double quotes.
WHERE referrer_url LIKE '%google%' > 會找到「夾雜」著 google 的資料
- Allows you to filter data based on several possible values
The IN operator is useful for working with both numeric and text columns. This operator allows you to use an
=, but for more than one item of that particular column. We can check one, two or many column values for which we want to pull data, but all within the same query.
In our Udacity SQL workspaces, note you can include an apostrophe by putting two single quotes together. For example, Macy’s in our workspace would be ‘Macy’’s’.
SELECT name, primary_poc, sales_rep_id
WHERE name IN ('Walmart', 'Target', 'Nordstrom');
he NOT operator is an extremely useful operator for working with the previous two operators we introduced: IN and LIKE. By specifying NOT LIKE or NOT IN, we can grab all of the rows that do not meet a particular criteria.
The AND operator is used within a WHERE statement to consider more than one logical clause at a time. Each time you link a new statement with an AND, you will need to specify the column you are interested in looking at. You may link as many statements as you would like to consider at the same time. This operator works with all of the operations we have seen so far including arithmetic operators (
/). LIKE, IN, and NOT logic can also be linked together using the AND operator.
Sometimes we can make a cleaner statement using BETWEEN than we can using AND. Particularly this is true when we are using the same column for different parts of our AND statement. In the previous video, we probably should have used BETWEEN.
The BETWEEN operator in SQL is inclusive
Instead of writing :
WHERE column >= 6 AND column <= 10
we can instead write, equivalently:
WHERE column BETWEEN 6 AND 10
WHERE name NOT LIKE 'C%' AND name LIKE '%s';
Similar to the AND operator, the OR operator can combine multiple statements. Each time you link a new statement with an OR, you will need to specify the column you are interested in looking at. You may link as many statements as you would like to consider at the same time. This operator works with all of the operations we have seen so far including arithmetic operators (
/), LIKE, IN, NOT, AND, and BETWEEN logic can all be linked together using the OR operator.
Lesson 2: SQL joins
Why Would We Want to Split Data Into Separate Tables?
- Orders and accounts(訂單和漲戶) are different types of objects
- This allows queriers to execute more quickly
When creating a database, it is really important to think about how data will be stored. This is known as normalization, and it is a huge part of most SQL classes. If you are in charge of setting up a new database, it is important to have a thorough understanding of database normalization.
There are essentially three ideas that are aimed at database normalization:
- Are the tables storing logical groupings of the data?
- Can I make changes in a single location, rather than in many tables for the same information?
- Can I access and manipulate data quickly and efficiently?
However, most analysts are working with a database that was already set up with the necessary properties in place. As analysts of data, you don’t really need to think too much about data normalization. You just need to be able to pull the data from the database, so you can start making insights.
Introduction to JOINs(Inner JOIN)
- 首先，他一樣需要 SELECT / FROM。
- 可以把它想成第二條 FROM 子句。FROM 子句拉的是第一張我們想要拉資料的 table，JOIN 則是第二張 Table。
- It identifies the table where the data that we want to join lives.
- 再來需要 ON：指出兩個表格之間的關係。The ON clause specifies the column on which you’d like to merge the two tables together.
The whole purpose of JOIN statements is to allow us to pull data from more than one table at a time. JOINs are useful for allowing us to pull data from multiple tables.
We use ON clause to specify a JOIN condition which is a logical statement to combine the table in
SELECT table_column1, table_column2...
INNER JOIN table_name2
ERD Reminder(entity relationship diagram)
- PK: Primary key. A primary key exists in every table, and it is a column that has a unique value for every row.
- If you look at the first few rows of any of the tables in our database, you will notice that this first, PK, column is always unique. For this database it is always called
id, but that is not true of all databases.
Primary and Foreign Keys
Primary Key (PK)
A primary key is a unique column in a particular table. This is the first column in each of our tables. Here, those columns are all called id, but that doesn’t necessarily have to be the name. It is common that the primary key is the first column in our tables in most databases.
Foreign Key (FK)
A foreign key is a column in one table that is a primary key in a different table. We can see in the Parch & Posey ERD that the foreign keys are:
Each of these is linked to the primary key of another table. An example is shown in the image below:
Primary — Foreign Key Link
In the above image you can see that:
- The region_id is the foreign key.
- The region_id is linked to id — this is the primary-foreign key link that connects these two tables.
- The crow’s foot shows that the FK can actually appear in many rows in the sales_reps table.
- While the single line is telling us that the PK shows that id appears only once per row in this table.
Let’s look back at the first JOIN you wrote.
ON orders.account_id = accounts.id;
Here is the ERD for these two tables:
Notice our SQL query has the two tables we would like to join — one in the FROM and the other in the JOIN. Then in the ON, we will ALWAYs have the PK equal to the FK:
The way we join any two tables is in this way: linking the PK and FK (generally in an ON statement).
JOIN More than Two Tables
This same logic can actually assist in joining more than two tables together. Look at the three tables below.
If we wanted to join all three of these tables, we could use the same logic. The code below pulls all of the data from all of the joined tables.
ON web_events.account_id = accounts.id
ON accounts.id = orders.account_id
Alternatively, we can create a SELECT statement that could pull specific columns from any of the three tables. Again, our JOIN holds a table, and ON is a link for our PK to equal the FK.
To pull specific columns, the SELECT statement will need to specify the table that you are wishing to pull the column from, as well as the column name. We could pull only three columns in the above by changing the select statement to the below, but maintaining the rest of the JOIN information:
SELECT web_events.channel, accounts.name, orders.total
We could continue this same process to link all of the tables if we wanted. For efficiency reasons, we probably don’t want to do this unless we actually need information from all of the tables.
FROM orders o # alias 空白鍵加縮寫
JOIN accounts a
ON o.account_id = a.id
When we JOIN tables together, it is nice to give each table an alias. Frequently an alias is just the first letter of the table name. You actually saw something similar for column names in the Arithmetic Operators concept.
FROM tablename AS t1
JOIN tablename2 AS t2
Before, you saw something like:
SELECT col1 + col2 AS total, col3
Frequently, you might also see these statements without the AS statement. Each of the above could be written in the following way instead, and they would still produce the exact same results:
FROM tablename t1
JOIN tablename2 t2
SELECT col1 + col2 total, col3
Aliases for Columns in Resulting Table
While aliasing tables is the most common use case. It can also be used to alias the columns selected to have the resulting table reflect a more readable name.
Select t1.column1 aliasname, t2.column2 aliasname2
FROM tablename AS t1
JOIN tablename2 AS t2
The alias name fields will be what shows up in the returned table instead of t1.column1 and t2.column2
SELECT r.name region, s.name rep, a.name account # 把r.name這欄名字改成region
FROM sales_reps s
JOIN region r
ON s.region_id = r.id
JOIN accounts a
ON a.sales_rep_id = s.id
ORDER BY a.name;
LEFT and RIGHT JOINs
Inner join: only return rows that appear in both tables
If there is not matching information in the JOINed table, then you will have columns with empty cells. These empty cells introduce a new data type called NULL. You will learn about NULLs in detail in the next lesson, but for now you have a quick introduction as you can consider any cell without data as NULL.
JOIN Check In
We have always pulled rows only if they exist as a match across two tables.
Our new JOINs allow us to pull rows that might only exist in one of the two tables. This will introduce a new data type called NULL.
This will return the inner join result set, as well as any unmatched rows from either of the two tables being joined.
Again this returns rows that do not match one another from the two tables. The use cases for a full outer join are very rare.
Similar to the above, you might see the language FULL OUTER JOIN, which is the same as OUTER JOIN.
JOINs and Filtering
- Logic in the ON clause reduces the roes before combining the tables
- Logic in the WHERE clause occurs after the join occurs
A simple rule to remember this is that, when the database executes this query, it executes the join and everything in the ON clause first. Think of this as building the new result set. That result set is then filtered using the WHERE clause.
The fact that this example is a left join is important. Because inner joins only return the rows for which the two tables match, moving this filter to the ON clause of an inner join will produce the same result as keeping it in the WHERE clause.
SELECT orders.*, account.*
LEFT JOIN accounts
ON orders.account_id = accounts.id
AND accounts.sales_rep_id = 321500