Udacity Data Scientist Nanodegree : Prerequisite — SQL(L1, L2)

Lesson 1: Basic SQL / Lesson 2: SQL joins

Entity Relationship Diagrams

An entity relationship diagram (ERD) is a common way to view data in a database. Below is the ERD for the database we will use from Parch & Posey. These diagrams help you visualize the data you are analyzing including:

  1. The columns in each table.
  2. The way the tables work together.

What to Notice

In the Parch & Posey database there are five tables (essentially 5 spreadsheets):

  1. accounts
  2. orders
  3. sales_reps
  4. region

Why Do Data Analysts Use SQL?

  • SQL is easy to understand.
  • Traditional databases allow us to access data directly.
  • Traditional databases allow us to audit and replicate our data.
  • SQL is a great tool for analyzing multiple tables at once.
  • SQL allows you to analyze more complex questions than dashboard tools like Google Analytics.

Video: How Databases Store Data

  1. Data in databases is stored in tables that can be thought of just like Excel spreadsheets.
  2. All the data in the same column must match in terms of data type.
  3. Consistent column types are one of the main reasons working with databases is fast.
    Often databases hold a LOT of data. So, knowing that the columns are all of the same type of data means that obtaining data from a database can still be fast.

Small Differences

Each of these SQL databases may have subtle differences in syntax and available functions — for example, MySQL doesn’t have some of the functions for modifying dates as Postgres. Most of what you see with Postgres will be directly applicable to using SQL in other frameworks and database environments. For the differences that do exist, you should check the documentation. Most SQL environments have great documentation online that you can easily access with a quick Google search.

Types of Statements

The key to SQL is understanding statements. Statement tell the database what you’d like to do with the data. A few statements include:

  1. DROP TABLE is a statement that removes a table in a database.
  2. SELECT allows you to read data and display it. This is called a query.

SELECT & FROM

“SELECT” — 想要讀進什麼樣的資料?(where you put the columns for which you would like to show the data)

  • Which elements from the database do you want to pull?
  1. FROM specifies from which table(s) you want to select the columns. Notice the columns need to exist in this table.

Semicolons

Depending on your SQL environment, your query may need a semicolon at the end to execute. Other environments are more flexible in terms of this being a “requirement.” It is considered best practice to put a semicolon at the end of each statement, which also allows you to run multiple queries at once if your environment allows this.

LIMIT

  • The LIMIT statement is useful when you want to see just the first few rows of a table. This can be much faster for loading than if we load the entire dataset.
  • The LIMIT command is always the very last part of a query. An example of showing just the first 10 rows of the orders table with all of the columns might look like the following:
SELECT *
FROM orders
LIMIT 10;

ORDER BY

The ORDER BY statement allows us to sort our results using the data in any column. Using ORDER BY in a SQL query only has temporary effects, for the results of that query, unlike sorting a sheet by column in Excel or Sheets.

Pro Tip

Remember DESC can be added after the column in your ORDER BY statement to sort in descending order, as the default is to sort in ascending order.(largest value first)

SELECT account_id, total_amt_usd
FROM orders
ORDER BY total_amt_usd DESC, account_id;

Solutions to previous ORDER BY Questions

  • Write a query that displays the order ID, account ID, and total dollar amount for all the orders, sorted first by the account ID (in ascending order), and then by the total dollar amount (in descending order).
SELECT id, account_id, total_amt_usd
FROM orders
ORDER BY account_id, total_amt_usd DESC;
SELECT id, account_id, total_amt_usd
FROM orders
ORDER BY total_amt_usd DESC, account_id;

WHERE

  • allows you to filter a set of results based on specific criteria
  1. >= (greater than or equal to), <= (less than or equal to)
  2. = (equal to), != (not equal to)
SELECT *
FROM orders
WHERE account_id = 4251
ORDER BY occurred_at
LIMIT 1000;

Arithmetic Operators

Derived Columns

Creating a new column that is a combination of existing columns is known as a derived column (or “calculated” or “computed” column). Usually you want to give a name, or “alias,” to your new column using the AS keyword.

  1. + (Addition)
  2. - (Subtraction)
  3. / (Division)
SELECT id, (standard_amt_usd / total_amt_usd) * 100 AS std_percent
FROM orders
LIMIT 10;

Order of Operations

The following two statements have very different end results:

  1. standard_qty / (standard_qty + gloss_qty + poster_qty)

Introduction to Logical Operators

In the next concepts, you will be learning about Logical Operators. Logical Operators include:

  1. IN This allows you to perform operations similar to using WHERE and =, but for more than one condition.
  2. NOT This is used with IN and LIKE to select all of the rows NOT LIKE or NOT IN a certain condition.
  3. AND & BETWEEN These allow you to combine operations where all combined conditions must be true.
  4. OR This allows you to combine operations where at least one of the combined conditions must be true.

LIKE

The LIKE operator is extremely useful for working with text. You will use LIKE within a WHERE clause. The LIKE operator is frequently used with %. The % tells us that we might want any number of characters leading up to a particular set of characters or following a certain set of characters, as we saw with the google syntax above. Remember you will need to use single quotes for the text you pass to the LIKE operator, because of this lower and uppercase letters are not the same within the string. Searching for 'T' is not the same as searching for 't'. In other SQL environments (outside the classroom), you can use either single or double quotes.

SELECT *
FROM orders
WHERE referrer_url LIKE '%google%' > 會找到「夾雜」著 google 的資料

IN

  • Allows you to filter data based on several possible values

Expert Tip

In our Udacity SQL workspaces, note you can include an apostrophe by putting two single quotes together. For example, Macy’s in our workspace would be ‘Macy’’s’.

SELECT name, primary_poc, sales_rep_id
FROM accounts
WHERE name IN ('Walmart', 'Target', 'Nordstrom');

NOT

he NOT operator is an extremely useful operator for working with the previous two operators we introduced: IN and LIKE. By specifying NOT LIKE or NOT IN, we can grab all of the rows that do not meet a particular criteria.

AND

The AND operator is used within a WHERE statement to consider more than one logical clause at a time. Each time you link a new statement with an AND, you will need to specify the column you are interested in looking at. You may link as many statements as you would like to consider at the same time. This operator works with all of the operations we have seen so far including arithmetic operators (+, *, -, /). LIKE, IN, and NOT logic can also be linked together using the AND operator.

BETWEEN Operator

Sometimes we can make a cleaner statement using BETWEEN than we can using AND. Particularly this is true when we are using the same column for different parts of our AND statement. In the previous video, we probably should have used BETWEEN.

WHERE column >= 6 AND column <= 10
WHERE column BETWEEN 6 AND 10
SELECT name
FROM accounts
WHERE name NOT LIKE 'C%' AND name LIKE '%s';

OR

Similar to the AND operator, the OR operator can combine multiple statements. Each time you link a new statement with an OR, you will need to specify the column you are interested in looking at. You may link as many statements as you would like to consider at the same time. This operator works with all of the operations we have seen so far including arithmetic operators (+, *, -, /), LIKE, IN, NOT, AND, and BETWEEN logic can all be linked together using the OR operator.

Lesson 2: SQL joins

Why Would We Want to Split Data Into Separate Tables?

  • Orders and accounts(訂單和漲戶) are different types of objects
  • This allows queriers to execute more quickly

Database Normalization

When creating a database, it is really important to think about how data will be stored. This is known as normalization, and it is a huge part of most SQL classes. If you are in charge of setting up a new database, it is important to have a thorough understanding of database normalization.

  1. Can I make changes in a single location, rather than in many tables for the same information?
  2. Can I access and manipulate data quickly and efficiently?

Introduction to JOINs(Inner JOIN)

  • 首先,他一樣需要 SELECT / FROM
  • 可以把它想成第二條 FROM 子句。FROM 子句拉的是第一張我們想要拉資料的 table,JOIN 則是第二張 Table。
  • It identifies the table where the data that we want to join lives.
  • 再來需要 ON:指出兩個表格之間的關係。The ON clause specifies the column on which you’d like to merge the two tables together.
SELECT table_column1, table_column2...
FROM table_name1
INNER JOIN table_name2
ON table_name1.column_name=table_name2.column_name;

例子

這是一個客戶資料表(customers)
而這是產品訂單的資料表 orders
此為查詢結果

ERD Reminder(entity relationship diagram)

後面課程會用到
  • If you look at the first few rows of any of the tables in our database, you will notice that this first, PK, column is always unique. For this database it is always called id, but that is not true of all databases.

Primary and Foreign Keys

Primary Key (PK)

A primary key is a unique column in a particular table. This is the first column in each of our tables. Here, those columns are all called id, but that doesn’t necessarily have to be the name. It is common that the primary key is the first column in our tables in most databases.

Foreign Key (FK)

A foreign key is a column in one table that is a primary key in a different table. We can see in the Parch & Posey ERD that the foreign keys are:

  1. account_id
  2. sales_rep_id

Primary — Foreign Key Link

In the above image you can see that:

  1. The region_id is linked to id — this is the primary-foreign key link that connects these two tables.
  2. The crow’s foot shows that the FK can actually appear in many rows in the sales_reps table.
  3. While the single line is telling us that the PK shows that id appears only once per row in this table.

JOIN Revisited

Let’s look back at the first JOIN you wrote.

SELECT orders.*
FROM orders
JOIN accounts
ON orders.account_id = accounts.id;

Notice

Notice our SQL query has the two tables we would like to join — one in the FROM and the other in the JOIN. Then in the ON, we will ALWAYs have the PK equal to the FK:

JOIN More than Two Tables

This same logic can actually assist in joining more than two tables together. Look at the three tables below.

The Code

If we wanted to join all three of these tables, we could use the same logic. The code below pulls all of the data from all of the joined tables.

SELECT *
FROM web_events
JOIN accounts
ON web_events.account_id = accounts.id
JOIN orders
ON accounts.id = orders.account_id
SELECT web_events.channel, accounts.name, orders.total

Alias

SELECT o.*
a.*
FROM orders o # alias 空白鍵加縮寫
JOIN accounts a
ON o.account_id = a.id
FROM tablename AS t1
JOIN tablename2 AS t2
SELECT col1 + col2 AS total, col3
FROM tablename t1
JOIN tablename2 t2
SELECT col1 + col2 total, col3

Aliases for Columns in Resulting Table

While aliasing tables is the most common use case. It can also be used to alias the columns selected to have the resulting table reflect a more readable name.

Select t1.column1 aliasname, t2.column2 aliasname2
FROM tablename AS t1
JOIN tablename2 AS t2

Some code

SELECT r.name region, s.name rep, a.name account  # 把r.name這欄名字改成region
FROM sales_reps s
JOIN region r
ON s.region_id = r.id
JOIN accounts a
ON a.sales_rep_id = s.id
ORDER BY a.name;

LEFT and RIGHT JOINs

Inner join: only return rows that appear in both tables

JOINs

如果出現尚未下訂單的客戶,inner join 就幫不上忙了
inner join 是中間的交集
left join
right join
right join result
left join 跟 right join 可以交換

JOIN Check In

INNER JOINs

We have always pulled rows only if they exist as a match across two tables.

OUTER JOINS

This will return the inner join result set, as well as any unmatched rows from either of the two tables being joined.

JOINs and Filtering

  • Logic in the ON clause reduces the roes before combining the tables
  • Logic in the WHERE clause occurs after the join occurs
SELECT orders.*, account.*
FROM orders
LEFT JOIN accounts
ON orders.account_id = accounts.id
AND accounts.sales_rep_id = 321500

理科與藝術交織成靈魂的會計人,喜愛戲劇與攝影,但也喜歡資料科學。