Udacity Data Scientist Nanodegree : Prerequisite — Python(L7)

Lesson 7: Pandas

Intro

Why use Pandas?

Pandas Series

import pandas as pd   # pd - convention
import pandas as pd# We create a Pandas Series that stores a grocery list
groceries = pd.Series(data = [30, 6, 'Yes', 'No'], index = ['eggs', 'apples', 'milk', 'bread'])
# We display the Groceries Pandas Series
groceries

eggs 30
apples 6
milk Yes
bread No
dtype: object

# We print some information about Groceries
print('Groceries has shape:', groceries.shape)
print('Groceries has dimension:', groceries.ndim)
print('Groceries has a total of', groceries.size, 'elements')

Groceries has shape: (4,)
Groceries has dimension: 1
Groceries has a total of 4 elements

# We print the index and data of Groceries
print('The data in Groceries is:', groceries.values)
print('The index of Groceries is:', groceries.index)

The data in Groceries is: [30 6 ‘Yes’ ‘No’]
The index of Groceries is: Index([‘eggs’, ‘apples’, ‘milk’, ‘bread’], dtype=’object’)

# We check whether bananas is a food item (an index) in Groceries
x = 'bananas' in groceries
# We check whether bread is a food item (an index) in Groceries
y = 'bread' in groceries
# We print the results
print('Is bananas an index label in Groceries:', x)
print('Is bread an index label in Groceries:', y)

Is bananas an index label in Groceries: False
Is bread an index label in Groceries: True

Accessing and Deleting Elements in Pandas Series

# We access elements in Groceries using index labels:# We use a single index label
print('How many eggs do we need to buy:', groceries['eggs'])
# we can access multiple index labels
print('Do we need milk and bread:\n', groceries[['milk', 'bread']])
# we use loc to access multiple index labels
print('How many eggs and apples do we need to buy:\n', groceries.loc[['eggs', 'apples']])
# We access elements in Groceries using numerical indices:# we use multiple numerical indices
print('How many eggs and apples do we need to buy:\n', groceries[[0, 1]])
# We use a negative numerical index
print('Do we need bread:\n', groceries[[-1]])
# We use a single numerical index
print('How many eggs do we need to buy:', groceries[0])
# we use iloc to access multiple numerical indices
print('Do we need milk and bread:\n', groceries.iloc[[2, 3]])

How many eggs do we need to buy: 30

Do we need milk and bread:
milk Yes
bread No
dtype: object

How many eggs and apples do we need to buy:
eggs 30
apples 6
dtype: object

How many eggs and apples do we need to buy:
eggs 30
apples 6
dtype: object

Do we need bread:
bread No
dtype: object

How many eggs do we need to buy: 30

Do we need milk and bread:
milk Yes
bread No
dtype: object

# We display the original grocery list
print('Original Grocery List:\n', groceries)
# We change the number of eggs to 2
groceries['eggs'] = 2
# We display the changed grocery list
print()
print('Modified Grocery List:\n', groceries)

Original Grocery List:
eggs 30
apples 6
milk Yes
bread No
dtype: object

Modified Grocery List:
eggs 2
apples 6
milk Yes
bread No
dtype: object

# We display the original grocery list
print('Original Grocery List:\n', groceries)
# We remove apples from our grocery list. The drop function removes elements out of place
print('We remove apples (out of place):\n', groceries.drop('apples'))
# When we remove elements out of place the original Series remains intact. To see this
# we display our grocery list again
print('Grocery List after removing apples out of place:\n', groceries)

Original Grocery List:
eggs 30
apples 6
milk Yes
bread No
dtype: object

We remove apples (out of place):
eggs 30
milk Yes
bread No
dtype: object

Grocery List after removing apples out of place:
eggs 30
apples 6
milk Yes
bread No
dtype: object

# We display the original grocery list
print('Original Grocery List:\n', groceries)
# We remove apples from our grocery list in place by setting the inplace keyword to True
groceries.drop('apples', inplace = True)
# When we remove elements in place the original Series its modified. To see this
# we display our grocery list again
print()
print('Grocery List after removing apples in place:\n', groceries)

Original Grocery List:
eggs 30
apples 6
milk Yes
bread No
dtype: object

Grocery List after removing apples in place:
eggs 30
milk Yes
bread No
dtype: object

Arithmetic Operations on Pandas Series

# We create a Pandas Series that stores a grocery list of just fruits
fruits= pd.Series(data = [10, 6, 3,], index = ['apples', 'oranges', 'bananas'])
# We display the fruits Pandas Series
fruits

apples 10
oranges 6
bananas 3
dtype: int64

# We print fruits for reference
print('Original grocery list of fruits:\n ', fruits)
# We perform basic element-wise operations using arithmetic symbols
print('fruits + 2:\n', fruits + 2) # We add 2 to each item in fruits
print('fruits - 2:\n', fruits - 2) # We subtract 2 to each item in fruits
print('fruits * 2:\n', fruits * 2) # We multiply each item in fruits by 2
print('fruits / 2:\n', fruits / 2) # We divide each item in fruits by 2

Original grocery list of fruits:
apples 10
oranges 6
bananas 3
dtype: int64

fruits + 2:
apples 12
oranges 8
bananas 5
dtype: int64

fruits — 2:
apples 8
oranges 4
bananas 1
dtype: int64

fruits * 2:
apples 20
oranges 12
bananas 6
dtype: int64

fruits / 2:
apples 5.0
oranges 3.0
bananas 1.5
dtype: float64

# We import NumPy as np to be able to use the mathematical functions
import numpy as np
# We print fruits for reference
print('Original grocery list of fruits:\n', fruits)
# We apply different mathematical functions to all elements of fruits
print('EXP(X) = \n', np.exp(fruits))
print('SQRT(X) =\n', np.sqrt(fruits))
print('POW(X,2) =\n',np.power(fruits,2)) # We raise all elements of fruits to the power of 2

Original grocery list of fruits:
apples 10
oranges 6
bananas 3
dtype: int64

EXP(X) =
apples 22026.465795
oranges 403.428793
bananas 20.085537
dtype: float64

SQRT(X) =
apples 3.162278
oranges 2.449490
bananas 1.732051
dtype: float64

POW(X,2) =
apples 100
oranges 36
bananas 9
dtype: int64

# We print fruits for reference
print('Original grocery list of fruits:\n ', fruits)
# We add 2 only to the bananas
print('Amount of bananas + 2 = ', fruits['bananas'] + 2)
# We subtract 2 from apples
print('Amount of apples - 2 = ', fruits.iloc[0] - 2)
# We multiply apples and oranges by 2
print('We double the amount of apples and oranges:\n', fruits[['apples', 'oranges']] * 2)
# We divide apples and oranges by 2
print('We half the amount of apples and oranges:\n', fruits.loc[['apples', 'oranges']] / 2)

Original grocery list of fruits:
apples 10
oranges 6
bananas 3
dtype: int64

Amount of bananas + 2 = 5

Amount of apples — 2 = 8

We double the amount of apples and oranges:
apples 20
oranges 12
dtype: int64

We half the amount of apples and oranges:
apples 5.0
oranges 3.0
dtype: float64

# We multiply our grocery list by 2
groceries * 2

eggs 60
apples 12
milk YesYes
bread NoNo
dtype: object

Creating Pandas DataFrames

# We import Pandas as pd into Python
import pandas as pd
# We create a dictionary of Pandas Series
items = {'Bob' : pd.Series(data = [245, 25, 55], index = ['bike', 'pants', 'watch']),
'Alice' : pd.Series(data = [40, 110, 500, 45], index = ['book', 'glasses', 'bike', 'pants'])}
# We print the type of items to see that it is a dictionary
print(type(items))

class ‘dict’

# We create a Pandas DataFrame by passing it a dictionary of Pandas Series
shopping_carts = pd.DataFrame(items)
# We display the DataFrame
shopping_carts

# 會回傳表格 — row: index label, column: keys of the dict

# We create a dictionary of Pandas Series without indexes
data = {'Bob' : pd.Series([245, 25, 55]),
'Alice' : pd.Series([40, 110, 500, 45])}
# We create a DataFrame
df = pd.DataFrame(data)
# We display the DataFrame
df
# We print some information about shopping_carts
print('shopping_carts has shape:', shopping_carts.shape)
print('shopping_carts has dimension:', shopping_carts.ndim)
print('shopping_carts has a total of:', shopping_carts.size, 'elements')
print()
print('The data in shopping_carts is:\n', shopping_carts.values)
print()
print('The row index in shopping_carts is:', shopping_carts.index)
print()
print('The column index in shopping_carts is:', shopping_carts.columns)

shopping_carts has shape: (5, 2)
shopping_carts has dimension: 2
shopping_carts has a total of: 10 elements

The data in shopping_carts is:
[[ 500. 245.]
[ 40. nan]
[ 110. nan]
[ 45. 25.]
[ nan 55.]]

The row index in shopping_carts is: Index([‘bike’, ‘book’, ‘glasses’, ‘pants’, ‘watch’], dtype=’object’)

The column index in shopping_carts is: Index([‘Alice’, ‘Bob’], dtype=’object’)

# We Create a DataFrame that only has Bob's data
bob_shopping_cart = pd.DataFrame(items, columns=['Bob'])
# We display bob_shopping_cart
bob_shopping_cart
# We Create a DataFrame that only has selected items for both Alice and Bob
sel_shopping_cart = pd.DataFrame(items, index = ['pants', 'book'])
# We display sel_shopping_cart
sel_shopping_cart
# We Create a DataFrame that only has selected items for Alice
alice_sel_shopping_cart = pd.DataFrame(items, index = ['glasses', 'bike'], columns = ['Alice'])
# We display alice_sel_shopping_cart
alice_sel_shopping_cart
# We create a dictionary of lists (arrays)
data = {'Integers' : [1,2,3],
'Floats' : [4.5, 8.2, 9.6]}
# We create a DataFrame
df = pd.DataFrame(data)
# We display the DataFrame
df

Notice that since the data dictionary we created doesn't have label indices, Pandas automatically uses numerical row indexes when it creates the DataFrame. We can, however, put labels to the row index by using the index keyword in the pd.DataFrame() function. Let's see an example

# We create a dictionary of lists (arrays)
data = {'Integers' : [1,2,3],
'Floats' : [4.5, 8.2, 9.6]}
# We create a DataFrame and provide the row index
df = pd.DataFrame(data, index = ['label 1', 'label 2', 'label 3'])
# We display the DataFrame
df
# We create a list of Python dictionaries
items2 = [{'bikes': 20, 'pants': 30, 'watches': 35},
{'watches': 10, 'glasses': 50, 'bikes': 15, 'pants':5}]
# We create a DataFrame
store_items = pd.DataFrame(items2)
# We display the DataFrame
store_items
# We create a list of Python dictionaries
items2 = [{'bikes': 20, 'pants': 30, 'watches': 35},
{'watches': 10, 'glasses': 50, 'bikes': 15, 'pants':5}]
# We create a DataFrame and provide the row index
store_items = pd.DataFrame(items2, index = ['store 1', 'store 2'])
# We display the DataFrame
store_items

Accessing Elements in Pandas DataFrames

# We print the store_items DataFrame
print(store_items)
# We access rows, columns and elements using labels
print('How many bikes are in each store:\n', store_items[['bikes']])
print('How many bikes and pants are in each store:\n', store_items[['bikes', 'pants']])
print('What items are in Store 1:\n', store_items.loc[['store 1']])
print('How many bikes are in Store 2:', store_items['bikes']['store 2'])

How many bikes are in each store:

How many bikes and pants are in each store:

What items are in Store 1:

How many bikes are in Store 2: 15

# We add a new column named shirts to our store_items DataFrame indicating the number of
# shirts in stock at each store. We will put 15 shirts in store 1 and 2 shirts in store 2
store_items['shirts'] = [15,2]
# We display the modified DataFrame
store_items
# We make a new column called suits by adding the number of shirts and pants
store_items['suits'] = store_items['pants'] + store_items['shirts']
# We display the modified DataFrame
store_items
# We create a dictionary from a list of Python dictionaries that will contain the number of different items at the new store
new_items = [{'bikes': 20, 'pants': 30, 'watches': 35, 'glasses': 4}]
# We create new DataFrame with the new_items and provide and index labeled store 3
new_store = pd.DataFrame(new_items, index = ['store 3'])
# We display the items at the new store
new_store
# We append store 3 to our store_items DataFrame
store_items = store_items.append(new_store)
# We display the modified DataFrame
store_items
# We add a new column using data from particular rows in the watches column
store_items['new watches'] = store_items['watches'][1:]
# We display the modified DataFrame
store_items
# We insert a new column with label shoes right before the column with numerical index 4
store_items.insert(4, 'shoes', [8,5,0])
# we display the modified DataFrame
store_items
# We remove the new watches column
store_items.pop('new watches')
# We remove the watches and shoes columns
store_items = store_items.drop(['watches', 'shoes'], axis = 1)
# We remove the store 2 and store 1 rows
store_items = store_items.drop(['store 2', 'store 1'], axis = 0)
# We change the column label bikes to hats
store_items = store_items.rename(columns = {'bikes': 'hats'})
# We change the row label from store 3 to last store
store_items = store_items.rename(index = {'store 3': 'last store'})
# We change the row index to be the data in the pants column
store_items = store_items.set_index('pants')

Dealing with NaN

# We create a list of Python dictionaries
items2 = [{'bikes': 20, 'pants': 30, 'watches': 35, 'shirts': 15, 'shoes':8, 'suits':45},
{'watches': 10, 'glasses': 50, 'bikes': 15, 'pants':5, 'shirts': 2, 'shoes':5, 'suits':7},
{'bikes': 20, 'pants': 30, 'watches': 35, 'glasses': 4, 'shoes':10}]
# We create a DataFrame and provide the row index
store_items = pd.DataFrame(items2, index = ['store 1', 'store 2', 'store 3'])
# We display the DataFrame
store_items
# We count the number of NaN values in store_items
x = store_items.isnull().sum().sum()
# We print x
print('Number of NaN values in our DataFrame:', x)

Number of NaN values in our DataFrame: 3

store_items.isnull()
store_items.isnull().sum()

bikes 0
glasses 1
pants 0
shirts 1
shoes 0
suits 1
watches 0
dtype: int64

# We print the number of non-NaN values in our DataFrame
print('Number of non-NaN values in the columns of our DataFrame:\n', store_items.count())

Number of non-NaN values in the columns of our DataFrame:
bikes 3
glasses 2
pants 3
shirts 2
shoes 3
suits 2
watches 3
dtype: int64

Eliminating NaN Values

Tip: Remember, you learned that you can read axis = 0 as "down" and axis = 1 as "across" the given Numpy ndarray or Pandas dataframe object.

# We drop any rows with NaN values
store_items.dropna(axis = 0)
# We drop any columns with NaN values
store_items.dropna(axis = 1)
# We replace all NaN values with 0
store_items.fillna(0)
# We replace NaN values with the previous value in the column
store_items.fillna(method = 'ffill', axis = 0)
# We replace NaN values with the previous value in the row
store_items.fillna(method = 'ffill', axis = 1)
# We replace NaN values with the next value in the column
store_items.fillna(method = 'backfill', axis = 0)
# We replace NaN values with the next value in the row
store_items.fillna(method = 'backfill', axis = 1)
# We replace NaN values by using linear interpolation using column values
store_items.interpolate(method = 'linear', axis = 0)
# We replace NaN values by using linear interpolation using row values
store_items.interpolate(method = 'linear', axis = 1)

Loading Data into a pandas DataFrame

# We load Google stock data in a DataFrame
Google_stock = pd.read_csv('./GOOG.csv')

# We print some information about Google_stock
print('Google_stock is of type:', type(Google_stock))
print('Google_stock has shape:', Google_stock.shape)
Google_stock.head()
Google_stock.tail()
Google_stock.isnull().any()
# We get descriptive statistics on our stock data
Google_stock.describe()
# We get descriptive statistics on a single column of our DataFrame
Google_stock['Adj Close'].describe()
# We print information about our DataFrame  
print()
print('Maximum values of each column:\n', Google_stock.max())
print()
print('Minimum Close value:', Google_stock['Close'].min())
print()
print('Average value of each column:\n', Google_stock.mean())
# We display the correlation between columns
Google_stock.corr()
# We load fake Company data in a DataFrame
data = pd.read_csv('./fake_company.csv')

data
# We display the total amount of money spent in salaries each year
data.groupby(['Year'])['Salary'].sum()

Year
1990 153000
1991 162000
1992 174000
Name: Salary, dtype: int64

# We display the average salary per year
data.groupby(['Year'])['Salary'].mean()

Year
1990 51000
1991 54000
1992 58000
Name: Salary, dtype: int64

# We display the total salary each employee received in all the years they worked for the company
data.groupby(['Name'])['Salary'].sum()

Name
Alice 162000
Bob 150000
Charlie 177000
Name: Salary, dtype: int64

--

--

理科與藝術交織成靈魂的會計人,喜愛戲劇與攝影,但也喜歡資料科學。

Love podcasts or audiobooks? Learn on the go with our new app.

Get the Medium app

A button that says 'Download on the App Store', and if clicked it will lead you to the iOS App store
A button that says 'Get it on, Google Play', and if clicked it will lead you to the Google Play store
Joe Chao

Joe Chao

3 Followers

理科與藝術交織成靈魂的會計人,喜愛戲劇與攝影,但也喜歡資料科學。