Udacity Data Scientist Nanodegree: Prerequisite — Python (L7)
Intro
- Pandas incorporates two additional data structures into Python, namely Pandas Series and Pandas DataFrame. These data structures allow us to work with labeled and relational data in an easy and intuitive manner.
- Pandas Documentation
Why use Pandas?
- One very important step in machine learning is to look at your data first and make sure it is well suited for your training algorithm by doing some basic data analysis. This is where Pandas comes in.
- Pandas is built on top of NumPy.
Pandas Series
A Pandas series is a one-dimensional array-like object that can hold many data types, such as numbers or strings, and has an option to provide axis labels.
Difference between NumPy ndarrays and Pandas Series
- One of the main differences between Pandas Series and NumPy ndarrays is that you can assign an index label to each element in the Pandas Series. In other words, you can name the indices of your Pandas Series anything you want.
- Another big difference between Pandas Series and NumPy ndarrays is that Pandas Series can hold data of different data types, as the short sketch below illustrates.
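Here is a short sketch of both differences (illustrative only, with made-up values that are not part of the original notes):
import numpy as np
import pandas as pd

# A NumPy ndarray with mixed types gets coerced to a single dtype (here, a string dtype)
arr = np.array([30, 'Yes'])
print(arr.dtype)        # a Unicode string dtype such as <U21: the integer 30 was coerced to a string

# A Pandas Series keeps the original Python objects and lets us name the indices
ser = pd.Series(data = [30, 'Yes'], index = ['eggs', 'milk'])
print(ser['eggs'] + 5)  # 35: the integer is still an integer, and we accessed it by its label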
import pandas as pd # pd - convention
You can create a Pandas Series by using the command pd.Series(data, index), where index is a list of index labels.
Example 1 — Create a Series
import pandas as pd

# We create a Pandas Series that stores a grocery list
groceries = pd.Series(data = [30, 6, 'Yes', 'No'], index = ['eggs', 'apples', 'milk', 'bread'])

# We display the Groceries Pandas Series
groceries
eggs 30
apples 6
milk Yes
bread No
dtype: object
Pandas Series have attributes that allow us to get information from the series in an easy way. Let’s see some of them:
Example 2 — Print attributes — shape, ndim and size
# We print some information about Groceries
print('Groceries has shape:', groceries.shape)
print('Groceries has dimension:', groceries.ndim)
print('Groceries has a total of', groceries.size, 'elements')
Groceries has shape: (4,)
Groceries has dimension: 1
Groceries has a total of 4 elements
We can also print the index labels and the data of the Pandas Series separately. This is useful if you don’t happen to know what the index labels of the Pandas Series are.
Example 3 — Print attributes — values, and index
# We print the index and data of Groceries
print('The data in Groceries is:', groceries.values)
print('The index of Groceries is:', groceries.index)
The data in Groceries is: [30 6 'Yes' 'No']
The index of Groceries is: Index(['eggs', 'apples', 'milk', 'bread'], dtype='object')
If you are dealing with a very large Pandas Series and you are not sure whether an index label exists, you can check by using the in keyword.
Example 4 — Check if an index is available in the given Series
# We check whether bananas is a food item (an index) in Groceries
x = 'bananas' in groceries

# We check whether bread is a food item (an index) in Groceries
y = 'bread' in groceries

# We print the results
print('Is bananas an index label in Groceries:', x)
print('Is bread an index label in Groceries:', y)
Is bananas an index label in Groceries: False
Is bread an index label in Groceries: True
Accessing and Deleting Elements in Pandas Series
- Elements can be accessed using index labels or numerical indices inside square brackets, [ ], similar to how we access elements in NumPy ndarrays.
- Since we can access elements in various ways, in order to remove any ambiguity as to whether we are referring to an index label or a numerical index, Pandas Series have two attributes, .loc and .iloc, to explicitly state what we mean.
- The attribute .loc stands for location and is used to explicitly state that we are using a labeled index. Similarly, the attribute .iloc stands for integer location and is used to explicitly state that we are using a numerical index. Let's see some examples:
Example 1. Access elements using index labels
# We access elements in Groceries using index labels:
# We use a single index label
print('How many eggs do we need to buy:', groceries['eggs'])
# We can access multiple index labels
print('Do we need milk and bread:\n', groceries[['milk', 'bread']])
# We use loc to access multiple index labels
print('How many eggs and apples do we need to buy:\n', groceries.loc[['eggs', 'apples']])

# We access elements in Groceries using numerical indices:
# We use multiple numerical indices
print('How many eggs and apples do we need to buy:\n', groceries[[0, 1]])
# We use a negative numerical index
print('Do we need bread:\n', groceries[[-1]])
# We use a single numerical index
print('How many eggs do we need to buy:', groceries[0])
# We use iloc to access multiple numerical indices
print('Do we need milk and bread:\n', groceries.iloc[[2, 3]])
How many eggs do we need to buy: 30
Do we need milk and bread:
milk Yes
bread No
dtype: object
How many eggs and apples do we need to buy:
eggs 30
apples 6
dtype: object
How many eggs and apples do we need to buy:
eggs 30
apples 6
dtype: object
Do we need bread:
bread No
dtype: object
How many eggs do we need to buy: 30
Do we need milk and bread:
milk Yes
bread No
dtype: object
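One subtle case is worth a quick sketch (illustrative only, with made-up values that are not part of the original notes): when the index labels are themselves integers, plain square brackets can be ambiguous, which is exactly the situation .loc and .iloc resolve.
# A Series whose integer labels do not match the element positions
s = pd.Series(data = ['a', 'b', 'c'], index = [10, 20, 30])

print(s.loc[10])   # 'a', label-based lookup
print(s.iloc[0])   # 'a', position-based lookup
print(s.iloc[1])   # 'b', position 1, even though no label equals 1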
Pandas Series are also mutable like NumPy ndarrays, which means we can change the elements of a Pandas Series after it has been created. For example, let’s change the number of eggs we need to buy from our grocery list
Example 2. Mutate elements using index labels
# We display the original grocery list
print('Original Grocery List:\n', groceries)

# We change the number of eggs to 2
groceries['eggs'] = 2

# We display the changed grocery list
print()
print('Modified Grocery List:\n', groceries)
Original Grocery List:
eggs 30
apples 6
milk Yes
bread No
dtype: object
Modified Grocery List:
eggs 2
apples 6
milk Yes
bread No
dtype: object
We can also delete items from a Pandas Series by using the .drop() method. The Series.drop(label) method removes the given label from the given Series. We should note that the Series.drop(label) method drops elements from the Series out of place, meaning that it doesn't modify the original Series. Let's see how this works:
Example 3. Delete elements out-of-place using drop()
# We display the original grocery list
print('Original Grocery List:\n', groceries)

# We remove apples from our grocery list. The drop function removes elements out of place
print('We remove apples (out of place):\n', groceries.drop('apples'))

# When we remove elements out of place the original Series remains intact. To see this
# we display our grocery list again
print('Grocery List after removing apples out of place:\n', groceries)
Original Grocery List:
eggs 30
apples 6
milk Yes
bread No
dtype: object
We remove apples (out of place):
eggs 30
milk Yes
bread No
dtype: object
Grocery List after removing apples out of place:
eggs 30
apples 6
milk Yes
bread No
dtype: object
We can delete items from a Pandas Series in place by setting the keyword inplace to True in the .drop() method. Let's see an example:
Example 4. Delete elements in-place using drop()
# We display the original grocery list
print('Original Grocery List:\n', groceries)

# We remove apples from our grocery list in place by setting the inplace keyword to True
groceries.drop('apples', inplace = True)

# When we remove elements in place the original Series is modified. To see this
# we display our grocery list again
print()
print('Grocery List after removing apples in place:\n', groceries)
Original Grocery List:
eggs 30
apples 6
milk Yes
bread No
dtype: object
Grocery List after removing apples in place:
eggs 30
milk Yes
bread No
dtype: object
Arithmetic Operations on Pandas Series
# We create a Pandas Series that stores a grocery list of just fruits
fruits = pd.Series(data = [10, 6, 3], index = ['apples', 'oranges', 'bananas'])

# We display the fruits Pandas Series
fruits
apples 10
oranges 6
bananas 3
dtype: int64
We can now modify the data in fruits by performing basic arithmetic operations. Let’s see some examples
Example 1. Element-wise basic arithmetic operations
# We print fruits for reference
print('Original grocery list of fruits:\n ', fruits)

# We perform basic element-wise operations using arithmetic symbols
print('fruits + 2:\n', fruits + 2) # We add 2 to each item in fruits
print('fruits - 2:\n', fruits - 2) # We subtract 2 from each item in fruits
print('fruits * 2:\n', fruits * 2) # We multiply each item in fruits by 2
print('fruits / 2:\n', fruits / 2) # We divide each item in fruits by 2
Original grocery list of fruits:
apples 10
oranges 6
bananas 3
dtype: int64
fruits + 2:
apples 12
oranges 8
bananas 5
dtype: int64
fruits - 2:
apples 8
oranges 4
bananas 1
dtype: int64
fruits * 2:
apples 20
oranges 12
bananas 6
dtype: int64
fruits / 2:
apples 5.0
oranges 3.0
bananas 1.5
dtype: float64
You can also apply mathematical functions from NumPy, such as sqrt(x), to all elements of a Pandas Series.
Example 2. Use mathematical functions from NumPy to operate on Series
# We import NumPy as np to be able to use the mathematical functions
import numpy as np

# We print fruits for reference
print('Original grocery list of fruits:\n', fruits)

# We apply different mathematical functions to all elements of fruits
print('EXP(X) = \n', np.exp(fruits))
print('SQRT(X) =\n', np.sqrt(fruits))
print('POW(X,2) =\n',np.power(fruits,2)) # We raise all elements of fruits to the power of 2
Original grocery list of fruits:
apples 10
oranges 6
bananas 3
dtype: int64
EXP(X) =
apples 22026.465795
oranges 403.428793
bananas 20.085537
dtype: float64
SQRT(X) =
apples 3.162278
oranges 2.449490
bananas 1.732051
dtype: float64
POW(X,2) =
apples 100
oranges 36
bananas 9
dtype: int64
Pandas also allows us to only apply arithmetic operations on selected items in our fruits grocery list. Let’s see some examples
Example 3. Perform arithmetic operations on selected elements
# We print fruits for reference
print('Original grocery list of fruits:\n ', fruits)

# We add 2 only to the bananas
print('Amount of bananas + 2 = ', fruits['bananas'] + 2)

# We subtract 2 from apples
print('Amount of apples - 2 = ', fruits.iloc[0] - 2)

# We multiply apples and oranges by 2
print('We double the amount of apples and oranges:\n', fruits[['apples', 'oranges']] * 2)

# We divide apples and oranges by 2
print('We half the amount of apples and oranges:\n', fruits.loc[['apples', 'oranges']] / 2)
Original grocery list of fruits:
apples 10
oranges 6
bananas 3
dtype: int64
Amount of bananas + 2 = 5
Amount of apples - 2 = 8
We double the amount of apples and oranges:
apples 20
oranges 12
dtype: int64
We half the amount of apples and oranges:
apples 5.0
oranges 3.0
dtype: float64
You can also apply arithmetic operations on Pandas Series of mixed data types, provided that the arithmetic operation is defined for all data types in the Series; otherwise, you will get an error. Let's see what happens when we multiply our grocery list by 2.
Example 4. Perform multiplication on a Series having integer and string elements
# We multiply our grocery list by 2
groceries * 2
eggs 60
apples 12
milk YesYes
bread NoNo
dtype: object
Make sure the arithmetic operations are valid on all the data types of your elements.
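For instance, here is a minimal sketch (illustrative only, not part of the original notes) of the error you get when an operation is not defined for every data type in the Series: division works for the integer element but not for the string, so Pandas raises a TypeError.
import pandas as pd

# A small mixed-type Series: division is defined for the integer but not for the string
mixed = pd.Series(data = [30, 'Yes'], index = ['eggs', 'milk'])

try:
    mixed / 2
except TypeError as error:
    print('Division failed:', error)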
Creating Pandas DataFrames
Pandas DataFrames are two-dimensional data structures with labeled rows and columns, that can hold many data types. If you are familiar with Excel, you can think of Pandas DataFrames as being similar to a spreadsheet. We can create Pandas DataFrames manually or by loading data from a file.
Create a DataFrame manually
We will start by creating a DataFrame manually from a dictionary of Pandas Series. It is a two-step process:
- The first step is to create the dictionary of Pandas Series.
- After the dictionary is created, we can then pass the dictionary to the pd.DataFrame() function.
We will create a dictionary that contains items purchased by two people, Alice and Bob, from an online store. The Pandas Series will use the price of the items purchased as data, and the purchased items will be used as the index labels of the Pandas Series. Let's see how this is done in code:
# We import Pandas as pd into Python
import pandas as pd

# We create a dictionary of Pandas Series
items = {'Bob' : pd.Series(data = [245, 25, 55], index = ['bike', 'pants', 'watch']),
         'Alice' : pd.Series(data = [40, 110, 500, 45], index = ['book', 'glasses', 'bike', 'pants'])}

# We print the type of items to see that it is a dictionary
print(type(items))
<class 'dict'>
Now that we have a dictionary, we are ready to create a DataFrame by passing it to the pd.DataFrame() function. We will create a DataFrame that could represent the shopping carts of various users; in this case, we have only two users, Alice and Bob.
Example 1. Create a DataFrame using a dictionary of Series.
# We create a Pandas DataFrame by passing it a dictionary of Pandas Series
shopping_carts = pd.DataFrame(items)

# We display the DataFrame
shopping_carts
# This returns a table: rows are the index labels, columns are the keys of the dictionary
There are several things to notice here, as explained below:
- We see that DataFrames are displayed in tabular form, much like an Excel spreadsheet, with the labels of rows and columns in bold.
- Also, notice that the row labels of the DataFrame are built from the union of the index labels of the two Pandas Series we used to construct the dictionary. And the column labels of the DataFrame are taken from the keys of the dictionary.
- Another thing to notice is that in older versions of Pandas the columns are arranged alphabetically and not in the order given in the dictionary (recent versions preserve the dictionary order). We will see later that this won't happen when we load data into a DataFrame from a data file.
- The last thing we want to point out is that we see some NaN values appear in the DataFrame. NaN stands for Not a Number and is Pandas' way of indicating that it doesn't have a value for that particular row and column index.
- If we were to feed this data into a machine learning algorithm, we would have to remove these NaN values first.
Example 2. DataFrame assigns the numerical row indexes by default.
# We create a dictionary of Pandas Series without indexes
data = {'Bob' : pd.Series([245, 25, 55]),
        'Alice' : pd.Series([40, 110, 500, 45])}

# We create a DataFrame
df = pd.DataFrame(data)

# We display the DataFrame
df
We can see that Pandas indexes the rows of the DataFrame starting from 0, just like NumPy indexes ndarrays.
Now, just like with Pandas Series, we can also extract information from DataFrames using attributes. Let's print some information from our shopping_carts DataFrame.
Example 3. Demonstrate a few attributes of DataFrame
# We print some information about shopping_carts
print('shopping_carts has shape:', shopping_carts.shape)
print('shopping_carts has dimension:', shopping_carts.ndim)
print('shopping_carts has a total of:', shopping_carts.size, 'elements')
print()
print('The data in shopping_carts is:\n', shopping_carts.values)
print()
print('The row index in shopping_carts is:', shopping_carts.index)
print()
print('The column index in shopping_carts is:', shopping_carts.columns)
shopping_carts has shape: (5, 2)
shopping_carts has dimension: 2
shopping_carts has a total of: 10 elements

The data in shopping_carts is:
[[ 500. 245.]
[ 40. nan]
[ 110. nan]
[ 45. 25.]
[ nan 55.]]

The row index in shopping_carts is: Index(['bike', 'book', 'glasses', 'pants', 'watch'], dtype='object')

The column index in shopping_carts is: Index(['Alice', 'Bob'], dtype='object')
When creating the shopping_carts DataFrame we passed the entire dictionary to the pd.DataFrame() function. However, there might be cases when you are only interested in a subset of the data. Pandas allows us to select which data we want to put into our DataFrame by means of the keywords columns and index. Let's see some examples:
# We create a DataFrame that only has Bob's data
bob_shopping_cart = pd.DataFrame(items, columns = ['Bob'])

# We display bob_shopping_cart
bob_shopping_cart
Example 4. Selecting specific rows of a DataFrame
# We create a DataFrame that only has selected items for both Alice and Bob
sel_shopping_cart = pd.DataFrame(items, index = ['pants', 'book'])

# We display sel_shopping_cart
sel_shopping_cart
Example 5. Selecting specific rows and columns of a DataFrame
# We create a DataFrame that only has selected items for Alice
alice_sel_shopping_cart = pd.DataFrame(items, index = ['glasses', 'bike'], columns = ['Alice'])

# We display alice_sel_shopping_cart
alice_sel_shopping_cart
You can also manually create DataFrames from a dictionary of lists (arrays). The procedure is the same as before: we start by creating the dictionary and then pass it to the pd.DataFrame() function. In this case, however, all the lists (arrays) in the dictionary must be of the same length. Let's see an example:
Example 6. Create a DataFrame using a dictionary of lists
# We create a dictionary of lists (arrays)
data = {'Integers' : [1,2,3],
        'Floats' : [4.5, 8.2, 9.6]}

# We create a DataFrame
df = pd.DataFrame(data)

# We display the DataFrame
df
Notice that since the data dictionary we created doesn't have label indices, Pandas automatically uses numerical row indexes when it creates the DataFrame. We can, however, attach labels to the row index by using the index keyword in the pd.DataFrame() function. Let's see an example:
Example 7. Create a DataFrame using a dictionary of lists, and custom row-indexes (labels)
# We create a dictionary of lists (arrays)
data = {'Integers' : [1,2,3],
        'Floats' : [4.5, 8.2, 9.6]}

# We create a DataFrame and provide the row index
df = pd.DataFrame(data, index = ['label 1', 'label 2', 'label 3'])

# We display the DataFrame
df
The last method for manually creating Pandas DataFrames that we want to look at is using a list of Python dictionaries. The procedure is the same as before: we start by creating the list of dictionaries and then pass it to the pd.DataFrame() function.
Example 8. Create a DataFrame using a list of dictionaries
# We create a list of Python dictionaries
items2 = [{'bikes': 20, 'pants': 30, 'watches': 35},
          {'watches': 10, 'glasses': 50, 'bikes': 15, 'pants': 5}]

# We create a DataFrame
store_items = pd.DataFrame(items2)

# We display the DataFrame
store_items
Again, notice that since the dictionaries in items2 don't have label indices, Pandas automatically uses numerical row indexes when it creates the DataFrame. As before, we can attach labels to the row index by using the index keyword in the pd.DataFrame() function. Let's assume we are going to use this DataFrame to hold the number of items a particular store has in stock. So, we will label the row indices as store 1 and store 2.
Example 9. Create a DataFrame using a list of dictionaries, and custom row-indexes (labels)
# We create a list of Python dictionaries
items2 = [{'bikes': 20, 'pants': 30, 'watches': 35},
          {'watches': 10, 'glasses': 50, 'bikes': 15, 'pants': 5}]

# We create a DataFrame and provide the row index
store_items = pd.DataFrame(items2, index = ['store 1', 'store 2'])

# We display the DataFrame
store_items
Accessing Elements in Pandas DataFrames
Example 1. Access elements using labels
# We print the store_items DataFrame
print(store_items)

# We access rows, columns and elements using labels
print('How many bikes are in each store:\n', store_items[['bikes']])
print('How many bikes and pants are in each store:\n', store_items[['bikes', 'pants']])
print('What items are in Store 1:\n', store_items.loc[['store 1']])
print('How many bikes are in Store 2:', store_items['bikes']['store 2'])
How many bikes are in each store:
How many bikes and pants are in each store:
What items are in Store 1:
How many bikes are in Store 2: 15
- Note that when accessing individual elements in a DataFrame, as in the last example above, the labels must be provided in the form dataframe[column][row]; providing the row label first produces an error, as the sketch below shows.
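For reference, here is a quick sketch (illustrative only, not part of the original notes) of what happens when the order is reversed: column selection happens first, so Pandas looks for a column named store 2 and raises a KeyError.
# Column label first: this works and returns 15
print(store_items['bikes']['store 2'])

# Row label first looks for a column named 'store 2', which doesn't exist
try:
    store_items['store 2']['bikes']
except KeyError as error:
    print('Reversed order fails with KeyError:', error)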
Example 2. Add a column to an existing DataFrame
# We add a new column named shirts to our store_items DataFrame indicating the number of
# shirts in stock at each store. We will put 15 shirts in store 1 and 2 shirts in store 2
store_items['shirts'] = [15, 2]

# We display the modified DataFrame
store_items
We can see that when we add a new column, the new column is added at the end of our DataFrame.
Example 3. Add a new column based on the arithmetic operation between existing columns of a DataFrame
# We make a new column called suits by adding the number of shirts and pants
store_items['suits'] = store_items['pants'] + store_items['shirts']

# We display the modified DataFrame
store_items
Example 4 a. Create a row to be added to the DataFrame
# We create a list with one Python dictionary that contains the number of different items at the new store
new_items = [{'bikes': 20, 'pants': 30, 'watches': 35, 'glasses': 4}]

# We create a new DataFrame with the new_items and provide an index labeled store 3
new_store = pd.DataFrame(new_items, index = ['store 3'])

# We display the items at the new store
new_store
Example 4 b. Append the row to the DataFrame
# We append store 3 to our store_items DataFrame
store_items = store_items.append(new_store)

# We display the modified DataFrame
store_items
Notice that by appending a new row to the DataFrame, the columns have been put in alphabetical order.
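Note, as a hedged aside that is not part of the original notes: DataFrame.append() was deprecated and then removed in pandas 2.0. In recent pandas versions the same row can be appended with pd.concat, as an alternative to the cell above (with pd.concat the columns keep their original order, so the alphabetical reordering mentioned above may not occur):
# Alternative to the .append() cell above, for pandas 2.x where DataFrame.append no longer exists
store_items = pd.concat([store_items, new_store])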
Example 5. Add new column that has data from the existing columns
# We add a new column using data from particular rows in the watches column
store_items['new watches'] = store_items['watches'][1:]

# We display the modified DataFrame
store_items
It is also possible to insert new columns into a DataFrame anywhere we want. The dataframe.insert(loc, label, data) method allows us to insert a new column in the dataframe at location loc, with the given column label, and the given data. Let's add a new column named shoes right before the suits column. Since suits has numerical index value 4, we will use this value as loc. Let's see how this works:
Example 6. Add new column at a specific location
# We insert a new column with label shoes right before the column with numerical index 4
store_items.insert(4, 'shoes', [8, 5, 0])

# We display the modified DataFrame
store_items
Just as we can add rows and columns, we can also delete them. To delete rows and columns from our DataFrame we will use the .pop() and .drop() methods. The .pop() method only allows us to delete columns, while the .drop() method can be used to delete both rows and columns by use of the axis keyword. Let's see some examples:
Example 7. Delete one column from a DataFrame
# We remove the new watches column
store_items.pop('new watches')
Example 8. Delete multiple columns from a DataFrame
# We remove the watches and shoes columns
store_items = store_items.drop(['watches', 'shoes'], axis = 1)
Example 9. Delete rows from a DataFrame
# We remove the store 2 and store 1 rows
store_items = store_items.drop(['store 2', 'store 1'], axis = 0)
Sometimes we might need to change the row and column labels. Let's change the bikes column label to hats using the .rename() method.
Example 10. Modify the column label
# We change the column label bikes to hats
store_items = store_items.rename(columns = {'bikes': 'hats'})
Example 11. Modify the row label
# We change the row label from store 3 to last store
store_items = store_items.rename(index = {'store 3': 'last store'})
Example 12. Use existing column values as row-index
# We change the row index to be the data in the pants column
store_items = store_items.set_index('pants')
Dealing with NaN
As mentioned earlier, before we can begin training our learning algorithms with large datasets, we usually need to clean the data first. This means we need to have a method for detecting and correcting errors in our data. While any given dataset can have many types of bad data, such as outliers or incorrect values, the type of bad data we encounter most often is missing values. As we saw earlier, Pandas assigns NaN values to missing data.
We will begin by creating a DataFrame with some NaN values in it.
Example 1. Create a DataFrame
# We create a list of Python dictionaries
items2 = [{'bikes': 20, 'pants': 30, 'watches': 35, 'shirts': 15, 'shoes':8, 'suits':45},
          {'watches': 10, 'glasses': 50, 'bikes': 15, 'pants': 5, 'shirts': 2, 'shoes': 5, 'suits': 7},
          {'bikes': 20, 'pants': 30, 'watches': 35, 'glasses': 4, 'shoes': 10}]

# We create a DataFrame and provide the row index
store_items = pd.DataFrame(items2, index = ['store 1', 'store 2', 'store 3'])

# We display the DataFrame
store_items
We can clearly see that the DataFrame we created has 3 NaN values: one in store 1 and two in store 3. However, in cases where we load very large datasets into a DataFrame, possibly with millions of items, the number of NaN values is not easily visualized. For these cases, we can use a combination of methods to count the number of NaN values in our data. The following example combines the .isnull() and the .sum() methods to count the number of NaN values in our DataFrame.
Example 2 a. Count the total NaN values
# We count the number of NaN values in store_items
x = store_items.isnull().sum().sum()

# We print x
print('Number of NaN values in our DataFrame:', x)
Number of NaN values in our DataFrame: 3
In the above example, the .isnull() method returns a Boolean DataFrame of the same size as store_items that indicates with True the elements that have NaN values and with False the elements that do not. Let's see an example:
Example 2 b. Return boolean True/False for each element if it is a NaN
store_items.isnull()
In Pandas, logical True values have numerical value 1 and logical False values have numerical value 0. Therefore, we can count the number of NaN values by counting the number of logical True values. In order to count the total number of logical True values we use the .sum() method twice. We have to use it twice because the first sum returns a Pandas Series with the sums of logical True values along each column, as we see below:
Example 2 c. Count NaN down the column.
store_items.isnull().sum()
bikes 0
glasses 1
pants 0
shirts 1
shoes 0
suits 1
watches 0
dtype: int64
Instead of counting the number of NaN values, we can also do the opposite: we can count the number of non-NaN values. We can do this by using the .count() method, as shown below:
Example 3. Count the total non-NaN values
# We print the number of non-NaN values in our DataFrame
print('Number of non-NaN values in the columns of our DataFrame:\n', store_items.count())
Number of non-NaN values in the columns of our DataFrame:
bikes 3
glasses 2
pants 3
shirts 2
shoes 3
suits 2
watches 3
dtype: int64
Eliminating NaN Values
Now that we know how to find out whether our dataset has any NaN values in it, the next step is to decide what to do with them. In general, we have two options: we can either delete or replace the NaN values. In the following examples, we will show you how to do both.
We will start by learning how to eliminate rows or columns from our DataFrame that contain any NaN values. The .dropna(axis) method eliminates any rows with NaN values when axis = 0 is used, and eliminates any columns with NaN values when axis = 1 is used.
Tip: Remember, you learned that you can read axis = 0 as "down" and axis = 1 as "across" the given NumPy ndarray or Pandas DataFrame object.
Let’s see some examples.
Example 4. Drop rows having NaN values
# We drop any rows with NaN values
store_items.dropna(axis = 0)
Example 5. Drop columns having NaN values
# We drop any columns with NaN values
store_items.dropna(axis = 1)
Notice that the .dropna() method eliminates (drops) the rows or columns with NaN values out of place. This means that the original DataFrame is not modified. You can always remove the desired rows or columns in place by setting the keyword inplace = True inside the dropna() method, as the short sketch below shows.
Substituting NaN Values
Now, instead of eliminating NaN values, we can replace them with suitable values. We could choose, for example, to replace all NaN values with the value 0. We can do this by using the .fillna() method, as shown below.
Example 6. Replace NaN with 0
# We replace all NaN values with 0
store_items.fillna(0)
We can also use the .fillna() method to replace NaN values with previous values in the DataFrame; this is known as forward filling. When replacing NaN values with forward filling, we can use previous values taken from columns or rows. The .fillna(method = 'ffill', axis) call will use the forward filling (ffill) method to replace NaN values using the previous known value along the given axis. Let's see some examples:
Example 7. Forward fill NaN values down (axis = 0) the dataframe
# We replace NaN values with the previous value in the column
store_items.fillna(method = 'ffill', axis = 0)
Notice that the two NaN values in store 3 have been replaced with previous values in their columns. However, notice that the NaN value in store 1 didn't get replaced. That's because there are no previous values in this column, since the NaN value is the first value in that column. However, if we do forward fill using the previous row values, this won't happen. Let's take a look:
Example 8. Forward fill NaN values across (axis = 1) the dataframe
# We replace NaN values with the previous value in the row
store_items.fillna(method = 'ffill', axis = 1)
We see that in this case all the NaN values have been replaced with the previous row values.
Similarly, you can choose to replace the NaN values with the values that go after them in the DataFrame; this is known as backward filling. The .fillna(method = 'backfill', axis) call will use the backward filling (backfill) method to replace NaN values using the next known value along the given axis. Just like with forward filling, we can choose to use row or column values. Let's see some examples:
Example 9. Backward fill NaN values down (axis = 0) the dataframe
# We replace NaN values with the next value in the column
store_items.fillna(method = 'backfill', axis = 0)
Notice that the NaN value in store 1 has been replaced with the next value in its column. However, notice that the two NaN values in store 3 didn't get replaced. That's because there are no next values in these columns, since these NaN values are the last values in those columns. However, if we do backward fill using the next row values, this won't happen. Let's take a look:
Example 10. Backward fill NaN values across (axis = 1) the dataframe
# We replace NaN values with the next value in the row
store_items.fillna(method = 'backfill', axis = 1)
Notice that the .fillna() method replaces (fills) the NaN values out of place. This means that the original DataFrame is not modified. You can always replace the NaN values in place by setting the keyword inplace = True inside the fillna() method.
We can also choose to replace NaN values by using different interpolation methods. For example, the .interpolate(method = 'linear', axis) method will use linear interpolation to replace NaN values using the values along the given axis. Let's see some examples:
Example 11. Interpolate (estimate) NaN values down (axis = 0) the dataframe
# We replace NaN values by using linear interpolation using column values
store_items.interpolate(method = 'linear', axis = 0)
Notice that the two NaN values in store 3 have been replaced with linearly interpolated values. However, notice that the NaN value in store 1 didn't get replaced. That's because the NaN value is the first value in that column, and since there is no data before it, the interpolation function can't calculate a value. Now, let's interpolate using row values instead:
Example 12. Interpolate (estimate) NaN values across (axis = 1) the dataframe
# We replace NaN values by using linear interpolation using row values
store_items.interpolate(method = 'linear', axis = 1)
Just as with the other methods we saw, the .interpolate() method replaces NaN values out of place.
Loading Data into a Pandas DataFrame
- CSV stands for Comma Separated Values and offers a simple format to store data. We can load CSV files into Pandas DataFrames using the pd.read_csv() function.
Example 1. Load the data from a CSV file into a DataFrame
# We load Google stock data in a DataFrame
Google_stock = pd.read_csv('./GOOG.csv')
# We print some information about Google_stock
print('Google_stock is of type:', type(Google_stock))
print('Google_stock has shape:', Google_stock.shape)
Google_stock is of type: <class 'pandas.core.frame.DataFrame'>
Google_stock has shape: (3313, 7)
Example 2. Look at the first 5 rows of the DataFrame
Google_stock.head()
Example 3. Look at the last 5 rows of the DataFrame
Google_stock.tail()
We can also optionally use .head(N) or .tail(N) to display the first or last N rows of data, respectively.
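For instance (a quick illustrative cell, assuming the Google_stock DataFrame loaded above):
# We display the first 10 rows of the stock data
Google_stock.head(10)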
Let's do a quick check to see whether we have any NaN values in our dataset. To do this, we will use the .isnull() method followed by the .any() method to check whether any of the columns contain NaN values.
Example 4. Check if any column contains a NaN. Returns a boolean for each column label.
Google_stock.isnull().any()
Example 5. See the descriptive statistics of the DataFrame
# We get descriptive statistics on our stock data
Google_stock.describe()
Example 6. See the descriptive statistics of one of the columns of the DataFrame
# We get descriptive statistics on a single column of our DataFrame
Google_stock['Adj Close'].describe()
Example 7. Statistical operations — Min, Max, and Mean
# We print information about our DataFrame
print()
print('Maximum values of each column:\n', Google_stock.max())
print()
print('Minimum Close value:', Google_stock['Close'].min())
print()
print('Average value of each column:\n', Google_stock.mean())
Example 8. Statistical operation — Correlation
# We display the correlation between columns
Google_stock.corr()
The groupby() method
- The .groupby() method allows us to group data in different ways.
# We load fake Company data in a DataFrame
data = pd.read_csv('./fake_company.csv')
data
# We display the total amount of money spent in salaries each year
data.groupby(['Year'])['Salary'].sum()
Year
1990 153000
1991 162000
1992 174000
Name: Salary, dtype: int64
Example 11. Demonstrate the groupby() and mean() methods
Now, let's suppose we want to know what the average salary was for each year. In this case, we will group the data by Year using the .groupby() method, just as we did before, and then use the .mean() method to get the average salary. Let's see how this works:
# We display the average salary per year
data.groupby(['Year'])['Salary'].mean()
Year
1990 51000
1991 54000
1992 58000
Name: Salary, dtype: int64
Example 12. Demonstrate groupby() on a single column
Now let's see how much each employee got paid in those three years. In this case, we will group the data by Name using the .groupby() method and then add up the salaries for each year. Let's see the result:
# We display the total salary each employee received in all the years they worked for the company
data.groupby(['Name'])['Salary'].sum()
Name
Alice 162000
Bob 150000
Charlie 177000
Name: Salary, dtype: int64
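As a final illustration (a hedged sketch that reuses the Year, Name and Salary columns already shown above), .groupby() also accepts a list of columns, for example to get the salary each employee received per year:
# We display the salary each employee received per year
data.groupby(['Year', 'Name'])['Salary'].sum()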