# Udacity Data Scientist Nanodegree : Prerequisite — Python(L7)

## Lesson 7: Pandas

# Intro

- Pandas incorporates two additional data structures into Python, namely **Pandas Series** and **Pandas DataFrame**. These data structures allow us to work with *labeled* and *relational* data in an easy and intuitive manner.
- Pandas Documentation

# Why use Pandas?

- One very important step in machine learning is to look at your data first and make sure it is well suited for your training algorithm by doing some basic data analysis. This is where Pandas comes in.
- Based on NumPy

# Pandas Series

A Pandas Series is a **one-dimensional array**-like object that can hold many data types, such as numbers or strings, and has an option to provide axis labels.

## Difference between NumPy ndarrays and Pandas Series

- One of the main differences between Pandas Series and NumPy ndarrays is that **you can assign an index label to each element in the Pandas Series.** In other words, you can name the indices of your Pandas Series anything you want.
- Another big difference between Pandas Series and NumPy ndarrays is that **Pandas Series can hold data of different data types.**
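The two differences above can be seen side by side in a minimal sketch. The variable names `arr` and `s` are illustrative only and do not come from the lesson.

```python
import numpy as np
import pandas as pd

# A NumPy array coerces mixed inputs to a single dtype (here, strings)
arr = np.array([30, 6, 'Yes', 'No'])
print(arr.dtype.kind)   # 'U' means a Unicode string dtype

# A Pandas Series keeps each element as its own Python object
# and lets us attach a label to every element
s = pd.Series([30, 6, 'Yes', 'No'], index=['eggs', 'apples', 'milk', 'bread'])
print(s.dtype)          # object
print(type(s['eggs']))  # <class 'int'>
```

In the array the number 30 has become the string `'30'`, while the Series still stores it as an integer.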

import pandas as pd # pd is the conventional alias

You can create a Pandas Series by using the command `pd.Series(data, index)`, where `index` is a list of index labels.

## Example 1 — Create a Series

import pandas as pd

# We create a Pandas Series that stores a grocery list

groceries = pd.Series(data = [30, 6, 'Yes', 'No'], index = ['eggs', 'apples', 'milk', 'bread'])

# We display the Groceries Pandas Series

groceries

eggs 30

apples 6

milk Yes

bread No

dtype: object

Pandas Series have attributes that allow us to get information from the series in an easy way. Let’s see some of them:

## Example 2 — Print attributes — shape, ndim and size

*# We print some information about Groceries*

print('Groceries has shape:', groceries.shape)

print('Groceries has dimension:', groceries.ndim)

print('Groceries has a total of', groceries.size, 'elements')

Groceries has shape: (4,)

Groceries has dimension: 1

Groceries has a total of 4 elements

We can also print the index labels and the data of the Pandas Series separately. This is useful if you don’t happen to know what the index labels of the Pandas Series are.

## Example 3 — Print attributes — values, and index

*# We print the index and data of Groceries*

print('The data in Groceries is:', groceries.values)

print('The index of Groceries is:', groceries.index)

The data in Groceries is: [30 6 'Yes' 'No']

The index of Groceries is: Index(['eggs', 'apples', 'milk', 'bread'], dtype='object')

If you are dealing with a very large Pandas Series and are not sure whether an index label exists, you can check by using the `in` operator.

## Example 4 — Check if an index is available in the given Series

# We check whether bananas is a food item (an index) in Groceries

x = 'bananas' in groceries

# We check whether bread is a food item (an index) in Groceries

y = 'bread' in groceries

# We print the results

print('Is bananas an index label in Groceries:', x)

print('Is bread an index label in Groceries:', y)

Is bananas an index label in Groceries: False

Is bread an index label in Groceries: True
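One detail worth keeping in mind is that `in` looks at the index labels of a Series, not at its data. The sketch below rebuilds the same grocery list and contrasts the two checks; the variable names on the left-hand side are illustrative.

```python
import pandas as pd

groceries = pd.Series(data=[30, 6, 'Yes', 'No'],
                      index=['eggs', 'apples', 'milk', 'bread'])

label_hit = 'milk' in groceries        # checks the index labels
value_as_label = 'Yes' in groceries    # 'Yes' is a value, not a label
value_hit = 'Yes' in groceries.values  # checks the stored data instead

print(label_hit, value_as_label, value_hit)
```

To test for a value rather than a label, check against `groceries.values` as in the last line.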

# Accessing and Deleting Elements in Pandas Series

- Elements can be accessed using **index labels** or **numerical indices** inside square brackets, [ ], similar to how we access elements in NumPy ndarrays.
- Since we can access elements in various ways, in order to remove any ambiguity as to whether we are referring to an index label or a numerical index, Pandas Series have two attributes, `.loc` and `.iloc`, to explicitly state what we mean.
- The attribute `.loc` stands for *location* and it is used to explicitly state that we are using a labeled index. Similarly, the attribute `.iloc` stands for *integer location* and it is used to explicitly state that we are using a numerical index. Let's see some examples:
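The ambiguity is easiest to see when the index labels are themselves integers. The following is a small sketch with a hypothetical `stock` Series (not part of the lesson) in which label `0` and position `0` point at different elements. Note also that recent pandas 2.x releases emit a deprecation warning when plain square brackets are given integer positions on a Series with non-integer labels, which is one more reason to prefer `.iloc` for positions.

```python
import pandas as pd

# Hypothetical Series whose index labels are integers in reverse order
stock = pd.Series(data=['low', 'medium', 'high'], index=[2, 1, 0])

by_label = stock.loc[0]      # the element whose label is 0
by_position = stock.iloc[0]  # the element in the first position

print('loc[0]:', by_label)
print('iloc[0]:', by_position)
```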

## Example 1. Access elements using index labels

# We access elements in Groceries using index labels:

# We use a single index label

print('How many eggs do we need to buy:', groceries['eggs'])

# we can access multiple index labels

print('Do we need milk and bread:\n', groceries[['milk', 'bread']])

# we use loc to access multiple index labels

print('How many eggs and apples do we need to buy:\n', groceries.loc[['eggs', 'apples']])

# We access elements in Groceries using numerical indices:

# we use multiple numerical indices

print('How many eggs and apples do we need to buy:\n', groceries[[0, 1]])

# We use a negative numerical index

print('Do we need bread:\n', groceries[[-1]])

# We use a single numerical index

print('How many eggs do we need to buy:', groceries[0])

# we use iloc to access multiple numerical indices

print('Do we need milk and bread:\n', groceries.iloc[[2, 3]])

How many eggs do we need to buy: 30

Do we need milk and bread:

milk Yes

bread No

dtype: object

How many eggs and apples do we need to buy:

eggs 30

apples 6

dtype: object

How many eggs and apples do we need to buy:

eggs 30

apples 6

dtype: object

Do we need bread:

bread No

dtype: object

How many eggs do we need to buy: 30

Do we need milk and bread:

milk Yes

bread No

dtype: object

Pandas Series are also **mutable** like NumPy ndarrays, which means we can change the elements of a Pandas Series after it has been created. For example, let’s change the number of eggs we need to buy from our grocery list

## Example 2. Mutate elements using index labels

# We display the original grocery list

print('Original Grocery List:\n', groceries)

# We change the number of eggs to 2

groceries['eggs'] = 2

# We display the changed grocery list

print()

print('Modified Grocery List:\n', groceries)

Original Grocery List:

eggs 30

apples 6

milk Yes

bread No

dtype: object

Modified Grocery List:

eggs 2

apples 6

milk Yes

bread No

dtype: object

We can also delete items from a Pandas Series by using the `.drop()` method. The `Series.drop(label)` method **removes** the given `label` from the given `Series`. We should note that the `Series.drop(label)` method drops elements from the Series out-of-place, meaning that it returns a new Series and doesn't change the original one. Let's see how this works:

## Example 3. Delete elements out-of-place using `drop()`

# We display the original grocery list

print('Original Grocery List:\n', groceries)

# We remove apples from our grocery list. The drop function removes elements out of place

print('We remove apples (out of place):\n', groceries.drop('apples'))

# When we remove elements out of place the original Series remains intact. To see this

# we display our grocery list again

print('Grocery List after removing apples out of place:\n', groceries)

Original Grocery List:

eggs 30

apples 6

milk Yes

bread No

dtype: object

We remove apples (out of place):

eggs 30

milk Yes

bread No

dtype: object

Grocery List after removing apples out of place:

eggs 30

apples 6

milk Yes

bread No

dtype: object

We can delete items from a Pandas Series in place by setting the keyword `inplace` to `True` in the `.drop()` method. Let's see an example:

## Example 4. Delete elements in-place using `drop()`

# We display the original grocery list

print('Original Grocery List:\n', groceries)

# We remove apples from our grocery list in place by setting the inplace keyword to True

groceries.drop('apples', inplace = True)

# When we remove elements in place the original Series is modified. To see this

# we display our grocery list again

print()

print('Grocery List after removing apples in place:\n', groceries)

Original Grocery List:

eggs 30

apples 6

milk Yes

bread No

dtype: object

Grocery List after removing apples in place:

eggs 30

milk Yes

bread No

dtype: object

# Arithmetic Operations on Pandas Series

# We create a Pandas Series that stores a grocery list of just fruits

fruits = pd.Series(data = [10, 6, 3], index = ['apples', 'oranges', 'bananas'])

# We display the fruits Pandas Series

fruits

apples 10

oranges 6

bananas 3

dtype: int64

We can now modify the data in fruits by performing basic arithmetic operations. Let’s see some examples

## Example 1. Element-wise basic arithmetic operations

# We print fruits for reference

print('Original grocery list of fruits:\n ', fruits)

# We perform basic element-wise operations using arithmetic symbols

print('fruits + 2:\n', fruits + 2) # We add 2 to each item in fruits

print('fruits - 2:\n', fruits - 2) # We subtract 2 from each item in fruits

print('fruits * 2:\n', fruits * 2) # We multiply each item in fruits by 2

print('fruits / 2:\n', fruits / 2) # We divide each item in fruits by 2

Original grocery list of fruits:

apples 10

oranges 6

bananas 3

dtype: int64

fruits + 2:

apples 12

oranges 8

bananas 5

dtype: int64

fruits - 2:

apples 8

oranges 4

bananas 1

dtype: int64

fruits * 2:

apples 20

oranges 12

bananas 6

dtype: int64

fruits / 2:

apples 5.0

oranges 3.0

bananas 1.5

dtype: float64

You can also apply mathematical functions from NumPy, such as `sqrt(x)`, to all elements of a Pandas Series.

## Example 2. Use mathematical functions from NumPy to operate on Series

# We import NumPy as np to be able to use the mathematical functions

import numpy as np

# We print fruits for reference

print('Original grocery list of fruits:\n', fruits)

# We apply different mathematical functions to all elements of fruits

print('EXP(X) = \n', np.exp(fruits))

print('SQRT(X) =\n', np.sqrt(fruits))

# We raise all elements of fruits to the power of 2

print('POW(X,2) =\n', np.power(fruits, 2))

Original grocery list of fruits:

apples 10

oranges 6

bananas 3

dtype: int64

EXP(X) =

apples 22026.465795

oranges 403.428793

bananas 20.085537

dtype: float64

SQRT(X) =

apples 3.162278

oranges 2.449490

bananas 1.732051

dtype: float64

POW(X,2) =

apples 100

oranges 36

bananas 9

dtype: int64

Pandas also allows us to only apply arithmetic operations on selected items in our fruits grocery list. Let’s see some examples

## Example 3. Perform arithmetic operations on selected elements

# We print fruits for reference

print('Original grocery list of fruits:\n ', fruits)

# We add 2 only to the bananas

print('Amount of bananas + 2 = ', fruits['bananas'] + 2)

# We subtract 2 from apples

print('Amount of apples - 2 = ', fruits.iloc[0] - 2)

# We multiply apples and oranges by 2

print('We double the amount of apples and oranges:\n', fruits[['apples', 'oranges']] * 2)

# We divide apples and oranges by 2

print('We half the amount of apples and oranges:\n', fruits.loc[['apples', 'oranges']] / 2)

Original grocery list of fruits:

apples 10

oranges 6

bananas 3

dtype: int64

Amount of bananas + 2 = 5

Amount of apples - 2 = 8

We double the amount of apples and oranges:

apples 20

oranges 12

dtype: int64

We half the amount of apples and oranges:

apples 5.0

oranges 3.0

dtype: float64

You can also apply arithmetic operations on Pandas Series of mixed data type provided that the arithmetic operation is defined for *all* data types in the Series, otherwise, you will get an error. Let’s see what happens when we multiply our grocery list by 2

## Example 4. Perform multiplication on a Series having integer and string elements

*# We multiply our grocery list by 2*

groceries * 2

eggs 60

apples 12

milk YesYes

bread NoNo

dtype: object

**Make sure the arithmetic operations are valid on all the data types of your elements.**
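To illustrate the warning above, here is a minimal sketch of an operation that is *not* defined for every element. It rebuilds the original grocery list and tries to divide it by 2; the flag `division_failed` is an illustrative name.

```python
import pandas as pd

groceries = pd.Series(data=[30, 6, 'Yes', 'No'],
                      index=['eggs', 'apples', 'milk', 'bread'])

# Multiplication is defined for both int and str, so this works
doubled = groceries * 2

# Division is not defined for str, so the whole operation fails
try:
    groceries / 2
    division_failed = False
except TypeError as err:
    division_failed = True
    print('Division failed:', err)
```

The error is raised for the Series as a whole, so even the numeric entries are not divided.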

# Creating Pandas DataFrames

Pandas DataFrames are **two-dimensional** data structures with labeled rows and columns that can hold many data types. If you are familiar with Excel, you can think of Pandas DataFrames as being similar to a **spreadsheet**. We can create Pandas DataFrames manually or by loading data from a file.

## Create a DataFrame manually

We will start by creating a DataFrame manually from a dictionary of Pandas Series. It is a two-step process:

- The first step is to create the dictionary of Pandas Series.
- After the dictionary is created we can then pass the dictionary to the `pd.DataFrame()` function.

We will create a dictionary that contains items purchased by two people, Alice and Bob, on an online store. The Pandas Series will use the price of the items purchased as *data*, and the purchased items will be used as the *index* labels to the Pandas Series. Let's see how this is done in code:

# We import Pandas as pd into Python

import pandas as pd

# We create a dictionary of Pandas Series

items = {'Bob' : pd.Series(data = [245, 25, 55], index = ['bike', 'pants', 'watch']),

'Alice' : pd.Series(data = [40, 110, 500, 45], index = ['book', 'glasses', 'bike', 'pants'])}

# We print the type of items to see that it is a dictionary

print(type(items))

`<class 'dict'>`

Now that we have a dictionary, we are ready to create a DataFrame by passing it to the `pd.DataFrame()` function. We will create a DataFrame that could represent the shopping carts of various users; in this case we have only two users, Alice and Bob.

## Example 1. Create a DataFrame using a dictionary of Series.

# We create a Pandas DataFrame by passing it a dictionary of Pandas Series

shopping_carts = pd.DataFrame(items)

# We display the DataFrame

shopping_carts

# This returns a table: the rows are the index labels, the columns are the keys of the dict

There are several things to notice here, as explained below:

- We see that DataFrames are displayed in tabular form, much like an Excel spreadsheet, with the labels of rows and columns in **bold**.
- Also, notice that the row labels of the DataFrame are built from the union of the index labels of the two Pandas Series we used to construct the dictionary. And the column labels of the DataFrame are taken from the *keys* of the dictionary.
- Another thing to notice is that, in the output shown here, the columns are arranged alphabetically and not in the order given in the dictionary. (Recent pandas versions keep the dictionary's insertion order instead.) We will see later that this won't happen when we load data into a DataFrame from a data file.
- The last thing we want to point out is that we see some `NaN` values appear in the DataFrame. **NaN** stands for *Not a Number*, and is Pandas' way of indicating that it doesn't have a value for that particular row and column index.
- If we were to feed this data into a machine learning algorithm **we will have to remove these NaN values first.**

## Example 2. DataFrame assigns the numerical row indexes by default.

# We create a dictionary of Pandas Series without indexes

data = {'Bob' : pd.Series([245, 25, 55]),

'Alice' : pd.Series([40, 110, 500, 45])}

# We create a DataFrame

df = pd.DataFrame(data)

# We display the DataFrame

df

We can see that Pandas indexes the rows of the DataFrame starting from 0, just like NumPy indexes ndarrays.

Now, just like with Pandas Series, we can also extract information from DataFrames using attributes. Let's print some information from our `shopping_carts` DataFrame.

## Example 3. Demonstrate a few attributes of DataFrame

*# We print some information about shopping_carts*

print('shopping_carts has shape:', shopping_carts.shape)

print('shopping_carts has dimension:', shopping_carts.ndim)

print('shopping_carts has a total of:', shopping_carts.size, 'elements')

print()

print('The data in shopping_carts is:\n', shopping_carts.values)

print()

print('The row index in shopping_carts is:', shopping_carts.index)

print()

print('The column index in shopping_carts is:', shopping_carts.columns)

shopping_carts has shape: (5, 2)

shopping_carts has dimension: 2

shopping_carts has a total of: 10 elements

The data in shopping_carts is:

[[ 500. 245.]

[ 40. nan]

[ 110. nan]

[ 45. 25.]

[ nan 55.]]

The row index in shopping_carts is: Index(['bike', 'book', 'glasses', 'pants', 'watch'], dtype='object')

The column index in shopping_carts is: Index(['Alice', 'Bob'], dtype='object')

When creating the `shopping_carts` DataFrame we passed the entire dictionary to the `pd.DataFrame()` function. However, there might be cases when you are only interested in a subset of the data. Pandas allows us to select which data we want to put into our DataFrame by means of the keywords `columns` and `index`. Let's see some examples:

# We create a DataFrame that only has Bob's data

bob_shopping_cart = pd.DataFrame(items, columns=['Bob'])

# We display bob_shopping_cart

bob_shopping_cart

## Example 4. Selecting specific rows of a DataFrame

# We create a DataFrame that only has selected items for both Alice and Bob

sel_shopping_cart = pd.DataFrame(items, index = ['pants', 'book'])

# We display sel_shopping_cart

sel_shopping_cart

## Example 5. Selecting specific columns of a DataFrame

# We create a DataFrame that only has selected items for Alice

alice_sel_shopping_cart = pd.DataFrame(items, index = ['glasses', 'bike'], columns = ['Alice'])

# We display alice_sel_shopping_cart

alice_sel_shopping_cart

You can also manually create DataFrames from a dictionary of lists (arrays). The procedure is the same as before: we start by creating the dictionary and then pass the dictionary to the `pd.DataFrame()` function. In this case, however, all the lists (arrays) in the dictionary must be of the same length. Let's see an example:

## Example 6. Create a DataFrame using a dictionary of lists

# We create a dictionary of lists (arrays)

data = {'Integers' : [1,2,3],

'Floats' : [4.5, 8.2, 9.6]}

# We create a DataFrame

df = pd.DataFrame(data)

# We display the DataFrame

df

Notice that since the `data` dictionary we created doesn't have label indices, Pandas automatically uses numerical row indexes when it creates the DataFrame. We can, however, put labels to the row index by using the `index` keyword in the `pd.DataFrame()` function. Let's see an example:

## Example 7. Create a DataFrame using a dictionary of lists, and custom row-indexes (labels)

# We create a dictionary of lists (arrays)

data = {'Integers' : [1,2,3],

'Floats' : [4.5, 8.2, 9.6]}

# We create a DataFrame and provide the row index

df = pd.DataFrame(data, index = ['label 1', 'label 2', 'label 3'])

# We display the DataFrame

df
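Returning to the same-length requirement for a dictionary of lists, the sketch below shows what happens when it is violated, along with one possible workaround. The dictionary `bad_data` is hypothetical, and wrapping the lists in Series is just one option, reusing the dictionary-of-Series approach from earlier in this lesson.

```python
import pandas as pd

# Hypothetical data where one list is shorter than the other
bad_data = {'Integers': [1, 2, 3],
            'Floats': [4.5, 8.2]}

try:
    pd.DataFrame(bad_data)
    lengths_ok = True
except ValueError as err:
    lengths_ok = False
    print('Could not build the DataFrame:', err)

# Wrapping each list in a Series lets Pandas align on the index
# and pad the shorter column with NaN
padded = pd.DataFrame({key: pd.Series(values) for key, values in bad_data.items()})
print(padded)
```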

The last method for manually creating Pandas DataFrames that we want to look at is by using a list of Python dictionaries. The procedure is the same as before: we start by creating the list of dictionaries and then pass it to the `pd.DataFrame()` function.

## Example 8. Create a DataFrame using a list of dictionaries

# We create a list of Python dictionaries

items2 = [{'bikes': 20, 'pants': 30, 'watches': 35},

{'watches': 10, 'glasses': 50, 'bikes': 15, 'pants':5}]

# We create a DataFrame

store_items = pd.DataFrame(items2)

# We display the DataFrame

store_items

Again, notice that since the `items2` list we created doesn't have label indices, Pandas automatically uses numerical row indexes when it creates the DataFrame. As before, we can put labels to the row index by using the `index` keyword in the `pd.DataFrame()` function. Let's assume we are going to use this DataFrame to hold the number of items a particular store has in stock. So, we will label the row indices as **store 1** and **store 2**.

## Example 9. Create a DataFrame using a list of dictionaries, and custom row-indexes (labels)

# We create a list of Python dictionaries

items2 = [{'bikes': 20, 'pants': 30, 'watches': 35},

{'watches': 10, 'glasses': 50, 'bikes': 15, 'pants':5}]

# We create a DataFrame and provide the row index

store_items = pd.DataFrame(items2, index = ['store 1', 'store 2'])

# We display the DataFrame

store_items

# Accessing Elements in Pandas DataFrames

## Example 1. Access elements using labels

# We print the store_items DataFrame

print(store_items)

# We access rows, columns and elements using labels

print('How many bikes are in each store:\n', store_items[['bikes']])

print('How many bikes and pants are in each store:\n', store_items[['bikes', 'pants']])

print('What items are in Store 1:\n', store_items.loc[['store 1']])

print('How many bikes are in Store 2:', store_items['bikes']['store 2'])

How many bikes are in each store:

How many bikes and pants are in each store:

What items are in Store 1:

How many bikes are in Store 2: 15

- To access a single element, use the form `dataframe[column][row]`: the column label comes first, followed by the row label.
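As a small sketch of the element access described above, the block below rebuilds `store_items` and reads the same cell in two ways. The `.loc[row, column]` form is not used in the lesson for DataFrames; it is shown here as an equivalent lookup, with the labels in the opposite order.

```python
import pandas as pd

items2 = [{'bikes': 20, 'pants': 30, 'watches': 35},
          {'watches': 10, 'glasses': 50, 'bikes': 15, 'pants': 5}]
store_items = pd.DataFrame(items2, index=['store 1', 'store 2'])

# Chained form from the text: column label first, then row label
chained = store_items['bikes']['store 2']

# Equivalent single lookup with .loc: row label first, then column label
direct = store_items.loc['store 2', 'bikes']

print(chained, direct)
```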

## Example 2. Add a column to an existing DataFrame

# We add a new column named shirts to our store_items DataFrame indicating the number of

# shirts in stock at each store. We will put 15 shirts in store 1 and 2 shirts in store 2

store_items['shirts'] = [15,2]

# We display the modified DataFrame

store_items

We can see that when we add a new column, the new column is added at the **end** of our DataFrame.

## Example 3. Add a new column based on the arithmetic operation between existing columns of a DataFrame

# We make a new column called suits by adding the number of shirts and pants

store_items['suits'] = store_items['pants'] + store_items['shirts']

# We display the modified DataFrame

store_items

## Example 4 a. Create a row to be added to the DataFrame

# We create a list of Python dictionaries that will contain the number of different items at the new store

new_items = [{'bikes': 20, 'pants': 30, 'watches': 35, 'glasses': 4}]

# We create a new DataFrame with the new_items and provide an index labeled store 3

new_store = pd.DataFrame(new_items, index = ['store 3'])

# We display the items at the new store

new_store

## Example 4 b. Append the row to the DataFrame

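# We append store 3 to our store_items DataFrame

# (the original lesson used store_items.append(new_store), which was removed in pandas 2.0;

# pd.concat with sort = True is used here to give the same column ordering)

store_items = pd.concat([store_items, new_store], sort = True)

# We display the modified DataFrame

store_items

Notice that by appending a new row to the DataFrame, the columns **have been put in alphabetical order.**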

## Example 5. Add new column that has data from the existing columns

# We add a new column using data from particular rows in the watches column

store_items['new watches'] = store_items['watches'][1:]

# We display the modified DataFrame

store_items

It is also possible to insert new columns into a DataFrame anywhere we want. The `dataframe.insert(loc,label,data)` method allows us to insert a new column in the `dataframe` at location `loc`, with the given column `label`, and given `data`. Let's add a new column named **shoes** right before the **suits** column. Since **suits** has numerical index value 4, we will use this value as `loc`. Let's see how this works:

## Example 6. Add new column at a specific location

# We insert a new column with label shoes right before the column with numerical index 4

store_items.insert(4, 'shoes', [8,5,0])

# We display the modified DataFrame

store_items
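Rather than counting column positions by hand, the position can be looked up from the column label. The following is a minimal sketch on a small hypothetical frame `df` (standing in for `store_items`); `columns.get_loc()` is not used in the lesson itself.

```python
import pandas as pd

# Small hypothetical frame standing in for store_items
df = pd.DataFrame({'bikes': [20, 15], 'shirts': [15, 2], 'suits': [45, 7]},
                  index=['store 1', 'store 2'])

# Look up the position of the suits column instead of counting by hand
loc = df.columns.get_loc('suits')

# Insert shoes right before suits; insert() modifies df in place
df.insert(loc, 'shoes', [8, 5])

print(list(df.columns))
```

This keeps the code correct even if the column order changes between pandas versions.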

Just as we can add rows and columns, we can also delete them. To delete rows and columns from our DataFrame we will use the `.pop()` and `.drop()` methods. The `.pop()` method only allows us to delete columns, while the `.drop()` method can be used to delete both rows and columns by use of the `axis` keyword. Let's see some examples:

## Example 7. Delete one column from a DataFrame

*# We remove the new watches column*

store_items.pop('new watches')

## Example 8. Delete multiple columns from a DataFrame

*# We remove the watches and shoes columns*

store_items = store_items.drop(['watches', 'shoes'], axis = 1)

## Example 9. Delete rows from a DataFrame

*# We remove the store 2 and store 1 rows*

store_items = store_items.drop(['store 2', 'store 1'], axis = 0)
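The two methods also differ in what they return, which the examples above do not show directly. Here is a short sketch on a hypothetical frame `df`, assuming nothing beyond the `.pop()` and `.drop()` calls already used in this section.

```python
import pandas as pd

df = pd.DataFrame({'bikes': [20, 15], 'pants': [30, 5], 'watches': [35, 10]},
                  index=['store 1', 'store 2'])

# pop() removes the column from df in place and returns it as a Series
watches = df.pop('watches')

# drop() returns a new DataFrame and leaves df unchanged unless reassigned
without_pants = df.drop(['pants'], axis=1)
without_store_1 = df.drop(['store 1'], axis=0)

print(list(df.columns))             # df lost only the popped column
print(list(without_pants.columns))
print(list(without_store_1.index))
```

This is why the `.drop()` examples above assign the result back to `store_items`, while the `.pop()` example does not need to.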

Sometimes we might need to change the row and column labels. Let's change the **bikes** column label to **hats** using the `.rename()` method.

## Example 10. Modify the column label

*# We change the column label bikes to hats*

store_items = store_items.rename(columns = {'bikes': 'hats'})

## Example 11. Modify the row label

*# We change the row label from store 3 to last store*

store_items = store_items.rename(index = {'store 3': 'last store'})

## Example 12. Use existing column values as row-index

*# We change the row index to be the data in the pants column*

store_items = store_items.set_index('pants')

# Dealing with NaN

As mentioned earlier, before we can begin training our learning algorithms with large datasets, we usually need to **clean the data** first. This means we need to have a method for detecting and correcting errors in our data. While any given dataset can have many types of bad data, such as outliers or incorrect values, the type of bad data we encounter most often is missing values. As we saw earlier, Pandas assigns `NaN` values to missing data.

We will begin by creating a DataFrame with some `NaN` values in it.

## Example 1. Create a DataFrame

# We create a list of Python dictionaries

items2 = [{'bikes': 20, 'pants': 30, 'watches': 35, 'shirts': 15, 'shoes':8, 'suits':45},

{'watches': 10, 'glasses': 50, 'bikes': 15, 'pants':5, 'shirts': 2, 'shoes':5, 'suits':7},

{'bikes': 20, 'pants': 30, 'watches': 35, 'glasses': 4, 'shoes':10}]

# We create a DataFrame and provide the row index

store_items = pd.DataFrame(items2, index = ['store 1', 'store 2', 'store 3'])

# We display the DataFrame

store_items

We can clearly see that the DataFrame we created has 3 `NaN` values: one in store 1 and two in store 3. However, in cases where we load very large datasets into a DataFrame, possibly with millions of items, the number of `NaN` values is not easily visualized. For these cases, we can use a **combination of methods to count the number of `NaN` values in our data**. The following example combines the `.isnull()` and the `.sum()` methods to count the number of `NaN` values in our DataFrame.

## Example 2 a. Count the total NaN values

# We count the number of NaN values in store_items

x = store_items.isnull().sum().sum()

# We print x

print('Number of NaN values in our DataFrame:', x)

Number of NaN values in our DataFrame: 3

In the above example, the `.isnull()` method returns a *Boolean* DataFrame of the same size as `store_items` and indicates with `True` the elements that have `NaN` values and with `False` the elements that do not. Let's see an example:

## Example 2 b. Return boolean True/False for each element if it is a NaN

`store_items.isnull()`

In Pandas, logical `True` values have numerical value 1 and logical `False` values have numerical value 0. Therefore, we can count the number of `NaN` values by counting the number of logical `True` values. In order to count the total number of logical `True` values we use the `.sum()` method twice. We have to use it twice because the first sum returns a Pandas Series with the sums of logical `True` values along columns, as we see below:

## Example 2 c. Count NaN down the column.

`store_items.isnull().sum()`

bikes 0

glasses 1

pants 0

shirts 1

shoes 0

suits 1

watches 0

dtype: int64

Instead of counting the number of `NaN` values we can also do the opposite: we can count the number of *non-NaN* values. We can do this by using the `.count()` method as shown below:

## Example 3. Count the total non-NaN values

*# We print the number of non-NaN values in our DataFrame*

print('Number of non-NaN values in the columns of our DataFrame:\n', store_items.count())

Number of non-NaN values in the columns of our DataFrame:

bikes 3

glasses 2

pants 3

shirts 2

shoes 3

suits 2

watches 3

dtype: int64
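The two counting approaches can be cross-checked against each other. The sketch below uses a small hypothetical frame `df` with two missing values and confirms that, for every column, the missing and non-missing counts add up to the number of rows.

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({'bikes': [20, 15, 20],
                   'glasses': [np.nan, 50, 4],
                   'shirts': [15, 2, np.nan]},
                  index=['store 1', 'store 2', 'store 3'])

missing_per_column = df.isnull().sum()   # NaN count for each column
present_per_column = df.count()          # non-NaN count for each column
total_missing = df.isnull().sum().sum()  # NaN count for the whole frame

# Missing plus present must equal the number of rows in every column
consistent = (missing_per_column + present_per_column == len(df)).all()
print(total_missing, consistent)
```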

# Eliminating NaN Values

Now that we have learned how to tell whether our dataset has any `NaN` values in it, the next step is to decide what to do with them. In general, we have two options: we can either **delete or replace** the `NaN` values. In the following examples, we will show you how to do both.

We will start by learning how to eliminate rows or columns from our DataFrame that contain any `NaN` values. The `.dropna(axis)` method eliminates any *rows* with `NaN` values when `axis = 0` is used and will eliminate any *columns* with `NaN` values when `axis = 1` is used.

Tip: Remember, you learned that you can read `axis = 0` as "down" and `axis = 1` as "across" the given NumPy ndarray or Pandas DataFrame object.
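The "down" and "across" reading of the tip can be checked with a simple sum on a tiny hypothetical frame; the names `down` and `across` are illustrative.

```python
import pandas as pd

df = pd.DataFrame({'a': [1, 2, 3], 'b': [10, 20, 30]})

down = df.sum(axis=0)    # moves down the rows: one result per column
across = df.sum(axis=1)  # moves across the columns: one result per row

print(down.tolist())
print(across.tolist())
```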

Let’s see some examples.

## Example 4. Drop rows having NaN values

*# We drop any rows with NaN values*

store_items.dropna(axis = 0)

## Example 5. Drop columns having NaN values

*# We drop any columns with NaN values*

store_items.dropna(axis = 1)

Notice that the `.dropna()` method eliminates (drops) the rows or columns with `NaN` values out of place. This means that the original DataFrame is not modified. You can always remove the desired rows or columns in place by setting the keyword `inplace = True` inside the `dropna()` function.
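To make the out-of-place behaviour concrete, here is a minimal sketch on a hypothetical frame `df` with a single `NaN`. Both calls return new DataFrames, and the original still contains the missing value afterwards.

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({'bikes': [20, 15, 20],
                   'glasses': [np.nan, 50, 4]},
                  index=['store 1', 'store 2', 'store 3'])

rows_kept = df.dropna(axis=0)  # drops store 1, returns a new DataFrame
cols_kept = df.dropna(axis=1)  # drops glasses, returns a new DataFrame

# df itself still contains the NaN because neither call used inplace=True
still_missing = df.isnull().sum().sum()

print(list(rows_kept.index), list(cols_kept.columns), still_missing)
```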

## Substituting NaN Values

Now, instead of eliminating `NaN` values, we can replace them with suitable values. We could choose, for example, to replace all `NaN` values with the value 0. We can do this by using the `.fillna()` method as shown below.

## Example 6. Replace NaN with 0

*# We replace all NaN values with 0*

store_items.fillna(0)

We can also use the `.fillna()` method to replace `NaN` values with previous values in the DataFrame; this is known as **forward filling**. When replacing `NaN` values with forward filling, we can use previous values taken from columns or rows. The `.fillna(method = 'ffill', axis)` call will use the forward filling (`ffill`) method to replace `NaN` values using the previous known value along the given `axis`. Let's see some examples:

## Example 7. Forward fill NaN values down (axis = 0) the dataframe

*# We replace NaN values with the previous value in the column*

store_items.fillna(method = 'ffill', axis = 0)

Notice that the two `NaN` values in **store 3** have been replaced with previous values in their columns. However, notice that the `NaN` value in **store 1** didn't get replaced. That's because there are no previous values in this column, since the `NaN` value is the first value in that column. However, if we do forward fill using the previous row values, this won't happen. Let's take a look:

## Example 8. Forward fill NaN values across (axis = 1) the dataframe

*# We replace NaN values with the previous value in the row*

store_items.fillna(method = 'ffill', axis = 1)

We see that in this case all the `NaN` values have been replaced with the previous row values.

Similarly, you can choose to replace the `NaN` values with the values that go after them in the DataFrame; this is known as *backward filling*. The `.fillna(method = 'backfill', axis)` call will use the backward filling (`backfill`) method to replace `NaN` values using the next known value along the given `axis`. Just like with forward filling, we can choose to use row or column values. Let's see some examples:

## Example 9. Backward fill NaN values down (axis = 0) the dataframe

*# We replace NaN values with the next value in the column*

store_items.fillna(method = 'backfill', axis = 0)

Notice that the `NaN` value in **store 1** has been replaced with the next value in its column. However, notice that the two `NaN` values in **store 3** didn't get replaced. That's because there are no next values in these columns, since these `NaN` values are the last values in those columns. However, if we do backward fill using the next row values, this won't happen. Let's take a look:

## Example 10. Backward fill NaN values across (axis = 1) the dataframe

*# We replace NaN values with the next value in the row*

store_items.fillna(method = 'backfill', axis = 1)

Notice that the `.fillna()` method replaces (fills) the `NaN` values out of place. This means that the original DataFrame is not modified. You can always replace the `NaN` values in place by setting the keyword `inplace = True` inside the `fillna()` function.
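Recent pandas releases (2.1 and later) deprecate the `method` keyword of `.fillna()` in favour of the dedicated `.ffill()` and `.bfill()` methods. The sketch below shows those equivalents on a small hypothetical frame `df`; it is an alternative spelling of the forward and backward fills described above, not the code from the original lesson.

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({'bikes': [20.0, 15.0, 20.0],
                   'glasses': [np.nan, 50.0, 4.0],
                   'shirts': [15.0, 2.0, np.nan]},
                  index=['store 1', 'store 2', 'store 3'])

forward_down = df.ffill(axis=0)    # previous value in the same column
forward_across = df.ffill(axis=1)  # previous value in the same row
backward_down = df.bfill(axis=0)   # next value in the same column

print(forward_down)
print(forward_across)
print(backward_down)
```

As in the text, a `NaN` in the first row survives a forward fill down the columns, and a `NaN` in the last row survives a backward fill down the columns.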

We can also choose to replace `NaN` values by using different interpolation methods. For example, the `.interpolate(method = 'linear', axis)` method will use `linear` interpolation to replace `NaN` values using the values along the given `axis`. Let's see some examples:

## Example 11. Interpolate (estimate) NaN values down (axis = 0) the dataframe

*# We replace NaN values by using linear interpolation using column values*

store_items.interpolate(method = 'linear', axis = 0)

Notice that the two `NaN` values in **store 3** have been filled in. Because they are the last values in their columns, pandas by default repeats the last known value rather than interpolating between two points. However, notice that the `NaN` value in **store 1** didn't get replaced. That's because the `NaN` value is the first value in that column, and since there is no data before it, the interpolation function can't calculate a value. Now, let's interpolate using row values instead:

## Example 12. Interpolate (estimate) NaN values across (axis = 1) the dataframe

*# We replace NaN values by using linear interpolation using row values*

store_items.interpolate(method = 'linear', axis = 1)

Just as with the other methods we saw, the `.interpolate()` method replaces `NaN` values out of place.
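As a rough illustration of how linear interpolation treats gaps at different positions, the sketch below uses a made-up two-column DataFrame (the `toy` name and values are assumptions). It covers a leading `NaN`, a `NaN` between two known values, and a trailing `NaN`.

```python
import numpy as np
import pandas as pd

# Hypothetical toy data, used only for illustration
toy = pd.DataFrame({
    'a': [np.nan, 10.0, np.nan, 30.0],  # leading NaN and a gap in the middle
    'b': [1.0, np.nan, 3.0, np.nan],    # gap in the middle and a trailing NaN
})

# Interpolate down each column; toy itself is left unchanged
by_column = toy.interpolate(method='linear', axis=0)

print(by_column)
```

With pandas' default settings, the gap in the middle of each column is filled with the midpoint of its neighbours, the leading `NaN` in `a` stays `NaN`, and the trailing `NaN` in `b` takes the last known value.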

# Loading Data into a pandas DataFrame

- CSV stands for *Comma Separated Values* and offers a simple format to store data. We can load CSV files into Pandas DataFrames using the `pd.read_csv()` function.

*# We load Google stock data in a DataFrame*

Google_stock = pd.read_csv('./GOOG.csv')

*# We print some information about Google_stock*

print('Google_stock is of type:', type(Google_stock))

print('Google_stock has shape:', Google_stock.shape)

Google_stock is of type: <class 'pandas.core.frame.DataFrame'>

Google_stock has shape: (3313, 7)
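If you don't have `GOOG.csv` at hand, the same loading step can be tried on CSV text held in memory. The sketch below is an assumption-laden stand-in: the column names and every value in `csv_text` are made up, and `io.StringIO` simply plays the role of the file on disk.

```python
import io

import pandas as pd

# Hypothetical CSV text standing in for a file on disk; the values are made up
csv_text = """Date,Open,Close,Volume
2020-01-02,10.0,10.5,1000
2020-01-03,10.5,10.2,1200
2020-01-06,10.2,10.8,900
"""

# pd.read_csv accepts a file path or any file-like object
prices = pd.read_csv(io.StringIO(csv_text))

print(type(prices))
print(prices.shape)   # (number of rows, number of columns)
print(prices.dtypes)  # the data type pandas inferred for each column
```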

## Example 2. Look at the first 5 rows of the DataFrame

`Google_stock.head()`

## Example 3. Look at the last 5 rows of the DataFrame

`Google_stock.tail()`

We can also optionally use `.head(N)` or `.tail(N)` to display the first and last `N` rows of data, respectively.
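A quick sketch of `.head(N)` and `.tail(N)` on a made-up ten-row DataFrame (`numbers` is an assumption for illustration):

```python
import pandas as pd

# Hypothetical DataFrame with ten rows, numbered 0 to 9
numbers = pd.DataFrame({'n': range(10)})

first_three = numbers.head(3)  # rows 0, 1, 2
last_two = numbers.tail(2)     # rows 8, 9
default_head = numbers.head()  # with no argument, 5 rows are returned

print(first_three)
print(last_two)
```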

Let’s do a quick check to see whether we have any `NaN` values in our dataset. To do this, we will use the `.isnull()` method followed by the `.any()` method to check whether any of the columns contain `NaN` values.

## Example 4. Check if any column contains a NaN. Returns a boolean for each column label.

`Google_stock.isnull().any()`
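Here is a small sketch of what this check returns, using a made-up DataFrame (`toy` and its values are assumptions). Chaining `.sum()` instead of `.any()` gives the number of `NaN` values per column, which is often a useful follow-up.

```python
import numpy as np
import pandas as pd

# Hypothetical toy data with one missing value in the 'Close' column
toy = pd.DataFrame({'Open': [1.0, 2.0, 3.0], 'Close': [1.5, np.nan, 2.5]})

# One boolean per column: True if the column contains at least one NaN
has_nan = toy.isnull().any()

# One integer per column: how many NaN values the column contains
nan_counts = toy.isnull().sum()

print(has_nan)
print(nan_counts)
```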

## Example 5. See the descriptive statistics of the DataFrame

*# We get descriptive statistics on our stock data*

Google_stock.describe()

## Example 6. See the descriptive statistics of one of the columns of the DataFrame

*# We get descriptive statistics on a single column of our DataFrame*

Google_stock['Adj Close'].describe()

## Example 7. Statistical operations — Min, Max, and Mean

*# We print information about our DataFrame*

print()

print('Maximum values of each column:\n', Google_stock.max())

print()

print('Minimum Close value:', Google_stock['Close'].min())

print()

print('Average value of each column:\n', Google_stock.mean(numeric_only = True))

## Example 8. Statistical operation — Correlation

*# We display the correlation between columns*

Google_stock.corr(numeric_only = True)
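One caveat when a DataFrame mixes text and numbers: in pandas 2.0 and later, `.mean()` and `.corr()` raise an error on non-numeric columns unless you ask them to skip those columns. The sketch below shows the `numeric_only = True` option on a made-up DataFrame; `toy`, its text `Date` column, and its values are assumptions for illustration.

```python
import pandas as pd

# Hypothetical toy data: one text column and two numeric columns
toy = pd.DataFrame({
    'Date': ['2020-01-02', '2020-01-03', '2020-01-06'],
    'Open': [10.0, 11.0, 12.0],
    'Close': [20.0, 22.0, 24.0],
})

# Skip the text column and average only the numeric ones
means = toy.mean(numeric_only=True)

# Pairwise correlation between the numeric columns only
correlations = toy.corr(numeric_only=True)

print(means)
print(correlations)
```

Here `Close` is exactly twice `Open`, so their correlation comes out as 1.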

# `groupby()` method

- The `.groupby()` method allows us to group rows that share a value in one or more columns, and then aggregate each group, for example with `.sum()` or `.mean()`.

*# We load fake Company data in a DataFrame*

data = pd.read_csv('./fake_company.csv')

data

*# We display the total amount of money spent in salaries each year*

data.groupby(['Year'])['Salary'].sum()

Year

1990 153000

1991 162000

1992 174000

Name: Salary, dtype: int64

## Example 11. Demonstrate the `groupby()` and `mean()` methods

Now, let’s suppose we want to know the average salary for each year. In this case, we will group the data by *Year* using the `.groupby()` method, just as we did before, and then use the `.mean()` method to get the average salary. Let's see how this works:

*# We display the average salary per year*

data.groupby(['Year'])['Salary'].mean()

Year

1990 51000

1991 54000

1992 58000

Name: Salary, dtype: int64

## Example 12. Demonstrate `groupby()` on a single column

Now let’s see how much each employee got paid over those three years. In this case, we will group the data by *Name* using the `.groupby()` method and then add up each employee's salaries across the years. Let's see the result:

*# We display the total salary each employee received in all the years they worked for the company*

data.groupby(['Name'])['Salary'].sum()

Name

Alice 162000

Bob 150000

Charlie 177000

Name: Salary, dtype: int64
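Since `fake_company.csv` isn't shown here, the following sketch rebuilds the same three groupings on a small, made-up DataFrame. The `staff` name and every value in it are assumptions for illustration and are not the lesson's data.

```python
import pandas as pd

# Hypothetical data (not the contents of fake_company.csv)
staff = pd.DataFrame({
    'Year': [2000, 2000, 2001, 2001],
    'Name': ['Ann', 'Ben', 'Ann', 'Ben'],
    'Salary': [100, 200, 150, 250],
})

# Total salary paid in each year
total_per_year = staff.groupby('Year')['Salary'].sum()

# Average salary in each year
mean_per_year = staff.groupby('Year')['Salary'].mean()

# Total salary each person received across all years
total_per_name = staff.groupby('Name')['Salary'].sum()

print(total_per_year)
print(mean_per_year)
print(total_per_name)
```

Each call follows the same pattern: pick the column to group by, select the column to aggregate, then apply the aggregation.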