This article is about Project 1 of my Udacity Nanodegree.
The description from Kaggle (the data source):
Since 2008, guests and hosts have used Airbnb to travel in a more unique, personalized way. As part of the Airbnb Inside initiative, this dataset describes the listing activity of homestays in Seattle, WA.
Content: the following Airbnb activity is included in this Seattle dataset:
- Listings, including full descriptions and average review score
- Reviews, including unique id for each reviewer and detailed comments
- Calendar, including listing id and the price and availability for that day
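Before starting, here is a minimal loading sketch. The file names below are assumptions based on the standard Kaggle download, not something stated in the original write-up:
# Load the three Kaggle CSVs (file names assumed)
import pandas as pd
listing = pd.read_csv('listings.csv')
reviews = pd.read_csv('reviews.csv')
calendar = pd.read_csv('calendar.csv')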
First step: data understanding. The goal is to define the questions.
Question 1: What’s the relation between position and price?
By searching, I found that Seattle's train station is King Street Station, and I used a website to look up its latitude and longitude. I'll set King Street Station as the center; with its coordinates, I can calculate the distance between each Airbnb listing's position and the station.
This is essentially the Euclidean distance: distance = sqrt((Δlat)^2 + (Δlon)^2), where Δlat and Δlon are the differences in latitude and longitude between the listing and the station.
Using this idea, I can calculate a relative distance (not the absolute distance in kilometers, though we could convert it if we wanted to).
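For illustration, here is a rough sketch of such a conversion. The approx_km helper and the 111 km-per-degree factor are my own additions (a standard approximation), not part of the original analysis:
# Rough conversion of degree differences to kilometers (approximation)
import math
def approx_km(lat1, lon1, lat2, lon2):
    # ~111 km per degree of latitude; longitude degrees shrink by cos(latitude)
    dlat_km = (lat1 - lat2) * 111.0
    dlon_km = (lon1 - lon2) * 111.0 * math.cos(math.radians(lat1))
    return math.sqrt(dlat_km ** 2 + dlon_km ** 2)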
The code is as follows:
# This step defines variables and converts the 'price' column to float (it was object initially).
# The price strings carry a dollar sign and commas, which are stripped here.
import math
import matplotlib.pyplot as plt

king_station_position = [47.598330, -122.311640]
latitude = []
longitude = []
distance = []
listing['price'] = (listing['price'].replace(r'[\$,)]', '', regex=True)
                                    .replace(r'[(]', '-', regex=True)).astype(float)
Then we use the describe function to obtain summary statistics. From the output, I found that some values are outliers, defined here as anything outside the mean plus or minus 3 standard deviations.
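For concreteness, the call is simply:
listing['price'].describe()  # count, mean, std, min, quartiles, max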
# We're using the 'price' column and it has no NaN values, so no further cleaning is needed.
# Remove outliers outside mean +/- 3 std
price_mean = listing['price'].mean()
price_std = listing['price'].std()
outlier = ((listing['price'] > price_mean + 3 * price_std) |
           (listing['price'] < price_mean - 3 * price_std))
listing = listing[~outlier]
After data cleaning, the statistics are as follows: the mean in figure (1) is almost 128, and the std is about 90.
128 + 90 * 3 = 398
Since no remaining price exceeds this cutoff, the code did remove the outliers from the data.
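A quick sanity check (my addition, not in the original notebook):
listing['price'].max()  # should now be below the ~398 cutoff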
Afterward, we calculate the distance:
# The distance between each Airbnb listing's position and the station
for index in range(len(listing)):
    latitude.append(abs(king_station_position[0] - listing['latitude'].iloc[index]))
    longitude.append(abs(king_station_position[1] - listing['longitude'].iloc[index]))
# sqrt(latitude^2 + longitude^2) = straight-line distance
for index in range(len(latitude)):
    distance.append(math.sqrt(latitude[index] ** 2 + longitude[index] ** 2))
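As a side note, the same result can be obtained without explicit loops; here is a vectorized sketch with numpy, assuming the columns used above:
# Vectorized equivalent of the two loops above
import numpy as np
distance = np.sqrt((listing['latitude'] - king_station_position[0]) ** 2 +
                   (listing['longitude'] - king_station_position[1]) ** 2)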
Then we visualize it:
# visualizing
# price / accommodates gives the price per person
plt.scatter((listing['price'] / listing['accommodates']), distance, s=3)
plt.xlabel("Price per person")
plt.ylabel("Relative Distance")
plt.title("Relation between price and distance");
As you can see, the y-axis is the relative distance calculated from latitude and longitude. The conclusion is that distance and price have no positive correlation, which goes against our intuition: being closer to the station doesn't mean being more expensive.
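To put a number on "no positive correlation", one could compute the Pearson correlation between price per person and distance. This check is my addition, not part of the original post:
# Pearson correlation between price per person and distance
price_per_person = listing['price'] / listing['accommodates']
print(price_per_person.corr(pd.Series(distance, index=listing.index)))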
Question 2: What about the relationship between the number of reviews and the price? Are they positively correlated?
# The number of reviews per listing, sorted by listing_id (this is a Series)
amount_of_reviews = reviews.listing_id.value_counts().sort_index()
amount_of_reviews.head()
# sort the 'listing' by id, reset its index and drop the initial index
listing = listing.sort_values('id').reset_index(drop=True)
listing.head()
# Convert 'amount_of_reviews' to dataframe for merging, and reset its index
# After resetting its index, rename its columns
amt_reviews_df = amount_of_reviews.to_frame().reset_index()
amt_reviews_df = amt_reviews_df.rename(columns={'index': 'id', 'listing_id': 'review_amounts'})
amt_reviews_df.head()
# merge 'amt_reviews_df' and 'listing' on 'id'
# Use a left join so listings without any reviews are kept
listing = listing.merge(amt_reviews_df, on='id', how='left')
listing.loc[:, ['id', 'accommodates', 'price', 'review_amounts']]
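One caveat of the left join: listings without any reviews end up with NaN in review_amounts. A possible cleanup, which the original notebook does not do, is to treat those as zero reviews:
# Listings with no reviews get NaN from the left join; treat them as 0
listing['review_amounts'] = listing['review_amounts'].fillna(0)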
# visualizing
# price / accommodates gives the price per person
plt.figure(figsize=(20, 10))
plt.scatter(x=(listing['price'] / listing['accommodates']),
y=listing['review_amounts'],
s=15)
plt.xlabel("Price")
plt.ylabel("The amount of reviews")
plt.title("Relation between price and reviews");
As we can see, the lower the price, the more reviews a listing tends to have. There is a relationship, but not a linear one; I suspect some confounding variables sit between the two. Because a large number of reviews doesn't indicate an outlier, I don't remove any data here.
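Since the relation looks monotonic rather than linear, a rank-based measure such as Spearman correlation would be a natural follow-up. This sketch is my addition:
# Spearman rank correlation captures monotonic, non-linear relations
price_per_person = listing['price'] / listing['accommodates']
print(price_per_person.corr(listing['review_amounts'], method='spearman'))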
Question 3: Is there a busy season? If so, is it more expensive than usual?
Here we observe that the price column in the calendar data is of object type and has NaN values, so we're going to remove the dollar sign and commas and fill the NaN values with 0.
For analyzing the busy season, I’m about to do the following:
- convert the ‘available’ columns’ t/f to 0/1
- calculate each day's overall availability by averaging over all listings
- visualizing it
# Convert the 'available' column's t/f to 0/1.
# Note: 't' (available) maps to 0 and 'f' (booked) to 1, so a higher value means more bookings.
available_mapping = {'t': 0, 'f': 1}
calendar['available'] = calendar['available'].map(available_mapping)
calendar
# Here we change the 'date' column from object to datetime.
# We also take the daily mean of 'available'; given the mapping above, this is the share of booked listings per day.
calendar['date'] = pd.to_datetime(calendar['date'])
availability = calendar.set_index('date').groupby(pd.Grouper(freq='d'))['available'].mean()
availability = availability.reset_index()
availability
# Then we plot
plt.figure(figsize=(20, 10))
plt.plot(availability['date'], availability['available'])
plt.show()
As we can observe, January and July are the busy seasons (the curve peaks where the share of booked listings is highest). Now let's analyze the monthly price.
# Here we're going to convert the price from a string like '$85.00' to a float
calendar.price = calendar.price.fillna('$0.00')
calendar.price = calendar['price'].replace(r'[\$,)]', '', regex=True).replace(r'[(]', '-', regex=True).astype(float)
calendar.price
# Daily total of prices across all listings (a sum, not an average, despite the variable name)
price_avg = calendar.set_index('date').groupby(pd.Grouper(freq='d'))['price'].sum()
price_avg = price_avg.reset_index()
price_avg
(I'm not sure whether my analysis is right or wrong.) As we can see, January has the lowest price, which may explain why January is a busy season.
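One way to sanity-check this claim would be a true monthly average that ignores the $0.00 placeholders we filled in for unavailable days. This sketch is my addition, not part of the original analysis:
# Monthly mean price, excluding the $0.00 placeholders for unavailable days
monthly_avg = (calendar[calendar['price'] > 0]
               .set_index('date')
               .groupby(pd.Grouper(freq='M'))['price']
               .mean())
print(monthly_avg)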