Udacity nanodegree project 1

The description from Kaggle (data source)

Since 2008, guests and hosts have used Airbnb to travel in a more unique, personalized way. As part of the Airbnb Inside initiative, this dataset describes the listing activity of homestays in Seattle, WA.

  • Listings, including full descriptions and average review score
  • Reviews, including unique id for each reviewer and detailed comments
  • Calendar, including listing id and the price and availability for that day

First Step: Data understanding. Try to define the questions.

Question 1: What’s the relation between position and price?

By searching, I found that Seattle's train station is King Street Station. I then used this website to look up its latitude and longitude. I'll treat King Street Station as the center, so with its coordinates I can calculate the distance between each Airbnb listing and the station.

# Define the variables and convert the 'price' column to float (it was object initially).
# The price strings contain a dollar sign and comma separators, so this block strips them first.
king_station_position = [47.598330, -122.311640]
latitude = []
longtitude = []
distance = []
listing['price'] = (listing['price'].replace(r'[\$,)]', '', regex=True).replace(r'[(]', '-', regex=True)).astype(float)
# We only use the price column here, and it has no NaN values, so no further cleaning is needed.
# Remove outliers: drop rows outside mean +/- 3 std
delete = []
for i in range(len(listing['price'])):
    if listing['price'].iloc[i] > (listing['price'].mean() + 3 * listing['price'].std()) or listing['price'].iloc[i] < (listing['price'].mean() - 3 * listing['price'].std()):
        delete.append(listing.index[i])
listing = listing.drop(delete)
# Upper bound: mean + 3 * std, i.e. 128 + 90 * 3 = 398
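The same mean +/- 3 std filter can also be written without an explicit loop. The block below is a minimal sketch on made-up data, not the notebook's actual code: the helper name `drop_price_outliers` and the toy prices are assumptions for illustration only.

```python
import pandas as pd


def drop_price_outliers(df: pd.DataFrame, column: str = "price", n_std: float = 3.0) -> pd.DataFrame:
    """Keep only rows whose `column` lies within mean +/- n_std * std."""
    mean = df[column].mean()
    std = df[column].std()
    within = df[column].between(mean - n_std * std, mean + n_std * std)
    return df[within]


# Toy data (hypothetical): twenty ordinary prices plus one extreme value
toy = pd.DataFrame({"price": [100.0] * 20 + [5000.0]})
filtered = drop_price_outliers(toy)
print(len(toy), "->", len(filtered))
```

Computing the mean and the standard deviation once, instead of inside every loop iteration, also makes the filter noticeably faster on the full listings table.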
import math

# Offsets (in degrees) between each Airbnb's position and the station
for index in range(len(listing)):
    latitude.append(abs(king_station_position[0] - listing['latitude'].iloc[index]))
    longtitude.append(abs(king_station_position[1] - listing['longitude'].iloc[index]))

# sqrt(latitude^2 + longtitude^2) = straight-line distance, still in degrees
for index in range(len(latitude)):
    distance.append(math.sqrt(latitude[index] ** 2 + longtitude[index] ** 2))
# visualizing
# price / accommodates means price per person
plt.scatter((listing['price'] / listing['accommodates']), distance, s=3)
plt.xlabel("Price per person")
plt.ylabel("Relative Distance")
plt.title("Relation between price and distance");
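The distance above is measured in degrees, which is fine for a relative comparison but slightly distorted: at Seattle's latitude a degree of longitude covers only about two thirds of the ground distance of a degree of latitude. If a distance in kilometers is wanted, the haversine formula is one common option. This is a sketch under assumptions: the function name `haversine_km`, the Earth radius constant, and the two toy points are not part of the original analysis.

```python
import numpy as np

EARTH_RADIUS_KM = 6371.0  # mean Earth radius, a common approximation


def haversine_km(lat1, lon1, lat2, lon2):
    """Great-circle distance in km between points given in decimal degrees."""
    lat1, lon1, lat2, lon2 = map(np.radians, (lat1, lon1, lat2, lon2))
    dlat = lat2 - lat1
    dlon = lon2 - lon1
    a = np.sin(dlat / 2) ** 2 + np.cos(lat1) * np.cos(lat2) * np.sin(dlon / 2) ** 2
    return 2 * EARTH_RADIUS_KM * np.arcsin(np.sqrt(a))


king_station_position = [47.598330, -122.311640]  # same coordinates as in the article

# Toy points (hypothetical): the station itself and a point 0.09 degrees further north
toy_lat = np.array([47.598330, 47.688330])
toy_lon = np.array([-122.311640, -122.311640])

distance_km = haversine_km(king_station_position[0], king_station_position[1], toy_lat, toy_lon)
print(distance_km)
```

On the real data, `toy_lat` and `toy_lon` would be replaced by `listing['latitude']` and `listing['longitude']`, which also removes the need for the two Python loops.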

Question 2: What is the relationship between the number of reviews and price? Are they positively correlated?

# the number of reviews per listing, sorted by listing_id. The result is a Series
amount_of_reviews = reviews.listing_id.value_counts().sort_index()
# sort the 'listing' by id, reset its index and drop the initial index
listing = listing.sort_values('id').reset_index(drop=True)
# Convert 'amount_of_reviews' to a DataFrame for merging:
# name the index 'id' and the counts 'review_amounts'
# (this works whether value_counts() names its result 'listing_id' or 'count')
amt_reviews_df = amount_of_reviews.rename_axis('id').reset_index(name='review_amounts')
# Merge 'amt_reviews_df' into 'listing' on 'id'.
# Use a left join so that Airbnbs without reviews are kept.
listing = listing.merge(amt_reviews_df, on='id', how='left')
listing.loc[:, ['id', 'accommodates', 'price', 'review_amounts']]
# visualizing
# price / accommodates is used as an approximate price per person
plt.figure(figsize=(20, 10))
plt.scatter(x=(listing['price'] / listing['accommodates']),
            y=listing['review_amounts'])
plt.xlabel("Price per person")
plt.ylabel("The amount of reviews")
plt.title("Relation between price and reviews");
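A scatter plot alone does not say whether the correlation is positive or negative, so it can help to compute a correlation coefficient as well. The sketch below uses a small made-up table with the same column names as the merged `listing` DataFrame. Treating listings without reviews as having zero reviews is an assumption, not something the original analysis states.

```python
import pandas as pd

# Toy data (hypothetical) with the same column names as the merged 'listing' table
toy = pd.DataFrame({
    "price": [100.0, 200.0, 90.0, 300.0, 120.0],
    "accommodates": [2, 2, 3, 2, 4],
    "review_amounts": [40.0, 10.0, 55.0, None, 30.0],  # None: no reviews after the left join
})

price_per_person = toy["price"] / toy["accommodates"]
reviews_filled = toy["review_amounts"].fillna(0)  # assumption: missing means zero reviews

# Pearson measures linear association
pearson = price_per_person.corr(reviews_filled)
# Spearman is Pearson on ranks, so it is less sensitive to extreme prices
spearman = price_per_person.rank().corr(reviews_filled.rank())

print(f"Pearson: {pearson:.2f}, Spearman: {spearman:.2f}")
```

A coefficient above zero would support a positive correlation and one below zero a negative one. Values near zero suggest the two quantities are only weakly related.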

Question 3: Is there a busy season? If so, is it more expensive than usual?

  1. Convert the ‘available’ column’s t/f values to 0/1
  2. Calculate the daily mean of that column across all Airbnbs
  3. Visualize it
# Convert the 'available' column's t/f to 0/1.
# With this mapping, 1 means the listing is not available (booked or blocked) on that day.
available_mapping = {'t': 0, 'f': 1}
calendar['available'] = calendar['available'].map(available_mapping)
# Change the 'date' column from object to datetime.
# Then take the daily mean of 'available'. With the mapping above it is the
# share of listings that are unavailable, so higher values mean busier days.
calendar['date'] = pd.to_datetime(calendar['date'])
availability = calendar.set_index('date')['available'].groupby(pd.Grouper(freq='D')).mean()
availability = availability.reset_index()
# Then we plot
plt.figure(figsize=(20, 10))
plt.plot(availability['date'], availability['available'])
# Convert the price to float. Missing prices are filled with '$0.00' first.
calendar.price = calendar.price.fillna('$0.00')
calendar.price = calendar['price'].replace(r'[\$,)]', '', regex=True).replace(r'[(]', '-', regex=True).astype(float)
# Note: .sum() gives the daily total across all listings, despite the 'price_avg' name
price_avg = calendar.set_index('date').groupby(pd.Grouper(freq='D')).sum()
price_avg = price_avg.reset_index()
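To compare prices between busy and quiet days, one option is the daily mean price over the listings that actually have a price on that day, leaving missing prices as NaN instead of filling them with zero. This is a sketch on made-up rows that only mimic the calendar columns. The names `toy_calendar` and `daily_mean_price` are hypothetical, and this is not necessarily how the original analysis continues.

```python
import pandas as pd

# Toy rows (hypothetical) shaped like the calendar table
toy_calendar = pd.DataFrame({
    "listing_id": [1, 2, 1, 2],
    "date": ["2016-01-04", "2016-01-04", "2016-01-05", "2016-01-05"],
    "available": ["t", "t", "t", "f"],
    "price": ["$100.00", "$1,200.00", "$80.00", None],  # no price when unavailable
})

toy_calendar["date"] = pd.to_datetime(toy_calendar["date"])
# Strip the dollar sign and comma separators; missing prices stay NaN
toy_calendar["price"] = (
    toy_calendar["price"].str.replace(r"[$,]", "", regex=True).astype(float)
)

# mean() skips NaN, so only listings with a price contribute to each day's average
daily_mean_price = toy_calendar.groupby("date")["price"].mean()
print(daily_mean_price)
```

Because unavailable days are skipped instead of being counted as $0, the average is not pulled down on days when many listings are booked.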



Joe Chao
