Udacity nanodegree project 1

This article walks through my first Udacity Nanodegree project.

The Description from Kaggle (Data Source)

Since 2008, guests and hosts have used Airbnb to travel in a more unique, personalized way. As part of the Airbnb Inside initiative, this dataset describes the listing activity of homestays in Seattle, WA.

Content: The following Airbnb activity is included in this Seattle dataset:

  • Listings, including full descriptions and average review score
  • Reviews, including unique id for each reviewer and detailed comments
  • Calendar, including listing id and the price and availability for that day

First Step: Data understanding. Try to define the questions.

Q1: What is the relationship between location and price?

By searching, I found that Seattle's train station is King Street Station. I then used this website to look up its latitude and longitude. I'll use King Street Station as the center: with its coordinates, I can calculate the distance between each Airbnb listing and the station.

This is like the Euclidean distance between two points (x1, y1) and (x2, y2):

d = sqrt((x1 - x2)^2 + (y1 - y2)^2)

Using the same idea, I can calculate a relative distance from latitude and longitude. It is not an absolute distance, but it can be converted to one if needed.

The code is as follows:
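As a minimal sketch of this idea (not necessarily the exact code used in the project), the relative distance can be computed as below. The station coordinates are approximate values assumed for illustration, and `relative_distance` is a hypothetical helper name.

```python
import math

# Approximate coordinates of King Street Station (assumed for illustration).
STATION_LAT = 47.5984
STATION_LON = -122.3302


def relative_distance(lat, lon, center_lat=STATION_LAT, center_lon=STATION_LON):
    """Euclidean distance, in degrees, between a point and the center.

    Latitude and longitude are treated as flat x/y coordinates, so the
    result is a relative measure, not kilometers or miles.
    """
    return math.hypot(lat - center_lat, lon - center_lon)
```

Because the result is in degrees, it is only meaningful for comparing listings with each other. Converting it to an absolute distance would need a geographic formula such as the haversine formula.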

Then we use the describe function to obtain summary statistics. From them, I found that some values are outliers, meaning they fall outside the mean plus or minus 3 standard deviations.
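The 3-standard-deviation rule can be sketched with pandas as follows. This is an illustration under assumptions: the column is assumed to be a numeric `price` column, and `remove_outliers` is a hypothetical helper, not a pandas function.

```python
import pandas as pd


def remove_outliers(df, column="price", n_std=3.0):
    """Keep only rows whose value lies within mean +/- n_std * std."""
    mean = df[column].mean()
    std = df[column].std()  # sample standard deviation (ddof=1)
    lower = mean - n_std * std
    upper = mean + n_std * std
    # between() is inclusive on both ends by default.
    return df[df[column].between(lower, upper)]


# Tiny made-up example: nineteen ordinary prices and one extreme value.
example = pd.DataFrame({"price": [100.0] * 19 + [1000.0]})
print(example["price"].describe())
cleaned = remove_outliers(example)
```

Calling `describe()` again on the cleaned data is a quick way to check that the extreme values are gone.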

After data cleaning, the statistics are:

The mean in figure (1) is almost 128, and the standard deviation is about 90.

The code did remove the outliers from the data.

Next, we calculate the distance:
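One way to do this for every listing at once is a vectorized pandas computation. The sketch below assumes the listings table has `latitude` and `longitude` columns and uses approximate station coordinates; `add_relative_distance` is a hypothetical name chosen for illustration.

```python
import pandas as pd

# Approximate coordinates of King Street Station (assumed for illustration).
STATION_LAT = 47.5984
STATION_LON = -122.3302


def add_relative_distance(listings, center_lat=STATION_LAT, center_lon=STATION_LON):
    """Return a copy of listings with a 'relative_distance' column (in degrees)."""
    result = listings.copy()
    d_lat = result["latitude"] - center_lat
    d_lon = result["longitude"] - center_lon
    result["relative_distance"] = (d_lat ** 2 + d_lon ** 2) ** 0.5
    return result
```

Working on a copy keeps the original DataFrame unchanged, which makes it easier to rerun the cleaning steps independently.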

Then we visualize it:

As you can see, the y-axis is the relative distance, calculated from latitude and longitude. The conclusion is that proximity to the station and price are not positively correlated. This goes against our intuition: being closer to the station does not mean being more expensive.

Q2: What is the relationship between the number of reviews and price? Are they positively correlated?

As we can see, listings with lower prices tend to have more reviews. The two variables are related, but not linearly. I suspect there is some hidden variable between them that I have not accounted for. Because a large number of reviews does not make a listing an outlier, I did not remove any data.
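A relationship that is monotonic but not linear can be checked by comparing Pearson correlation with a rank-based (Spearman) correlation. The sketch below is only an illustration: the column names `number_of_reviews` and `price` are assumptions about the listings table, and the Spearman value is computed as the Pearson correlation of the ranks.

```python
import pandas as pd


def review_price_correlations(listings, reviews_col="number_of_reviews", price_col="price"):
    """Return Pearson and Spearman correlations between reviews and price."""
    reviews = listings[reviews_col]
    price = listings[price_col]
    pearson = reviews.corr(price)  # measures linear association
    # Spearman correlation equals the Pearson correlation of the ranks.
    spearman = reviews.rank().corr(price.rank())
    return {"pearson": pearson, "spearman": spearman}
```

If the Spearman value is clearly stronger than the Pearson value, that is consistent with a relationship that exists but is not linear.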

Q3: Is there a busy season? If so, is it more expensive than usual?

Here we observe that the price column has the object dtype and contains NaN values, so we remove the dollar sign and decimal point and fill the NaN values with 0.
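A minimal sketch of this kind of cleaning is shown below. It assumes prices are stored as strings such as "$85.00" or "$1,250.00"; it strips the dollar sign and thousands separators and parses the rest as a number, which may differ in detail from the cleaning used in the project. `clean_price` is a hypothetical helper name.

```python
import pandas as pd


def clean_price(series):
    """Convert price strings like '$1,250.00' to floats, filling NaN with 0."""
    stripped = series.str.replace(r"[$,]", "", regex=True)
    numeric = pd.to_numeric(stripped, errors="coerce")
    return numeric.fillna(0.0)


raw = pd.Series(["$85.00", "$1,250.00", None])
prices = clean_price(raw)
```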

To analyze the busy season, I do the following:

  1. Convert the ‘available’ column’s t/f values to 0/1
  2. Calculate the total availability for each day by summing over all Airbnb listings
  3. Visualize it
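The first two steps can be sketched as follows, assuming the calendar table has `date` and `available` columns with "t"/"f" values. `daily_availability` is a hypothetical name used only for this illustration.

```python
import pandas as pd


def daily_availability(calendar):
    """Count how many listings are available on each date."""
    # Step 1: map 't'/'f' flags to 1/0.
    available = calendar["available"].map({"t": 1, "f": 0})
    # Step 2: sum the flags over all listings for each day.
    dates = pd.to_datetime(calendar["date"])
    return available.groupby(dates).sum()
```

The result is a Series indexed by date, so for the third step it can be drawn as a line chart, for example with its `plot()` method when matplotlib is installed.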

As we can observe, January and July are the busy seasons. Next, we analyze the monthly price.
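A monthly average can be sketched with a group-by on the month of each date. This assumes the calendar table has a `date` column and an already numeric `price` column. The `exclude_zero` parameter is a hypothetical option added for illustration: prices that were filled with 0 lower a month's mean, so it can be useful to compare both versions.

```python
import pandas as pd


def monthly_mean_price(calendar, exclude_zero=False):
    """Mean price per calendar month (1 = January, ..., 12 = December)."""
    data = calendar
    if exclude_zero:
        # Drop rows whose price is 0, e.g. missing prices filled with 0.
        data = calendar[calendar["price"] > 0]
    month = pd.to_datetime(data["date"]).dt.month
    return data["price"].groupby(month).mean()
```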

(I’m not sure whether my analysis is right.) As we can see, January has the lowest price, which may explain why January is a busy season.