On Predicting NYC Airbnb Prices

Tori Armstrong
Jan 7, 2021


Using linear regression to model and predict the price-per-night of Airbnb units in Manhattan and Brooklyn. The repository for this project can be found here.

Using historical Airbnb data and sub-borough information, I attempted to predict the price of short-term rentals in New York City. The models were trained on ~40,000 observations across 17 features. This analysis offers insights into the NYC Airbnb market, as well as an interesting look at feature importance across different regression models.

A Quick Baseline

I wanted to see how a simple linear regression model would perform without any feature engineering, so I temporarily dropped the null values and three high-cardinality columns.
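As a rough sketch of that cleanup (the Kaggle filename and the three column names below are assumptions on my part, not taken from the repository):

import pandas as pd

# load the raw Kaggle listings (filename assumed)
df = pd.read_csv('AB_NYC_2019.csv')

# temporarily drop nulls and three high-cardinality columns (placeholder names)
baseline_df = df.dropna().drop(columns=['name', 'host_name', 'last_review'])

The baseline pipeline itself was simple: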

from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.linear_model import LinearRegression
from category_encoders import OrdinalEncoder  # assuming category_encoders' encoder

dirtymodel = Pipeline(steps=[
    ('enc', OrdinalEncoder()),         # ordinal-encode the categorical columns
    ('scaler', StandardScaler()),      # standardize the features
    ('regressor', LinearRegression())
])
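To produce the numbers below, the pipeline would be fit on a training split and scored on a held-out validation split. A minimal sketch, assuming a 'price' target column and split names of my own choosing:

from sklearn.model_selection import train_test_split
from sklearn.metrics import mean_squared_error, mean_absolute_error

# split off a validation set ('price' target and split names are assumptions)
X = baseline_df.drop(columns='price')
y = baseline_df['price']
X_train, X_val, y_train, y_val = train_test_split(X, y, random_state=42)

dirtymodel.fit(X_train, y_train)
print('training score:', dirtymodel.score(X_train, y_train))    # R^2 on the training split
print('validation score:', dirtymodel.score(X_val, y_val))      # R^2 on the validation split
y_pred = dirtymodel.predict(X_val)
print('mean squared error:', mean_squared_error(y_val, y_pred))
print('mean absolute error:', mean_absolute_error(y_val, y_pred))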

The results weren’t looking very good.

training score: 0.1023793737284564
validation score: 0.11722447870494423
mean squared error: 31602.386988554113
mean absolute error: 63.86391352413782

The predictions looked even worse.
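One quick way to see that is a predicted-vs-actual scatter on the validation split; a minimal sketch, assuming the fitted baseline and the splits from above:

import matplotlib.pyplot as plt

# predicted vs. actual nightly price on the validation split
plt.scatter(y_val, y_pred, alpha=0.3)
plt.plot([0, y_val.max()], [0, y_val.max()], color='red')  # perfect-prediction line
plt.xlabel('actual price per night')
plt.ylabel('predicted price per night')
plt.show()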

Feature Engineering

The main dataset for my model, containing the Airbnb listing information, was sourced from Kaggle. It had both continuous and discrete variables, totaling 16 columns. I was particularly intrigued that it contained borough and community district information, which let me bring in outside data to assist the models.

It was apparent that Manhattan and Brooklyn dominated the unit count, and there were fewer than 1,000 shared rooms in the dataset, so I decided to focus only on those two boroughs and dropped the shared rooms.
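A minimal sketch of that filtering, assuming the Kaggle column names neighbourhood_group and room_type:

# keep Manhattan and Brooklyn only, and drop shared rooms
df = df[df['neighbourhood_group'].isin(['Manhattan', 'Brooklyn'])]
df = df[df['room_type'] != 'Shared room']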

I felt the dataset was lacking in descriptions of the rental units themselves, such as bed count, amenities, and the number of guests allowed. To compensate, I found city data on the NYC government website for each of the 79 sub-boroughs within Manhattan and Brooklyn and chose three new features: the percentage of residents within 1/2 mile of a subway station, the total number of housing units, and the income diversity ratio (80th percentile income / 20th percentile income). I thought the subway column would be particularly useful, since tourists probably want to be close to public transportation.
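A sketch of how those features could be joined onto the listings, assuming the city data has been collected into a hypothetical DataFrame called subborough_df keyed on a shared sub-borough column (all names below are placeholders):

# subborough_df: one row per sub-borough with the three new features,
# e.g. ['sub_borough', 'pct_near_subway', 'housing_units', 'income_diversity_ratio']
df = df.merge(subborough_df, on='sub_borough', how='left')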

The Models

1. Linear Regression
linreg = Pipeline(steps=[
    ('enc', OrdinalEncoder()),
    ('scaler', StandardScaler()),
    ('regressor', LinearRegression())
])
linreg training score: 0.09452500898049476
linreg val score: 0.1304066533875079
linreg MSE: 40253.50872955769
linreg MAE: 76.55144226573168

2. Linear Regression with Polynomial Features

from sklearn.preprocessing import PolynomialFeatures

linregpoly = Pipeline(steps=[
    ('enc', OrdinalEncoder()),
    ('scaler', StandardScaler()),
    ('poly', PolynomialFeatures(2)),
    ('regressor', LinearRegression(n_jobs=-1))
])
linregpoly training score: 0.1241331172654272
linregpoly val score: 0.1328469607735322
linregpoly MSE: 40140.54680885212
linregpoly MAE: 74.67645655822994

3. Ridge Regression

from sklearn.linear_model import Ridge

ridge = Pipeline(steps=[
    ('enc', OrdinalEncoder()),
    ('scaler', StandardScaler()),
    ('regressor', Ridge(alpha=2.5, max_iter=500))
])
ridge training score: 0.0945250083011675
ridge val score: 0.1304074577643093
ridge MSE: 40253.471494925296
ridge MAE: 76.55008198205667

4. Ridge Regression with Polynomial Features

ridgepoly = Pipeline(steps=[
    ('enc', OrdinalEncoder()),
    ('scaler', StandardScaler()),
    ('poly', PolynomialFeatures(2)),
    ('regressor', Ridge(alpha=2.5))
])
ridgepoly training score: 0.12406018948652564
ridgepoly val score: 0.1327247324994002
ridgepoly MSE: 40146.20476026002
ridgepoly MAE: 74.66177496969786

5. Random Forest Regressor

from sklearn.ensemble import RandomForestRegressor

rf = Pipeline(steps=[
    ('enc', OrdinalEncoder()),
    ('scaler', StandardScaler()),
    ('regressor', RandomForestRegressor(random_state=12, n_jobs=-1,
                                        max_depth=75,
                                        n_estimators=150,
                                        min_samples_leaf=4))
])
rf training score: 0.5086288173081757
rf val score: 0.18808424825491832
rf MSE: 37583.61069326349
rf MAE: 67.65970099628112

6. XGB Regressor

from xgboost import XGBRegressor

xgb = Pipeline(steps=[
    ('enc', OrdinalEncoder()),
    ('scaler', StandardScaler()),
    ('regressor', XGBRegressor(random_state=73, n_jobs=-1,
                               subsample=.75, eta=.02,
                               max_depth=5, n_estimators=500))
])
xgb training score: 0.5723574921272706
xgb val score: 0.20252812015270927
xgb MSE: 36915.0033197236
xgb MAE: 67.5689748729195

Insights

The tree-based models returned the best scores, with the XGB regressor performing slightly better. Interestingly, the baseline model still had the lowest mean squared error and mean absolute error. After taking a closer look at the tree-based models, I was surprised that their feature importances were quite different.
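A minimal sketch of how those importances can be pulled out of the fitted pipelines, assuming both were fit on the engineered training split and that the encoder and scaler preserve column order:

# line each model's importances up against the original feature names
feature_names = X_train.columns
rf_imp = pd.Series(rf.named_steps['regressor'].feature_importances_,
                   index=feature_names).sort_values(ascending=False)
xgb_imp = pd.Series(xgb.named_steps['regressor'].feature_importances_,
                    index=feature_names).sort_values(ascending=False)
print(rf_imp.head(10))
print(xgb_imp.head(10))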

Looking at the f_regression scores on the training data, the most important features change yet again.
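A sketch of that comparison, running f_regression on the encoded training data (reusing the already-fitted encoder from one of the pipelines):

from sklearn.feature_selection import f_regression

# f_regression needs numeric input, so score the encoded training features
X_train_enc = rf.named_steps['enc'].transform(X_train)
f_scores, _ = f_regression(X_train_enc, y_train)
print(pd.Series(f_scores, index=X_train.columns).sort_values(ascending=False))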

Though I wasn’t able to build a perfect model, I gained more experience with regression and with new types of models.
