AIR: K-Nearest Neighbors

Sebastian Cajamarca
October 26, 2024 • 5 min read

Using data science and K-Nearest Neighbors to optimize Airbnb investment decisions in Medellín, Colombia.

K-Nearest Neighbors visualization for Airbnb locations in Medellín

As a Data Scientist, my daily life revolves around bytes, insights, algorithms, and other digital pursuits. Despite this, I've always been drawn to invest in something more tangible. For quite some time, I had my sights set on investing in an Airbnb property in Medellín, widely recognized as one of the most innovative cities globally (source: BBC News).

After much research and anticipation, last year we discovered a fantastic investment opportunity in Laureles, a neighborhood recently named one of the coolest in the world (source: Time Out).

Leveraging my fiancée's expertise in civil engineering and home décor, we transformed the property into an extraordinary space. It's now beautifully designed to comfortably accommodate up to five guests, ready to welcome any type of traveler.

The Investment Challenge

Before renting out our property, we needed to make sure our investment would be profitable, especially with the current high interest rates. We wanted to see if this was a smart move given the opportunity cost.

To do this, I tapped into my data science skills to collect Airbnb listing data. I focused on key details like location, number of guests, price per night, ratings, and reviews. Using this data, I aimed to estimate a good nightly price that would keep occupancy high and ensure profitability.

This guide explains my process. I hope it helps other data-driven entrepreneurs considering similar investments. You can also experiment with the data using this tool: Air_knn.

Gathering the Data

Scraping data from Airbnb is no simple task, especially when you're aiming to gather precise geolocated information. I discovered that by supplying two coordinate points to the Airbnb search URL, you can retrieve listings within that specific rectangular area. By providing the northeast and southwest points like this:

ne_lat=6.2568&ne_lng=-75.5645&sw_lat=6.2123&sw_lng=-75.6012

Airbnb generates a map view where each listing is accompanied by its exact coordinates. This method ensures you know the precise location of each listing and can accurately capture their coordinates.
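
As a minimal sketch, the bounding-box parameters above can be assembled with Python's standard library. Only the four parameter names come from the example above; the helper name `bbox_query` is hypothetical, and the rest of the search URL is deliberately left out.

```python
from urllib.parse import urlencode


def bbox_query(ne_lat, ne_lng, sw_lat, sw_lng):
    """Build the bounding-box part of a search query string (hypothetical helper)."""
    params = {
        "ne_lat": ne_lat,  # northeast corner latitude
        "ne_lng": ne_lng,  # northeast corner longitude
        "sw_lat": sw_lat,  # southwest corner latitude
        "sw_lng": sw_lng,  # southwest corner longitude
    }
    return urlencode(params)


# The coordinates from the example above
query = bbox_query(6.2568, -75.5645, 6.2123, -75.6012)
print(query)
```

The resulting string would be appended to whatever search URL is being requested.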

Thanks to Uber's H3 library, I was able to divide Medellín's main neighborhoods (Laureles and Poblado) into hexagonal cells.
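
The idea of splitting an area into small search regions can be illustrated without any dependencies. The sketch below is not H3: it swaps hexagons for a plain rectangular grid so that each tile maps directly onto a northeast/southwest pair. The function name `split_bbox` and the 2 x 3 grid size are assumptions for illustration only.

```python
def split_bbox(ne_lat, ne_lng, sw_lat, sw_lng, rows, cols):
    """Split a bounding box into rows x cols tiles.

    Each tile is returned as (ne_lat, ne_lng, sw_lat, sw_lng).
    Rectangular stand-in for H3 cells, not the H3 algorithm itself.
    """
    lat_step = (ne_lat - sw_lat) / rows
    lng_step = (ne_lng - sw_lng) / cols
    tiles = []
    for r in range(rows):
        for c in range(cols):
            tile_sw_lat = sw_lat + r * lat_step
            tile_sw_lng = sw_lng + c * lng_step
            tiles.append(
                (tile_sw_lat + lat_step, tile_sw_lng + lng_step, tile_sw_lat, tile_sw_lng)
            )
    return tiles


# Split the example rectangle into 6 smaller search areas
tiles = split_bbox(6.2568, -75.5645, 6.2123, -75.6012, rows=2, cols=3)
for tile in tiles:
    print(tile)
```

Each tile can then be searched on its own, which keeps every request focused on a small area.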

This approach let me search for listings efficiently and extract their coordinates and features. In the end, I gathered data from 2,380 listings, complete with all relevant characteristics. Now, it's time to move on to creating the estimation model and suggesting optimal prices.

Pricing Model

With all this data in hand, my next step was to develop a generalized model to explore the relationship between occupancy and price per night. The aim was to estimate the optimal price that would maximize revenue by balancing price and occupancy.

Airbnb listings come with a wide variety of amenities, but for this initial model, I chose to focus on a few key features to clearly understand the relationship between price and occupancy. Therefore, the model considered factors like the number of beds, guests, baths, location, and price.

As I ran the first model, I noticed collinearity between the price per night and the number of guests. To address this, I opted to use the price per guest per night instead.
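
To make the transformation concrete, here is a small sketch with made-up listings (not the scraped data). It derives the price per guest and compares how strongly each price variable correlates with the number of guests. The `pearson` helper and the field names are assumptions for illustration.

```python
from math import sqrt


def pearson(xs, ys):
    """Pearson correlation coefficient between two equal-length sequences."""
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    return cov / sqrt(var_x * var_y)


# Made-up listings, for illustration only
listings = [
    {"guests": 2, "price_per_night": 60.0},
    {"guests": 4, "price_per_night": 110.0},
    {"guests": 5, "price_per_night": 150.0},
    {"guests": 6, "price_per_night": 160.0},
]

# Replace the nightly price with a per-guest price
for listing in listings:
    listing["price_per_guest"] = listing["price_per_night"] / listing["guests"]

guests = [l["guests"] for l in listings]
r_raw = pearson(guests, [l["price_per_night"] for l in listings])
r_per_guest = pearson(guests, [l["price_per_guest"] for l in listings])
print(f"guests vs nightly price:   {r_raw:.2f}")
print(f"guests vs price per guest: {r_per_guest:.2f}")
```

In this toy sample the correlation with guest count weakens once the price is expressed per guest; real data may behave differently, so it is worth re-checking after the transformation.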

Beta Regression

The first model I tested was a beta regression, which is similar to linear regression but constrains the outcome (Y) to lie between 0 and 1, a natural fit for an occupancy rate. In the code below it is approximated by a GLM with a Binomial family and a logit link. Evaluated with 5-fold cross-validation, this straightforward model achieved a Mean Absolute Percentage Error (MAPE) below 23%; note that the loop shown below records the mean absolute error (MAE) for each fold. While not perfect, this level of accuracy is sufficient to begin testing some hypotheses.

import numpy as np
from sklearn.metrics import mean_absolute_error
from sklearn.model_selection import KFold
from statsmodels.formula.api import glm
from statsmodels.genmod.families import Binomial

# test_df is assumed to be a pandas DataFrame of listings built in the earlier steps

# Step 3: Set up K-Fold cross-validation
kf = KFold(n_splits=5, shuffle=True, random_state=42)
mae_scores = []
test_df = test_df[['occupancy', 'price_per_guest', 'number_of_beds', 'number_of_bathrooms']]

# Step 4: Perform manual cross-validation
for train_index, test_index in kf.split(test_df):
    # Split data
    train_data, test_data = test_df.iloc[train_index], test_df.iloc[test_index]

    # Fit a Binomial GLM with logit link as a stand-in for Beta regression.
    # Fit on the training fold only, so the held-out fold stays unseen.
    model = glm(formula='occupancy ~ price_per_guest + number_of_beds + number_of_bathrooms',
                data=train_data,
                family=Binomial()).fit()

    # Predict on test data
    predictions = model.predict(test_data)

    # Clip predictions to keep them strictly inside (0, 1)
    predictions = np.clip(predictions, 1e-5, 1 - 1e-5)

    # Step 5: Calculate MAE for this fold
    mae = mean_absolute_error(test_data['occupancy'], predictions)
    mae_scores.append(mae)

print(f"Mean MAE across 5 folds: {np.mean(mae_scores)}")
print(f"Standard Deviation of MAE: {np.std(mae_scores)}")
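
The text reports MAPE, while the loop above records MAE. For reference, a per-fold MAPE could be computed along these lines. This is a minimal sketch: the `mape` helper, the sample numbers, and the choice to skip listings with zero occupancy (to avoid dividing by zero) are all assumptions, not part of the original analysis.

```python
def mape(actual, predicted):
    """Mean absolute percentage error, in percent.

    Pairs whose actual value is 0 are skipped (an assumption made here),
    since the percentage error is undefined for them.
    """
    pairs = [(a, p) for a, p in zip(actual, predicted) if a != 0]
    if not pairs:
        raise ValueError("MAPE is undefined when every actual value is 0")
    return 100.0 * sum(abs(a - p) / abs(a) for a, p in pairs) / len(pairs)


# Made-up occupancy values for one fold
fold_actual = [0.80, 0.50, 0.40]
fold_predicted = [0.60, 0.55, 0.50]
score = mape(fold_actual, fold_predicted)
print(f"MAPE for this fold: {score:.1f}%")
```

Inside the cross-validation loop, the same function could be called with `test_data['occupancy']` and `predictions` in place of the made-up lists.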

I aimed to use the estimated coefficient for price per guest to determine the demand elasticity. However, because the model is linear in its predictors, a single price coefficient couldn't capture the diminishing returns associated with marginal price changes. As a result, the elasticity estimate wasn't reliable.

Random Forest Regression

The next step involved using Random Forest Regression to explore how different pricing scenarios affect occupancy. By adjusting the price input and observing the changes in occupancy, I could calculate the percentage change in occupancy relative to price changes to estimate price elasticity. This approach captures complex, non-linear interactions, helping me understand how occupancy varies with different prices to create effective pricing strategies.
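
The what-if procedure described above can be sketched independently of the model. In the example below, `toy_occupancy` is a hypothetical stand-in for a fitted model's prediction function (it is not a Random Forest), and `what_if_elasticity` is an illustrative helper name. The point is only to show how a price perturbation translates into an elasticity estimate.

```python
def what_if_elasticity(predict, features, price_key, pct_change=0.10):
    """Estimate price elasticity by perturbing the price and re-predicting.

    predict: any callable mapping a feature dict to an occupancy in (0, 1].
    Returns (% change in occupancy) / (% change in price).
    """
    base_occupancy = predict(features)
    bumped = {**features, price_key: features[price_key] * (1 + pct_change)}
    new_occupancy = predict(bumped)
    pct_occupancy_change = (new_occupancy - base_occupancy) / base_occupancy
    return pct_occupancy_change / pct_change


def toy_occupancy(features):
    """Hypothetical stand-in for a trained model: occupancy falls as price rises."""
    return max(0.0, min(1.0, 0.9 - 0.01 * features["price_per_guest"]))


listing = {"price_per_guest": 30.0, "number_of_beds": 3, "number_of_bathrooms": 2}
elasticity = what_if_elasticity(toy_occupancy, listing, "price_per_guest")
print(f"Estimated elasticity: {elasticity:.2f}")
```

With a real model, `predict` would wrap the fitted estimator, and the calculation could be repeated at several price levels to see how the elasticity changes along the curve.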

However, it didn't work as well as expected because I had left out important features that influence price, such as amenities and image quality, which reflect the apartment's luxury and decor. Image quality could be scored with vision-language or captioning models. I'll address these factors in the next model, but for now, I needed something simple and effective to deploy quickly.

K-Nearest Neighbors

Ultimately, the best approach for analyzing price versus occupancy for similar listings was to use the K-Nearest Neighbors (KNN) method. By identifying the most similar listings in the area, I could estimate their revenue based on their price per guest and occupancy rates. This allowed me to pinpoint the optimal balance where price and occupancy levels maximize revenue. Consequently, this method provided a suggested price per night, a revenue estimate, and a list of comparable nearby listings.
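
A minimal sketch of this idea, using made-up listings and only the standard library, might look like the following. The feature set, the range-based scaling, the value of k, and the rule of picking the neighbor with the highest revenue per guest-night are all assumptions for illustration, not necessarily what the deployed tool does.

```python
from math import sqrt


def k_nearest(target, listings, keys, k):
    """Return the k listings closest to target on range-scaled features."""
    spans = {}
    for key in keys:
        values = [row[key] for row in listings]
        spans[key] = (max(values) - min(values)) or 1.0  # avoid dividing by zero

    def distance(row):
        return sqrt(sum(((row[key] - target[key]) / spans[key]) ** 2 for key in keys))

    return sorted(listings, key=distance)[:k]


# Made-up comparable listings, for illustration only
listings = [
    {"id": "A", "lat": 6.245, "lng": -75.590, "guests": 4, "beds": 3, "price_per_guest": 20.0, "occupancy": 0.80},
    {"id": "B", "lat": 6.246, "lng": -75.591, "guests": 5, "beds": 3, "price_per_guest": 25.0, "occupancy": 0.70},
    {"id": "C", "lat": 6.244, "lng": -75.589, "guests": 5, "beds": 4, "price_per_guest": 35.0, "occupancy": 0.40},
    {"id": "D", "lat": 6.210, "lng": -75.570, "guests": 2, "beds": 1, "price_per_guest": 40.0, "occupancy": 0.90},
]
target = {"lat": 6.245, "lng": -75.590, "guests": 5, "beds": 3}

neighbors = k_nearest(target, listings, keys=["lat", "lng", "guests", "beds"], k=3)

# Expected revenue per guest-night for each comparable listing
best = max(neighbors, key=lambda row: row["price_per_guest"] * row["occupancy"])
suggested_nightly = best["price_per_guest"] * target["guests"]

print("Comparable listings:", [row["id"] for row in neighbors])
print(f"Suggested price per night: {suggested_nightly:.2f}")
```

Scaling each feature by its range keeps coordinates (which differ by hundredths of a degree) from being drowned out by bed or guest counts. Restricting the revenue comparison to the nearest neighbors is what stops a distant, dissimilar listing from driving the suggested price.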

Conclusions

While this model isn't perfect, it offers a starting point for Airbnb entrepreneurs to estimate occupancy and pricing. Although it's missing a lot of information, it helped us gauge how competitive our listing is and set expectations for our investment.

For future models, I plan to incorporate several improvements:

  • Add amenities to the features, which will require more data to enhance prediction accuracy.
  • Schedule the scraper to continuously update occupancy metrics, providing more accurate insights over time.
  • Use vision and captioning models to analyze images and extract features related to home decor, luxury, and other amenities.
  • Expand the model to include more cities, allowing more users to make well-informed decisions.

At Churnless AI, we apply similar data science approaches to optimize business decisions across various domains.

If you're interested in leveraging data science for your investment or business decisions, contact us to learn more.
