Airbnb New User Prediction — A Kaggle case study

Namrata Thakur · Published in Analytics Vidhya · 15 min read · Jan 17, 2021


As people become more and more curious about their surroundings, they sail across borders to explore distant places. This curiosity gave birth to the tourism industry. Over the years, the industry has boomed across the world and now offers different types of accommodation and experiences to cater to a wide variety of needs.

A very prominent name in this industry is a startup (now a multi-billion-dollar company) called Airbnb. Founded in San Francisco in 2008, Airbnb expanded rapidly and now operates in hundreds of cities around the world.

Contents:

  1. Business Problem
  2. Mapping the real-world problem as an ML problem
  3. Data set Analysis
  4. Real-World Business Constraints
  5. Performance Metrics
  6. Existing Approaches
  7. My Improvements
  8. EDA
  9. Feature Engineering
  10. Modeling
  11. Kaggle Screenshot
  12. Future Work
  13. References
  14. Github Repository link
  15. Linkedin profile

1. Business Problem :

Airbnb has become a very popular choice among travelers around the world for the unique experiences it provides and for presenting an alternative to costly hotels. It currently operates in more than 34,000 cities across 190 countries. Customers can make their bookings either through the website or the iOS/Android application. Airbnb is consistently trying to improve this booking experience to make it easier for the first-time customer.

The problem this case study deals with is predicting the destination a user is most likely to book for the first time. An accurate prediction helps decrease the average time required to book by enabling more personalized recommendations, and it also helps in better forecasting of demand. We use the browser session data as well as the users' demographic information provided to us to create features that help solve the problem.

2. Mapping the real-world problem as an ML problem :

This is a multi-class classification problem: given the user data, we have to predict the top five most probable destinations from 12 choices - 'US', 'FR', 'CA', 'GB', 'ES', 'IT', 'PT', 'NL', 'DE', 'AU', 'NDF', and 'others'.

'NDF' and 'others' are different from each other: 'NDF' means no booking was made for this user, whereas 'others' means a booking was made, but to a country not present in the list given.

3. Data set Analysis :

The dataset is taken from the Kaggle competition page.

a) Files given:

train_users, test_users, sessions, countries, and age_gender_bkts.

b) Total File Size: 64.71 MB

c) Total number of records: 213,451 (train_users), 62,096 (test_users)

d) The first two files contain the individual information of the users, i.e. age, gender, signup_method, language, country_destination (target), etc. The sessions.csv file contains the users' web session data. Each record in this dataset is identified by the user_id field, which corresponds to the id field of the train/test datasets. We find several session records per user, one for each time the user accessed the Airbnb application.

e) sessions.csv has data for users from 2014 onwards, whereas the training dataset has records dating back to 2010. This means that session records are not available for many users: only 35% of the train users and 99% of the test users have records in the sessions dataset.

f) The last two datasets contain more general statistical information about the destinations and the users, respectively.

4. Real World Business Constraints :

a) Low latency is important.

b) Misclassification cost is not considerably high, as the user can very easily change the destination if he/she doesn't like the given recommendations.

c) Interpretability of the result is not very important, as the user is not concerned with how the places are recommended.

5. Performance Metrics :

Since this is a multi-class classification problem, we could use metrics like multi-class log-loss and the F1 score. But the Kaggle competition required us to use the NDCG (Normalized Discounted Cumulative Gain) score.
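In this competition, each user has exactly one true destination and we submit up to five ordered guesses, so the metric simplifies considerably: the ideal DCG is 1, and the score reduces to the positional discount of the correct answer. Below is a minimal sketch under that single-relevant-item assumption (the function name is mine, not from any competition kit):

```python
import numpy as np

def ndcg_at_5(predicted, actual):
    """NDCG@5 when each user has exactly one relevant destination.

    `predicted` is a list of up to 5 country codes ordered by confidence;
    `actual` is the single true country. The correct answer at rank r
    contributes 1 / log2(r + 1); a complete miss scores 0.
    """
    for i, country in enumerate(predicted[:5]):
        if country == actual:
            return 1.0 / np.log2(i + 2)  # rank i+1 -> discount log2(i+2)
    return 0.0

# The true country 'US' appears at rank 2 -> 1/log2(3), about 0.63
print(ndcg_at_5(['NDF', 'US', 'FR', 'IT', 'ES'], 'US'))
```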

6. Existing Approaches :

a) Only 35% of train users have session records. Approaches that used only the train data and not the session data got relatively low scores.

b) One public kernel used only the count of session records per user as a new feature and discarded all other session features. The feature importance plot given later shows that the session features are very important in modeling; hence, that approach got a comparatively low score.

7. My Improvements :

a) Used both train_users.csv and sessions.csv for feature engineering.

b) The sessions CSV has multiple records for every ID/user, where each record captures an action the user performed and the time spent on that action on Airbnb. We grouped all the records per user.

c) Used a TF-IDF vectorizer with unigrams and bigrams to capture the prevalence and rarity of each action/device.

d) Processed the 'age' and date columns extensively.

8. EDA :

Let’s first read the train_users.csv file and check the columns.

We see that there are 213,451 records and a total of 16 columns. Out of these 16 columns, 3 contain null values. Let's check what percentage of values is null in those columns.
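As a rough sketch, the loading and the null check can look like this (the file name follows the Kaggle download; adjust the path as needed):

```python
import pandas as pd

train_users = pd.read_csv('train_users_2.csv')
print(train_users.shape)  # (213451, 16)

# Percentage of null values, for the columns that have any
null_pct = train_users.isnull().mean() * 100
print(null_pct[null_pct > 0].sort_values(ascending=False))
```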

Three fields have null records.

With this basic information in hand, let's start our univariate analysis.

Univariate Analysis:

a) Target Column: 'country_destination'

The dataset is highly imbalanced. The majority of the customers (58.35%) have not made a booking. Among those that have, a large share of users (29.22%) booked a destination in the 'US'. Among the non-US countries given, France ('FR') has the highest share at 2.35%. Also, a considerable percentage of users (4.73%) traveled to a destination not present among the options given.

b) Column: 'gender'

We see there are 4 categories in the gender field. The majority of the users (44.83%) have not disclosed their gender. Among those who have, female users marginally outnumber male users, with 29.53% female and 25.5% male. A small percentage (0.13%) of users selected 'other' as their gender, which may mean they belong to the non-binary category.

c) Column: 'signup_method'

The majority of the users (71.63%) used the basic method to create an account on the Airbnb application, meaning they likely used a normal email-password combination to sign up. Almost all of the remaining users (28.11%) used Facebook to sign up, and a very small percentage (0.26%) used Google.

This suggests that Google is not much preferred as a signup method. It can also mean that users are reluctant to share their personal information by linking Facebook or Google with their Airbnb account, and hence chose the basic email-password method to access their account.

d) Column: 'language'

The majority of users (96.66%) have English as their language. Since the data is from the US, such behavior is quite expected, as the majority of that country's population identifies English as its primary language.

e) Column: 'affiliate_channel'

This field captures the kind of paid marketing used to reach the users, i.e. the different channels/categories employed. Among the 8 categories, 'direct' is the most useful, as it reached the majority of the users (64.52%). There is a sharp fall in percentage as we go to the second most important category, 'semi-brand', which reached approximately 12.2% of users. 'content' and 'remarketing' reached very small percentages of users, 1.85% and 0.51% respectively. These are of little use as marketing channels; they should either be dropped to save marketing costs or be improved so that they reach more users.

f) Column: 'signup_app'

The majority of the users (85.6%) used 'Web' to create their account. Among the remaining apps, 'iOS' is the most used (approximately 5.98%), ahead of 'Moweb' and 'Android'. 'Moweb' and 'Android' are not very popular signup apps, used by only 2.93% and 2.56% of the users respectively.

g) Column: 'first_device_type'

The most popular device used to create an account is 'Mac Desktop', used by 41.98% of the customers, closely followed by 'Windows Desktop' with 34.07%. Among the remaining categories, 'iPhone' is more popular than 'Android Phone' in the mobile domain, and similarly 'iPad' is more popular than 'Android Tablet' in the tablet domain. Overall, we can conclude from this plot that Apple devices are more popular than their Windows and Android counterparts across all categories (desktop, mobile, tablet).

h) Column: 'first_browser'

This field tracks the first browser the customer used to access the application. A large share (29.91%) is taken by 'Chrome', followed by 'Safari' at 21.16% and 'Firefox' at 15.77%. From the previous plot, we know that 'Mac Desktop' is the most popular device among users; from this plot, we see that a large share of customers preferred 'Chrome' even on their Apple products, instead of Apple's own browser, 'Safari'. Another interesting fact is that for a considerable percentage (12.77%) of customers, the first browser is not known.

i) Column: 'date_account_created'

We created 3 columns from this date field, namely 'account_created_day', 'account_created_month', and 'account_created_year'. After this, let's analyze the month and year columns separately.
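A minimal sketch of how these three columns can be derived with pandas:

```python
import pandas as pd

train_users['date_account_created'] = pd.to_datetime(
    train_users['date_account_created'])

train_users['account_created_day'] = train_users['date_account_created'].dt.day
train_users['account_created_month'] = train_users['date_account_created'].dt.month
train_users['account_created_year'] = train_users['date_account_created'].dt.year
```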

'June' (06) is the most popular month among customers for creating an account, closely followed by 'May' (05). The fewest accounts are created during 'November' (11) and 'December' (12). That time is a holiday period for the majority of people around the world, so they might actually be traveling then.

The number of accounts created increased year over year from 2010 to 2014. For 2014, only the accounts created between January and June are present in the training dataset.

Now let's check the month-wise and year-wise increments to look for seasonal variability.

For the months of April (04) and May (05), we see an increase in account creation for all the years, and similarly for June (06) and July (07). August (08) and September (09) saw an increase in 2011 and 2013, while 2012 shows a dip. For 2014 (up to June), the fewest accounts were created in February and the most in June. Also, compared to the other years, considerably more accounts were created in 2014 across all the months.

j) Column: ‘age

As seen earlier, a total of 87,990 users have not disclosed their age. Let’s check for the distribution of values of this column.

Checking the percentile values to detect outliers at the minimum end:

The 0.04th percentile value is 5.0 and the 0.05th percentile value is 15.0. This shows that anything below the 0.05th percentile value is an outlier. So we take the minimum permissible age as 15.0, and anything below it is handled through the processing described below.

Checking the percentile values to detect outliers at the maximum end:

The 99.29th percentile value is 110.0 and the 99.39th percentile value is approximately 1949. This shows that anything above the 99.29th percentile value is an outlier and is handled through the processing described below.

We see some people have given age values as 19xx or 20xx.

We can assume that those who have given 19xx as an age have mistakenly put their year of birth instead of their age.

Those who have given 20xx as an age have entered an arbitrary value as their age.

We process the values of the 'age' field using the following 4 cases (a code sketch follows the list):

  1. Fill the null values with the median value i.e. 34.
  2. We replace any value less than the minimum age found i.e. 15.0 or any value more than 2007 with the median value i.e. 34.
  3. We keep any value between 15.0 and 117.0 (age of the oldest person alive today) as it is.
  4. For an age that is greater than 117.0 and less than 2007, we assume that the user has mistakenly inputted their year of birth instead of age. So we subtract that value from the ‘account_created_year’ field value to get their age on the day they created the account.
Max Age before processing :  2014.0
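Here is a minimal sketch of that logic, using the median (34), minimum (15.0), and maximum (117.0) values discussed above; the exact boundary handling in the original code may differ:

```python
import numpy as np

MEDIAN_AGE, MIN_AGE, MAX_AGE = 34.0, 15.0, 117.0

def clean_age(age, account_created_year):
    # Case 1: missing ages get the median
    if np.isnan(age):
        return MEDIAN_AGE
    # Case 4: values between 117 and 2007 look like a year of birth,
    # so convert them to an age as of the account-creation year
    if MAX_AGE < age < 2007:
        age = account_created_year - age
    # Case 2: anything still below 15 or above 117 gets the median
    if age < MIN_AGE or age > MAX_AGE:
        return MEDIAN_AGE
    # Case 3: plausible ages pass through unchanged
    return age

train_users['age'] = train_users.apply(
    lambda row: clean_age(row['age'], row['account_created_year']), axis=1)
```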

After processing the values in the ways mentioned above we see:

Max Age After Year Imputation :  115.0

count    213451.000000
mean         36.006526
std          10.794212
min          15.000000
25%          32.000000
50%          34.000000
75%          35.000000
max         115.000000

The statistics for ‘age’ seem far better now.

With this, we finish the univariate analysis stage.

Bivariate Analysis:

a) Let's first bin the 'age' values into ranges, i.e. '15–20', '20–30', etc.
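A sketch with pd.cut (the bin edges are my assumption, based on the labels shown):

```python
import pandas as pd

bins = [15, 20, 30, 40, 50, 60, 70, 80, 90, 100, 117]
labels = ['15-20', '20-30', '30-40', '40-50', '50-60',
          '60-70', '70-80', '80-90', '90-100', '100+']

train_users['age_bin'] = pd.cut(train_users['age'], bins=bins,
                                labels=labels, include_lowest=True)
```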

The '30–40' age bin forms the majority population for nearly all the destinations. For 'NDF', 'US', and 'others', we see that the second most frequent age bin is '20–30'.

b) ‘country_destination’ and ‘gender’:

We see that only for 'NDF' and 'US' do we have substantial counts for the different gender classes. For the other destinations, there are no visible differences in the counts across genders.

c) ‘country_destination’ and ‘account_created_month’:

From the univariate analysis of the month and year columns, we learned that a good majority of accounts are created in June, July, and August. Let's see, for the month of July, what percentage of accounts is created for each destination.

For July, we see that a high majority (56.41%) have not made a booking, and 31.02% of users have booked for the 'US'. This behavior is expected, as the dataset is highly skewed towards 'NDF' and 'US'.

Multivariate Analysis:

Analysing ‘age’, ‘country_destination’, and ‘gender’ :

For 'NDF', we see that the 'age' distribution is the same for both the 'Male' and 'Female' genders. This holds for almost all the other destinations as well.

But for the ‘other’ gender we see that there is quite a variation in the spread of the ‘age’ variable across the different destinations.

For example, the spread of ‘age’ for ‘other’ gender is very small for destinations like ‘FR’, ‘GB’, and ‘NL’. On the other hand, it is quite large for ‘CA’, ‘IT’, and ‘DE’.

With this, we finished the EDA part.

9. Feature Engineering:

Let’s check the head of the sessions.csv file to understand the session data better.

We see that there are multiple session records for one user_id. We need to group these records so that we have only one record for each user_id.

We see that there are some session records for which the 'user_id' is absent. These don't make any sense, because unless we have the user_id, we cannot know which user the session data belongs to. So, we will only consider session records that have a non-null user_id.

We have session data for a total of 135,483 unique users. These users are spread across the train and test files.

Concatenating rows of session data for each user :
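A minimal sketch of this grouping; it already keeps only the unique actions per user, as described in the feature-creation step below, and the actual functions are in the GitHub repo linked at the end:

```python
import pandas as pd

sessions = pd.read_csv('sessions.csv')

# Keep only records that can be attributed to a user
sessions = sessions[sessions['user_id'].notnull()]

# One row per user: join the unique actions/devices into a single string
grouped = sessions.groupby('user_id').agg(
    action=('action', lambda s: ' '.join(s.dropna().unique())),
    action_type=('action_type', lambda s: ' '.join(s.dropna().unique())),
    action_detail=('action_detail', lambda s: ' '.join(s.dropna().unique())),
    device_type=('device_type', lambda s: ' '.join(s.dropna().unique())),
).reset_index()
```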

Now we have one record for each user_id.

Creating session features:

For the 'secs_elapsed' field, we are creating 3 features (see the sketch after the list):

  1. Total seconds spent by each user across many sessions
  2. Average seconds spent by each user
  3. Count of number of sessions per user
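A sketch of these three aggregations using pandas named aggregation:

```python
secs_features = sessions.groupby('user_id')['secs_elapsed'].agg(
    total_secs_elapsed='sum',   # 1. total seconds across all sessions
    avg_secs_elapsed='mean',    # 2. average seconds per record
    session_count='count',      # 3. number of session records
).reset_index()
```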

For the action fields, we see from the snapshot above that there are duplicates in the 'action', 'action_type', 'action_detail', and 'device_type' columns. For the feature engineering part, we create a list of only the unique actions/devices that the user has performed/used while accessing the application.

Example of features created from the ‘secs_elapsed’ fields.

Examples of features created from the ‘action/device’ fields.

Code for all these functions is present in the GitHub repo, the link to which is given at the end.

After the feature creation stage, let's merge the users common to the train and session files:
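A sketch of the merge, reusing the grouped frames from above (an inner join keeps only the train users that have session data):

```python
merged = train_users.merge(grouped, left_on='id',
                           right_on='user_id', how='inner')
merged = merged.merge(secs_features, on='user_id', how='left')

print(merged.shape[0])  # expect 73,815 rows
```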

Previously, we saw that we have session records for a total of 135,483 unique users. Of these, only 73,815 users are present in the training file; the rest are present in the test file.

We will be modeling using only these 73,815 records. After the creation of the new features, the merged dataset has a total of 31 features.

We drop 'timestamp_first_active', 'date_account_created', 'action', 'action_type', 'action_detail', 'device_type', 'secs_elapsed', and 'user_id', because we have already preprocessed these features and created the processed ones we need.

Creating X and Y and splitting the Train data into Train and CV for modeling:
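A sketch of the split (the 80/20 stratified split below is my assumption; the exact ratio is in the repo):

```python
from sklearn.model_selection import train_test_split

Y = merged['country_destination']
X = merged.drop(columns=['country_destination'])

X_train, X_cv, y_train, y_cv = train_test_split(
    X, Y, test_size=0.2, stratify=Y, random_state=42)
```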

Let’s create the one-hot vectors for the categorical features.
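One way to do this is with scikit-learn's OneHotEncoder, fit on the train split only (the column list below is illustrative, not necessarily the full set used in the repo):

```python
from sklearn.preprocessing import OneHotEncoder

cat_cols = ['gender', 'signup_method', 'language', 'affiliate_channel',
            'signup_app', 'first_device_type', 'first_browser']

ohe = OneHotEncoder(handle_unknown='ignore')
X_train_cat = ohe.fit_transform(X_train[cat_cols])  # sparse matrix
X_cv_cat = ohe.transform(X_cv[cat_cols])
```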

Like this, we one-hot encoded all the remaining categorical features.

TF-IDF Vectorization for the text fields :
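A sketch for one text field, with unigrams and bigrams as mentioned in the improvements section; the same pattern is repeated for the other action/device fields:

```python
from sklearn.feature_extraction.text import TfidfVectorizer

tfidf_action = TfidfVectorizer(ngram_range=(1, 2))
X_train_action = tfidf_action.fit_transform(X_train['action'].fillna(''))
X_cv_action = tfidf_action.transform(X_cv['action'].fillna(''))
```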

After the creation of vectors for the different text/categorical fields, we drop the original non-vectorized text/categorical features from the train and CV datasets.

The 8 features that remain are all numerical features.

Concatenating the vectorized features with the remaining original 8 numerical features :
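Since the vectorized outputs are sparse matrices, scipy's hstack is a natural fit; the numerical column list below is illustrative, not the exact 8 used in the repo:

```python
from scipy.sparse import hstack, csr_matrix

num_cols = ['age', 'signup_flow', 'account_created_day',
            'account_created_month', 'account_created_year',
            'total_secs_elapsed', 'avg_secs_elapsed', 'session_count']

X_train_final = hstack([X_train_cat, X_train_action,
                        csr_matrix(X_train[num_cols].values)]).tocsr()
X_cv_final = hstack([X_cv_cat, X_cv_action,
                     csr_matrix(X_cv[num_cols].values)]).tocsr()
```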

Saving the final data frame and vectorizers that we have created for future use.

With this, we finish the feature engineering stage.

10. Modeling :

We applied different ML models to the dataset that we processed earlier: Logistic Regression, Naive Bayes, Random Forest, and GBDT classifiers such as XGBoost and LightGBM. The best test result came from the Random Forest model; surprisingly, the worst came from the XGBoost model. We tuned the hyper-parameters for all the models and plotted heatmaps to see how the tuning worked. We also plotted the feature importances for the RF and XGBoost models.

Since the best model turned out to be the Random Forest, we take the top 80% of features from the best Random Forest model and tune its hyper-parameters further.
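A sketch of this two-step procedure (the importance cutoff and the parameter grid below are assumptions):

```python
import numpy as np
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import GridSearchCV

# First fit, to rank the features by importance
rf = RandomForestClassifier(n_estimators=300, random_state=42)
rf.fit(X_train_final, y_train)

# Keep the top 80% of features by importance rank
order = np.argsort(rf.feature_importances_)[::-1]
top_k = int(0.8 * len(order))
X_train_top = X_train_final[:, order[:top_k]]

# Re-tune on the reduced feature matrix
grid = GridSearchCV(
    RandomForestClassifier(random_state=42),
    param_grid={'n_estimators': [100, 300, 500], 'max_depth': [10, 20, 30]},
    cv=3)
grid.fit(X_train_top, y_train)
print(grid.best_params_)
```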

The best parameters are:

Let’s plot the feature importance graph:

We see that 'age' is the most important feature, so it's good that we processed that feature with care. We also see that the session features are more important than the others: many of them, like 'view_booking_request' and 'booking_request_submit', are among the top 25 features plotted.

We plotted heatmaps between hyper-parameters like ‘max_depth’ and ‘n_estimators’ for both train and cv datasets.

With this, let’s save this model so that we can use it later to deploy the application.

Let’s see a snapshot of the prediction that this model does on the test file.

Test Data Prediction
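For completeness, here is a sketch of how the top-5 predictions per user can be turned into the Kaggle submission format (X_test_final and test_ids stand for the test-side feature matrix and user ids, prepared the same way as the train features):

```python
import numpy as np
import pandas as pd

# Select the same feature subset used for training, then get probabilities
X_test_top = X_test_final[:, order[:top_k]]
proba = grid.best_estimator_.predict_proba(X_test_top)

# The 5 highest-probability countries per user, in order
top5 = grid.best_estimator_.classes_[
    np.argsort(proba, axis=1)[:, ::-1][:, :5]]

# Five rows per user, as the competition expects
rows = [(uid, c) for uid, countries in zip(test_ids, top5) for c in countries]
submission = pd.DataFrame(rows, columns=['id', 'country'])
submission.to_csv('submission.csv', index=False)
```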

11. Kaggle Screenshot:

The Random Forest model gave the best private score of 0.87883, which placed within roughly the top 20% of the leaderboard.

12. Future work:

a) More rigorous hyperparameter tuning.

b) Bigram and trigram features using the TF-IDF vectorizer, which will increase the dimensionality and model complexity significantly but may give a better score.

c) Using deep learning techniques like LSTMs to capture the time series information from the action columns.

That's it for this case study. If you have any suggestions to improve it, please leave them in the comments!
