Predict the customer segment in Advertising with 0.82 AUC score - Top 1% in IndiaHacks Machine Learning Competition

Top 1% in IndiaHacks Machine Learning Competition - Customer Segmentation

Indiahacks Machine Learning competition is an All India machine learning competition conducted once in a year. I participated in the qualification round and secured 6th position(out of 6000 participants), which is Top 1%. Only top 60 participants were selected to participate in offline zonal round. However, I was unable to participate in zonal round since I was traveling.

Note: Code is not production ready yet, so not sharing it on github. Will share it when I get some free time.

The challenge was to predict the segment(pos, neg) based on the given features:

  • ID: unique identifier variable
  • titles: format “title:watch_time”, titles of the shows watched by the user and watch_time on different titles
  • genres: same format as titles
  • cities: same format as titles
  • tod: total watch time of the user spread across different time of days (24 hours format)
  • dow: total watch time of the user spread across different days of week (7 days format)

Model Pipeline:

Segment Hotstar Competition Pipeline

Features extracted from text variables:

Titles variable: I used word embedding using word2vec, a deep learning technique which maps similar words to context after trying Bag of Words. Word embedding improved my validation score significantly.

Tod variable: Several features were extracted from total watch time column out of which I ended up using the following features:

  • tod_median_time
  • tod_min_time
  • dow_max_time
  • watch time counts at hours (0 to 23) mapped from t0 to t23
  • tod_start
  • tod_end
  • Days

Cities variable: Several features were extracted out of which I ended up using the following features:

  • cities_min_time
  • cities_count

Genres variable: Extracted the genres from this column and mapped each genre as a binary feature.

Apart from these, I extracted the following features:

  • genres_min_time
  • genres_max_time
  • genres_mean_time
  • genres_median_time

Other text related features which improved by validation score are:

  • titles length
  • titles count
  • cities strings

Features extracted from numeric columns:

dow variable:

  • Binary features for each day of the week starting from Monday to Sunday.
  • Watch time spent on each day from Monday to Sunday.

Model Evaluation:

I tried several different models which produced the following scores:

Model Score
Logistic Regression 0.70
Linear Discriminant Analysis 0.79
K Nearest Neighbors 0.59
CART 0.55
AdaBoost 0.79
Gradient Boosting 0.80
Random Forests 0.69
Extra Trees Regressor 0.70
LightGBM(after hyper parameter tuning) 0.822
Xgboost(after hyper parameter tuning) 0.821

I ended up using LightGBM to generate my predictions.

Hyper parameter tuning:

Used hyperopt to automatically find the right hyper parameters which improved my validation score for LightGBM model The competition rewarded contestants who did feature engineering.

Things that I tried which din't work:

  • Bag of Words approach.
  • Dimensionality reduction on Word2vec features.
  • Several extracted numerical and text features.