The Indiahacks Machine Learning competition is an all-India machine learning competition held once a year. I participated in the qualification round and placed 6th out of 6000 participants, which is within the top 1%. Only the top 60 participants were selected for the offline zonal round. However, I was unable to take part in the zonal round because I was traveling.
Note: The code is not production-ready yet, so I am not sharing it on GitHub. I will share it when I get some free time.
The challenge was to predict the segment (pos or neg) from the given features:
- ID: unique identifier variable
- titles: format “title:watch_time”, titles of the shows watched by the user and watch_time on different titles
- genres: same format as titles
- cities: same format as titles
- tod: total watch time of the user spread across different times of day (24-hour format)
- dow: total watch time of the user spread across different days of the week (7-day format)
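To make the "title:watch_time" format concrete, here is a minimal parsing sketch. The comma separator between entries and the `parse_watch_field` name are assumptions for illustration; only the "title:watch_time" pairing comes from the data description.

```python
def parse_watch_field(raw, sep=","):
    """Parse 'name:watch_time' pairs into a {name: total_watch_time} dict.

    Assumes entries are separated by `sep` (a comma here); the data
    description only specifies the 'title:watch_time' part of the format.
    """
    totals = {}
    for entry in raw.split(sep):
        # Split on the last colon so names that contain ':' stay intact.
        name, colon, value = entry.rpartition(":")
        if not colon:
            continue  # skip malformed entries without a watch time
        try:
            watch_time = float(value)
        except ValueError:
            continue  # skip entries whose watch time is not numeric
        name = name.strip()
        totals[name] = totals.get(name, 0.0) + watch_time
    return totals


example = "Show A:120,Show B: Part 2:45,Show A:30"
print(parse_watch_field(example))
```

Repeated names are summed, so the same helper could serve the titles, genres and cities columns if they really share one format.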
Features extracted from text variables:
Titles variable: After trying Bag of Words, I switched to word embeddings trained with word2vec, a neural network technique that maps words used in similar contexts to nearby vectors. Word embeddings improved my validation score significantly.
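One way per-title vectors could be turned into a single feature vector per user is a watch-time-weighted average. The sketch below is an assumption about the pooling step, not a description of what I actually ran: `title_vectors` is a hypothetical lookup standing in for vectors taken from a trained word2vec model, with each title treated as one token.

```python
def pool_title_vectors(watch_times, title_vectors, dim):
    """Watch-time-weighted mean of per-title vectors for one user."""
    pooled = [0.0] * dim
    total_weight = 0.0
    for title, weight in watch_times.items():
        vec = title_vectors.get(title)
        if vec is None or weight <= 0:
            continue  # skip unknown titles and non-positive watch times
        for i in range(dim):
            pooled[i] += weight * vec[i]
        total_weight += weight
    if total_weight == 0:
        return pooled  # all-zero vector when no title matched
    return [x / total_weight for x in pooled]


# Toy 2-dimensional vectors; real word2vec vectors are much longer.
toy_vectors = {"Show A": [1.0, 0.0], "Show B": [0.0, 1.0]}
user_vector = pool_title_vectors({"Show A": 3.0, "Show B": 1.0}, toy_vectors, dim=2)
print(user_vector)
```

Weighting by watch time lets the titles a user spent the most time on dominate the pooled vector; an unweighted mean is the simpler alternative.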
Tod variable: I extracted several features from the time-of-day watch time column, of which I ended up using the following:
- watch time counts at hours (0 to 23) mapped from t0 to t23
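The t0 to t23 mapping can be sketched as a fixed-length expansion of the tod string. This assumes the column uses comma-separated "hour:watch_time" pairs, which the data description does not state explicitly; `tod_features` is a hypothetical helper name.

```python
def tod_features(raw, sep=","):
    """Expand a tod string into 24 columns named t0 ... t23.

    Assumes comma-separated 'hour:watch_time' pairs; hours that do not
    appear in the string get a watch time of 0.
    """
    features = {f"t{h}": 0.0 for h in range(24)}
    for entry in raw.split(sep):
        hour, colon, value = entry.partition(":")
        if not colon:
            continue  # skip malformed entries
        try:
            h, watch_time = int(hour), float(value)
        except ValueError:
            continue  # skip non-numeric hours or watch times
        if 0 <= h < 24:
            features[f"t{h}"] += watch_time
    return features


row = tod_features("21:600,22:1800,7:120")
print(row["t7"], row["t21"], row["t22"], row["t0"])
```

Every user ends up with the same 24 columns regardless of which hours appear in the raw string, which is what a tree model needs as input.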
Cities variable: Several features were extracted out of which I ended up using the following features:
Genres variable: Extracted the genres from this column and mapped each genre as a binary feature.
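A minimal sketch of the binary genre encoding, assuming the genres column has already been parsed into one collection of genre names per user. The helper names and the sample genres are made up for illustration.

```python
def build_genre_vocab(users_genres):
    """Sorted list of every genre seen across all users."""
    return sorted({g for genres in users_genres for g in genres})


def genre_indicators(genres, vocab):
    """1 if the user watched the genre at all, else 0, in vocab order."""
    present = set(genres)
    return [1 if g in present else 0 for g in vocab]


# Each user is a {genre: watch_time} dict; only the keys matter here.
users = [{"Drama": 100.0, "Cricket": 20.0}, {"Comedy": 5.0}]
vocab = build_genre_vocab(users)
rows = [genre_indicators(u, vocab) for u in users]
print(vocab)
print(rows)
```

The vocabulary should be built once on the training data and reused for the test data so that the columns line up.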
Apart from these, I extracted the following features:
Other text-related features which improved my validation score:
- titles length
- titles count
- cities strings
Features extracted from numeric columns:
- Binary features for each day of the week starting from Monday to Sunday.
- Watch time spent on each day from Monday to Sunday.
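The two day-of-week feature groups can be sketched together. The day encoding (1 = Monday through 7 = Sunday) and the column names are assumptions; the input is a per-user mapping from day number to watch time, already parsed from the dow column.

```python
DAYS = ["mon", "tue", "wed", "thu", "fri", "sat", "sun"]


def dow_features(day_watch):
    """Build a binary flag and a watch-time column for each weekday.

    `day_watch` maps a day number (1 = Monday ... 7 = Sunday, an assumed
    encoding) to the user's watch time on that day.
    """
    row = {}
    for number, name in enumerate(DAYS, start=1):
        watch = float(day_watch.get(number, 0.0))
        row[f"watched_{name}"] = 1 if watch > 0 else 0
        row[f"watch_time_{name}"] = watch
    return row


features = dow_features({1: 300, 6: 1200})
print(features["watched_mon"], features["watch_time_sat"], features["watched_tue"])
```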
I tried several different models, which produced the following scores:

| Model | Score |
|---|---|
| Linear Discriminant Analysis | 0.79 |
| K Nearest Neighbors | 0.59 |
| Extra Trees Regressor | 0.70 |
| LightGBM (after hyperparameter tuning) | 0.822 |
| XGBoost (after hyperparameter tuning) | 0.821 |
I ended up using LightGBM to generate my predictions.
Hyperparameter tuning:
I used hyperopt to search for good hyperparameters automatically, which improved the validation score of my LightGBM model.
The competition rewarded contestants who did feature engineering.
Things that I tried which didn't work:
- Bag of Words approach.
- Dimensionality reduction on Word2vec features.
- Several extracted numerical and text features.