Data Science Game 2017

09-10-2017

Qualification Highlights

  • Deezer user sessions
  • y = 1 if user listened to recommended track, else y = 0
  • ROC-AUC metric
  • the test distribution is very different from the train distribution
  • no clear way to build a good validation set
  • rolling/cumulative statistics lead to overfitting

Qualification Solution

  1. smoothed target encoding (see the sketch after this list)
  2. out-of-fold predictions from factorization machines (FM)
  3. xgboost + lightgbm
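
A minimal sketch of the smoothed target encoding in R, assuming hypothetical column names user_id and y in a train data frame and an arbitrary smoothing constant m:

library(dplyr)

m <- 100                      # smoothing strength (assumed value)
global_mean <- mean(train$y)

encoding <- train %>%
  group_by(user_id) %>%
  summarise(n = n(), cat_mean = mean(y)) %>%
  mutate(user_enc = (n * cat_mean + m * global_mean) / (n + m)) %>%
  select(user_id, user_enc)

train <- left_join(train, encoding, by = "user_id")

Rare categories shrink toward the global mean; in practice the encoding is computed out of fold, like the FM predictions in step 2, to avoid target leakage.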

Finals

  • ~ 14k product types
  • ~ 800k orders from ~10k customers in 4 countries
  • 40 features
  • ~ 5 years train / 3 months test
  • predict demand for (item, country, month) triplets
  • MAE metric

Correct Pipeline

  1. add entries with zero demand for unobserved (item, country, month) triplets
  2. add lagged rolling-window statistics of the target variable (min/max/mean/sd/median/mad/… within groups of categorical features; see the sketch after this list)
  3. add point forecasts from exponential smoothing/ARIMA/RNNs/etc.
  4. feature engineering
  5. train a separate LightGBM model with L1/Huber or Fair loss for each target lag to get 10 MAE
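
A sketch of the rolling statistics in step 2, assuming columns country, item, month, demand and an arbitrary 3-month window; everything is computed on lagged demand so the current month's target never leaks into its own features:

library(dplyr)
library(zoo)  # rollapplyr()

data <- data %>%
  arrange(country, item, month) %>%
  group_by(country, item) %>%
  mutate(
    lag1      = lag(demand, 1),
    roll_mean = rollapplyr(lag1, width = 3, FUN = mean,   fill = NA, partial = TRUE),
    roll_sd   = rollapplyr(lag1, width = 3, FUN = sd,     fill = NA, partial = TRUE),
    roll_max  = rollapplyr(lag1, width = 3, FUN = max,    fill = NA, partial = TRUE),
    roll_med  = rollapplyr(lag1, width = 3, FUN = median, fill = NA, partial = TRUE)
  ) %>%
  ungroup()

The same pattern extends to the other statistics listed above and to other grouping columns.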

Our Pipeline

  1. add entries with zero demand for unobserved (item, country, month) triplets
  2. add lagged demand
  3. embed categorical features with StarSpace
  4. extract principal components from the month × item demand matrix
  5. add exponential smoothing forecast
  6. train LightGBM with L1/Huber or Fair loss to get 15 MAE (see the sketch after this list)
  7. submit the plain exponential smoothing forecast instead: 11.6 MAE
  8. tune the exponential smoothing to get 11 MAE
  9. post-process the predictions to get 10.7 MAE
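
A sketch of the LightGBM step (6 here, 5 in the correct pipeline), assuming a numeric feature matrix X, targets y, and placeholder hyperparameters; regression_l1 optimizes MAE directly, while huber and fair are the robust alternatives named above:

library(lightgbm)

dtrain <- lgb.Dataset(data = as.matrix(X), label = y)

params <- list(
  objective     = "regression_l1",  # or "huber" / "fair"
  metric        = "mae",
  learning_rate = 0.05,             # assumed, not the team's settings
  num_leaves    = 63
)

model <- lgb.train(params = params, data = dtrain, nrounds = 1000)
pred  <- predict(model, as.matrix(X_test))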

Featureless Solution

library(tidyverse)  # complete(), group_by(), mutate()
library(forecast)   # ets()

data %>%
  complete(country, item, month,
           fill = list(demand = 0)) %>%   # zero demand for unobserved triplets
  group_by(country, item) %>%
  mutate(demand = as.numeric(fitted(ets(demand, model = "ANN"))))
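
Here ets(demand, model = "ANN") fits simple exponential smoothing (additive errors, no trend, no seasonality) per (country, item) group; the held-out forecast for the 3-month test window would come from forecast(fit, h = 3) on each fitted model rather than from the in-sample fitted values shown above.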