Challenge Overview
Background
Previously, we have run a series of code challenges related to this forecasting problem. In this challenge, we focus on the normalised net migrations across three products: Tortoise, Rabbit and Cheetah.
Challenge Objectives
For Net Migrations (Normalised) for the three products, the objective of this challenge is to generate a forecast model that minimises the error (without overfitting the model), measured as MAPE, when compared to the actual performance. The model should be tailored to a 12-month forecast horizon but must be extendable beyond this period. Given the limited data set available, accuracy will be measured over a 6-month period, as a reduction in MAPE.
Net Migrations (Normalised) is the target variable for which a forecast model must be generated. This variable set, for each product, has been normalised to reflect the performance in a standard trading month (see the section on Trading Days below). This variable set is included in the privatised data set. The privatised actual performance, prior to normalisation, has also been included for reference, but the challenge objective will be based on the Net Migrations (Normalised) variable.
Challenge Details
Baseline Models
We will provide two baseline models: an LSTM model and a SARIMAX model. The code can be found in the Code Document forums.
Training Data
The training data set covers all data before 18/19_Q4_Mar. Each row describes an item on a certain date with the following columns. The password for the file can be found in the Code Document forums.

Generic Group

Generic Brand

Generic Product Category

Generic Product

Generic Variable

Generic SubVariable

Generic LookupKey

Units

Time Period (a month)
The items include metrics such as revenue, volume base, gross adds, leavers, net migrations and average revenue per customer (see the Background section) for Broadband in the Consumer market, also broken down to the Product level.
The ground truth file has the same number of rows, but only one column: the target value. You can use this data set to train and test your algorithm locally.
Testing Data
The testing data set covers the months from 18/19_Q4_Mar until now. It has the same format as the training set, but no ground truth is provided.
You are asked to make predictions for the testing data. You will need to append a final “Value” column to the testing data, filled with your model’s predictions.
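As a sketch, appending the prediction column could look like the following (pandas is assumed, and the toy column names are illustrative only; adapt to the actual file layout):

```python
import pandas as pd

def append_predictions(test_df: pd.DataFrame, predictions) -> pd.DataFrame:
    """Return a copy of the test set with predictions appended as a final 'Value' column."""
    out = test_df.copy()
    out["Value"] = list(predictions)  # one prediction per test row
    return out

# Toy example (column names here are illustrative only)
toy = pd.DataFrame({
    "Generic Product": ["Rabbit", "Cheetah"],
    "Time Period": ["19/20_Q1_Apr", "19/20_Q1_May"],
})
filled = append_predictions(toy, [0.12, -0.05])
```

Working on a copy keeps the original test file untouched, which makes it easier to re-run the pipeline against different data sets.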
Measurement
We will evaluate your predictions based on holdout test cases using MAPE, which will be introduced later.
Additional Information
Business Insight:
The three products are broadband products:
• Tortoise (legacy product, declining, available everywhere)
• Rabbit (biggest product, reaching maturity, available in most of the country)
• Cheetah (best and most expensive product, new and growing rapidly, but only available in limited geographies)
There is no obligation for customers to upgrade to newer/better products. The footprint of Cheetah is small but growing. Many customers do not upgrade immediately when a new product becomes available: uptake lags footprint.
Net Migrations is the difference between the number of existing Sandesh Brand 1 customers that move onto and off a specific broadband product per month. A positive net migrations value (before privatisation) means that more customers are moving onto the product than off it. Therefore, for Tortoise, a legacy product approaching ‘end of life’, Net Migrations is negative, since customers are mostly upgrading from this product to the superior Rabbit product. Rabbit Net Migrations are positive, since a large number of customers are upgrading to this product from Tortoise; however, a much smaller number are starting to upgrade from Tortoise and Rabbit to the new Cheetah product.
The relationship between Net Migrations across the three products:
Net migrations, when considered across all three products, sum to zero. Since net migrations reflect the movement between products, ‘Net Migrations - Tortoise’ + ‘Net Migrations - Rabbit’ + ‘Net Migrations - Cheetah’ = 0. Up until the launch of Cheetah in late 2017, ‘Net Migrations - Tortoise’ + ‘Net Migrations - Rabbit’ = 0.
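This accounting identity makes a useful sanity check on any joint forecast of the three series. A minimal sketch (the tolerance allows for floating-point noise; before Cheetah's late-2017 launch its series is simply all zeros, so the same check covers the pairwise identity):

```python
def migrations_sum_to_zero(tortoise, rabbit, cheetah, tol=1e-6):
    """Check that monthly net migrations across the three products cancel out."""
    return all(abs(t + r + c) <= tol
               for t, r, c in zip(tortoise, rabbit, cheetah))
```

For example, `migrations_sum_to_zero([-5.0, -3.0], [4.0, 2.0], [1.0, 1.0])` holds, while a forecast whose monthly values do not cancel would fail the check.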
The relationship between key financial variables

Volume Closing Base for a Product = Volume Opening Base for that Product + Gross Adds – Leavers + Net Migrations to that Product

Volume Net Adds = Volume Closing Base – Volume Opening Base

Revenue = Average Volume over the period * Base ARPU
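The three relationships above can be sketched directly in code. Note the averaging convention in the revenue formula is an assumption: the source says "Average Volume over the period" without specifying how the average is taken, so a simple two-point average of opening and closing base is used here.

```python
def closing_base(opening_base, gross_adds, leavers, net_migrations):
    """Volume Closing Base = Opening Base + Gross Adds - Leavers + Net Migrations."""
    return opening_base + gross_adds - leavers + net_migrations

def net_adds(closing, opening):
    """Volume Net Adds = Closing Base - Opening Base."""
    return closing - opening

def revenue(opening, closing, base_arpu):
    """Revenue = average volume over the period * Base ARPU.

    A two-point average of opening and closing base is assumed here;
    the exact averaging used by Sandesh is not stated.
    """
    return (opening + closing) / 2 * base_arpu
```

These identities can also serve as consistency checks across forecast variables: for example, a predicted closing base should reconcile with the predicted gross adds, leavers and net migrations.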
Net Migrations series - Rabbit as an example ...
Note: Net Migrations - Tortoise is the mirror image of the Rabbit trend.
Net Migrations - Rabbit exhibits some distinctive time series patterns.
Trends:
There appear to be four periods of different trends from April ’11 to August ’19, including a short-term peak in early 2019. The business decisions behind these trend shifts are being investigated.
Seasonality:
When considering the 6-point moving average of the normalised data set, seasonality is clearly evident in at least the last three years: peaks in Dec/Jan every year, and troughs every August since 2016.
Noise:
The impact of trading days has been removed in the normalised data set. Though this removes some of the monthly peaks seen in the actual data, the normalised data set remains variable.
Trading Days’ impact has been removed
Sandesh reports its financials in trading months, weeks and days. All trading months have a round number of trading weeks (either 4 or 5) so as to maintain consistency as the units roll up. This means that any given month must have either 28 or 35 trading days. This has been found to have a very significant impact on the forecast, especially for Gross Adds, Leavers and Net Migrations.
To allow for this irregular and somewhat artificial ‘noise’ in the key variables Gross Adds, Leavers and Net Migrations, these variables have been normalised to a standard 30.3-day month prior to privatisation. Predictions are therefore required for these normalised values, and the ‘noise’ will be added back in after prediction.
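The normalisation can be sketched as a pro-rata scaling to the 30.3-day standard month. The exact formula Sandesh used is not given, so simple proportional scaling is an assumption here:

```python
STANDARD_MONTH = 30.3  # standard trading-month length used for normalisation

def normalise(value, trading_days):
    """Scale a monthly value to a standard 30.3-day month (pro-rata scaling assumed)."""
    return value * STANDARD_MONTH / trading_days

def denormalise(value, trading_days):
    """Add the trading-day 'noise' back in after prediction."""
    return value * trading_days / STANDARD_MONTH
```

Under this assumption a 35-trading-day month is scaled down and a 28-trading-day month is scaled up, removing the alternating 28/35-day pattern from the series.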
The coefficient of variation for Net Migrations on the actual and normalised data is:
Tortoise product: 71 (actual), 37.5 (normalised)
Rabbit product: 196 (actual), 77 (normalised)
Therefore, while variation remains significant in the normalised data set, it is greatly reduced by this normalisation.
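The coefficient of variation figures above can be reproduced for any series as follows. The exact definition used by the organisers is not stated, so the common convention (standard deviation over absolute mean, as a percentage, with population standard deviation) is assumed:

```python
import statistics

def coefficient_of_variation(values):
    """CV (%) = population standard deviation / |mean| * 100 (definition assumed)."""
    mean = statistics.fmean(values)
    return statistics.pstdev(values) / abs(mean) * 100
```

A lower CV on the normalised series confirms that removing trading-day effects reduces relative dispersion.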
Data regarding the number of trading days for each month is provided for information.
Financial year modeling
The financial year for Sandesh runs April to March (instead of January to December); hence Q1 is April, May and June.
Challenge structure
Anonymised and Privatised data set:
A ‘z-score’ is used to privatise the real data.
For all variables, the following formula is used to privatise the data:
zi = (xi – μ) / σ
where zi = zscore of the ith value for the given variable
xi = actual value
μ = mean of the given variable
σ = standard deviation for the given variable
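The privatisation transform above can be sketched as follows. Note the population standard deviation is assumed; the source does not say whether population or sample deviation was used:

```python
import statistics

def zscores(values):
    """Privatise a series: z_i = (x_i - mu) / sigma.

    Population standard deviation is assumed; the source does not
    specify which estimator was used.
    """
    mu = statistics.fmean(values)
    sigma = statistics.pstdev(values)
    return [(x - mu) / sigma for x in values]
```

A useful consequence for modelling is that each privatised series has zero mean and unit standard deviation, so relative shapes and trends are preserved while absolute levels are hidden.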
Modeling Insight derived from previous challenges.
An LSTM model and a univariate SARIMAX model (see the included code) have proven to be the best algorithms for predicting the target variables in this data set. These codebases can be found in the Code Document forum for this challenge.
LSTM has proven successful on the customer movement variables (Gross Adds and Leavers), while SARIMAX is most successful on the ‘smooth curve’ variables that describe the customer base (Closing Base, ARPU and Revenue).
However, neither model has demonstrated a capability to accurately predict Net Migrations thus far. Both models are included as a foundation or starting point, but it is anticipated that they will need modification to account for the factors driving the ‘trend’ changes over the data set.
Final Submission Guidelines
Submission Format
Your submission must include the following items:

The filled test data. We will evaluate the results quantitatively (see below).

A report about your model, including data analysis, model details, local cross validation results, and variable importance.

Deployment instructions describing how to install the required libraries and how to run the code.
Expected in Submission
1. Working Python code that runs on different sets of data in the same format
2. A report with a clear explanation of all the steps taken to solve the challenge (refer to the section “Challenge Details”) and of how to run the code
3. No hardcoding (e.g., column names, possible values of each column, ...) in the code. We will run the code on different datasets
4. All models in one codebase with clear inline comments
5. Flexibility to extend the code to forecast additional months
Quantitative Scoring
Given two values, one ground truth value (gt) and one predicted value (pred), we define the relative error as:
MAPE(gt, pred) = |gt - pred| / |gt|
We then compute the raw_score(gt, pred) as
raw_score(gt, pred) = max{ 0, 1 - MAPE(gt, pred) }
That is, if the relative error exceeds 100%, you will receive a zero score for that case.
The final score is the average of raw_score across all test cases, multiplied by 100.
Final score = 100 * average( raw_score(gt, pred) )
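The scoring above can be implemented directly for local validation (cases with gt = 0 are undefined under MAPE and are not handled in this sketch):

```python
def mape(gt, pred):
    """Relative error |gt - pred| / |gt| (undefined when gt == 0)."""
    return abs(gt - pred) / abs(gt)

def raw_score(gt, pred):
    """1 - MAPE, clamped at zero: errors above 100% score nothing."""
    return max(0.0, 1.0 - mape(gt, pred))

def final_score(gts, preds):
    """100 * average raw_score over all holdout cases."""
    return 100.0 * sum(raw_score(g, p) for g, p in zip(gts, preds)) / len(gts)
```

For example, a perfect prediction scores 1.0 raw, a 50% error scores 0.5, and any error of 150% or more scores 0, so a two-case holdout with one perfect and one half-off prediction yields a final score of 75.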
We will use this as a part of evaluation.
Judging Criteria
Your solution will be evaluated in a hybrid of quantitative and qualitative way.

Effectiveness (80%)

We will evaluate your forecasts by comparing them to the ground truth data. Please check the “Quantitative Scoring” section for details.

The smaller the MAPE, the better.

The model must achieve better performance than the provided baseline models.


Clarity (10%)

The model is clearly described, with reasonable justification for the choices made.


Reproducibility (10%)

The results must be reproducible. We understand that there might be some randomness for ML models, but please try your best to keep the results the same or at least similar across different runs.
