Identify which signals can be used to predict whether a user is planning for retirement
Analyze historical user data
Implement a proof-of-concept (POC) algorithm
Provide recommendations for any additional data sets that might be useful to increase performance
Our client, a global wealth management bank, wants to analyze the actions their clients perform in their web application. The main goal is to predict whether a user is giving out signals that they are planning their retirement, i.e. opening a retirement account.
In a follow-up challenge you will get access to all the winning submissions and review feedback from this challenge to build the final predictor.
Data analysis and prediction algorithms should be implemented in Python. If you want to use a different technology, ask for confirmation in the forums.
The available data set is huge - tens of GB of clicks that users performed in the application. You will have access to a subset of this data set - roughly 50%. The remaining data will be used in future challenges.
Besides the clickstream data, there are a few other files:
Clients data - info about the clients
Accounts data - info about the accounts
Client Account Relationship - a one-to-many relationship between the clients and the accounts
Account classification - details about types of accounts
Derived client accounts - sub accounts linked to the main client account
Account Cash Balance - balances for cash accounts (daily)
Account Positions - info about investment positions for the accounts
The “Tables details.xlsx” document provides info about the various tables and columns in the data.
The main goal is to analyze the data, engineer the necessary features and build an algorithm for predicting whether a user is planning for retirement. The idea is to use the clickstream data (the primary data source) in combination with client info (e.g. age), account info and balances to make that prediction.
Opening a retirement account is considered the ground truth. When the predictor algorithm is run (for example once per day for each client), the inputs are all the user actions and account info available up to that point, and the output is the probability of that client opening a retirement account in the next 7 days.
Please note that for training you have the entire click history, including clicks made after a retirement account was opened. The predictor must NOT use any click or account history info for dates after the retirement account was opened. Such data is not available in the real-world use case, and your submission will be disqualified if it uses later data to predict an earlier retirement account opening.
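The 7-day prediction target can be turned into a training label roughly as follows. This is a minimal sketch, not the required method: the function name `label_for_snapshot`, the inclusive 7-day boundary, and the choice to skip clients who already have a retirement account are all assumptions made for illustration.

```python
from datetime import date, timedelta
from typing import Optional


def label_for_snapshot(
    snapshot: date,
    opened: Optional[date],
    horizon_days: int = 7,
) -> Optional[int]:
    """Return the training label for one (client, snapshot date) pair.

    snapshot: the day the predictor is run for this client.
    opened:   the day the client opened a retirement account, or None
              if they never did (hypothetical input, derived from the
              accounts data).
    """
    # Assumption: once the account exists there is nothing left to
    # predict, so such snapshots are excluded from training.
    if opened is not None and opened <= snapshot:
        return None
    if opened is None:
        return 0
    # Assumption: "next 7 days" includes the 7th day after the snapshot.
    return 1 if opened <= snapshot + timedelta(days=horizon_days) else 0
```

Running this once per client per day yields one labeled row per snapshot, which matches the "once per day for each client" usage described above.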
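One way to guard against this kind of leakage is to filter every client's history through a single cutoff before any feature is computed. The sketch below assumes a hypothetical event layout of `(timestamp, action)` tuples and treats events at or after the cutoff as invisible; the real column names come from the data description document.

```python
from datetime import datetime
from typing import List, Optional, Tuple

# Hypothetical event layout: (timestamp, action name).
Event = Tuple[datetime, str]


def visible_events(
    events: List[Event],
    prediction_time: datetime,
    opened_at: Optional[datetime] = None,
) -> List[Event]:
    """Keep only the events the predictor is allowed to see.

    The cutoff is the earlier of the prediction time and the moment the
    retirement account was opened (if it was opened at all).
    """
    cutoff = prediction_time
    if opened_at is not None:
        cutoff = min(prediction_time, opened_at)
    # Assumption: events exactly at the cutoff are dropped, which is the
    # conservative reading of the rule.
    return [event for event in events if event[0] < cutoff]
```

Applying the same filter to account, balance and position records keeps all inputs consistent with what would be known at prediction time.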
The “Account classification” table has info about the various types of retirement accounts. See the “ClientAccountClissification” sheet in the data description document for a list of retirement account codes.
Some simple signals for retirement planning can be:
The user has used retirement planning features on the website
The user's age is between 45 and 59
The user doesn't have a retirement account
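The three signals above map directly onto binary features. The following is an illustrative sketch only: the function name, the feature names and the idea of counting clicks on retirement planning pages are assumptions, and the inputs would have to be derived from the clickstream, client and account tables.

```python
from typing import Dict


def simple_signals(
    age: int,
    retirement_page_clicks: int,
    has_retirement_account: bool,
) -> Dict[str, int]:
    """Encode the three simple signals as 0/1 features."""
    return {
        # At least one click on a retirement planning feature.
        "used_retirement_features": int(retirement_page_clicks > 0),
        # Age range taken from the list above, both ends inclusive.
        "age_45_to_59": int(45 <= age <= 59),
        # Client does not hold a retirement account yet.
        "no_retirement_account": int(not has_retirement_account),
    }
```

These serve as a baseline; richer features (click frequency, balance trends and so on) can be added to the same dictionary.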
In this challenge the entire clickstream history and user account performance info is available for analysis and for building the predictor. But the entire account history is probably not necessary to accurately predict retirement planning - for example, data that is 10+ years old is probably not relevant. Performance (speed) will be one of the important properties of the final predictor, so your submission should detail how much data needs to be given as input to the predictor without affecting accuracy. Aggregating and precalculating some of the historical data will also help performance (for example, one of your features might be “total assets at age 25”, which can be calculated once so that the predictor doesn’t need all the earlier data). We will have to build the pipeline for aggregating the data, so make sure to clarify the exact input data requirements for your algorithm.
It is up to you to analyze the data and figure out the appropriate features for prediction - be creative! All the features should be clearly derived from the input data set, without using any other external data sources. All findings should be backed up by data analysis done in Python.
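A bounded lookback can be expressed as windowed aggregates, so the predictor only needs clicks from the longest window rather than the full history. This is a hedged sketch: the window lengths (7, 30 and 90 days) and the feature names are arbitrary placeholders, and the right values should come out of the data analysis.

```python
from datetime import date, timedelta
from typing import Dict, Iterable, Sequence


def windowed_click_counts(
    click_dates: Iterable[date],
    as_of: date,
    windows: Sequence[int] = (7, 30, 90),
) -> Dict[str, int]:
    """Count clicks in each trailing window ending at `as_of`.

    Only clicks inside the longest window are kept, which bounds how
    much history the predictor has to read.
    """
    oldest = as_of - timedelta(days=max(windows))
    recent = [d for d in click_dates if oldest < d <= as_of]
    return {
        f"clicks_last_{w}d": sum(
            1 for d in recent if d > as_of - timedelta(days=w)
        )
        for w in windows
    }


# Example: a click from 2019 falls outside every window and is ignored.
clicks = [date(2019, 6, 1), date(2020, 2, 15), date(2020, 3, 10), date(2020, 3, 29)]
features = windowed_click_counts(clicks, as_of=date(2020, 3, 31))
```

Comparing model accuracy across different maximum windows is one way to document how much history the predictor actually needs.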
Review will be highly subjective and done by the client. No appeals will be allowed.
Your submission should contain:
Summary document explaining the data characteristics and outlining the main findings - graphs and other visuals are highly encouraged
Data analysis scripts with environment setup and instructions on how to run the analysis
Predictor training and testing scripts with deployment/verification instructions