
    Clickstream - retirement predictor

    PRIZES

    1st

    $2,000

    2nd

    $1,000

    3rd

    $500

    4th

    $250


    Challenge Overview

    Challenge Objectives

     
    • Identify which signals can be used to predict whether a user is planning their retirement

    • Analyze historical user data

    • Implement POC algorithm

    • Provide recommendations for any additional data sets that might be useful to increase performance

     


    Project Background

     

    Our client, a global wealth management bank, wants to analyze the actions their clients perform in their web application. The main goal is to predict whether a user is giving out signals that they are planning their retirement, i.e. opening a retirement account.

     

    In a follow-up challenge you will get access to all the winning submissions and review feedback from this challenge to build the final predictor.


    Technology Stack

     
    • Data analysis and prediction algorithms should be implemented in Python. If you want to use a different technology, ask for confirmation in the forums.


    Data description

     

    The available data set is large - tens of GB of clicks that users performed in the application. You will have access to a subset of this data set - roughly 50%. The remaining data will be used in future challenges.
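
    With tens of GB of clicks, loading everything into memory at once may not be practical. The following is a minimal sketch of chunked processing with pandas. It assumes the clickstream is delivered as CSV and has a `client_id` column; both the file format and the column name are assumptions, so check the data description document for the real ones.

```python
import io
from collections import Counter

import pandas as pd


def clicks_per_client(source, chunksize=1_000_000):
    """Count clicks per client without loading the whole file into memory."""
    totals = Counter()
    # `client_id` is a hypothetical column name, not taken from the real data.
    for chunk in pd.read_csv(source, usecols=["client_id"], chunksize=chunksize):
        totals.update(chunk["client_id"].value_counts().to_dict())
    return totals


# Tiny in-memory stand-in for the real clickstream file.
sample = io.StringIO("client_id,page\n1,home\n2,accounts\n1,retirement\n")
totals = clicks_per_client(sample, chunksize=2)
```

    The same loop shape works for any per-client aggregate that can be accumulated incrementally, such as sums or last-seen timestamps.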

     

    Besides the clickstream data there are a few other files:

    • Clients data - info about the clients

    • Accounts data - info about the accounts

    • Client Account Relationship - a one-to-many relationship between the clients and the accounts

    • Account classification - details about types of accounts

    • Derived client accounts - sub accounts linked to the main client account

    • Account Cash Balance - balances for cash accounts (daily)

    • Account Positions - info about investment positions for the accounts

        

    The “Tables details.xlsx” document provides info about the various tables and columns in the data.


    Analysis requirements

     

    The main goal is to analyze the data, engineer the necessary features and build an algorithm for predicting whether a user is planning their retirement. The idea is to use the clickstream data (the primary data source) in combination with client info (e.g. age), account info and balances to make this prediction.

     

    Opening a retirement account is considered the ground truth. When the predictor algorithm is run (for example once per day for each client), the inputs are all the user actions and account info available up to that point, and the output is the probability of that client opening a retirement account in the next 7 days.
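
    One possible way to turn this definition into training labels is sketched below. It is an illustration, not the required method. The column names (`client_id`, `as_of_date`, `open_date`) are hypothetical, and `openings` is assumed to hold at most one row per client (the first retirement account opening).

```python
import pandas as pd


def label_next_7_days(snapshots, openings, horizon_days=7):
    """Label each (client, as_of_date) snapshot with 1 if the client opens
    a retirement account within `horizon_days` after the snapshot, else 0."""
    merged = snapshots.merge(openings, on="client_id", how="left")
    delta = merged["open_date"] - merged["as_of_date"]
    # Clients with no opening have NaT here, which compares as False.
    in_window = (delta > pd.Timedelta(0)) & (delta <= pd.Timedelta(days=horizon_days))
    merged["label"] = in_window.astype(int)
    return merged[["client_id", "as_of_date", "label"]]
```

    Each row then corresponds to one daily run of the predictor for one client, which matches the "once per day for each client" example above.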

     

    Please note that for training you have the entire click history, including clicks made after a retirement account was opened. The predictor must NOT use any click or account history info dated after the opening of the retirement account. Such data is not available in the real-world use case, and your submission will be disqualified if it uses newer data to predict a retirement account opening in the past.
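
    A simple guard against this kind of leakage is to filter the history before any feature is computed. The sketch below assumes hypothetical columns `client_id`, `event_time` and `open_date`, with at most one `open_date` per client; adapt the names to the actual tables.

```python
import pandas as pd


def drop_post_opening_events(events, openings):
    """Keep only events that happened before the client's retirement
    account was opened; clients with no opening keep all their events."""
    merged = events.merge(openings, on="client_id", how="left")
    keep = merged["open_date"].isna() | (merged["event_time"] < merged["open_date"])
    return merged.loc[keep, events.columns].reset_index(drop=True)
```

    The same filter would need to be applied to account balances and positions, since the rule covers account history as well as clicks.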

     

    The “Account classification” table has info about the various types of retirement accounts. See the “ClientAccountClissification” sheet in the data description document for a list of retirement account codes.

     

    Some simple signals for retirement planning can be:

    • The user has used retirement planning features on the website

    • The user's age is between 45 and 59

    • The user doesn't have a retirement account
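
    The three signals above could be encoded as baseline features along the following lines. This is only a sketch: the input columns (`retirement_page_clicks`, `age`, `has_retirement_account`) are hypothetical and would first have to be derived from the clickstream, client and account tables.

```python
import pandas as pd


def simple_signals(clients):
    """Turn the three example signals into boolean features plus a count."""
    out = pd.DataFrame({"client_id": clients["client_id"]})
    out["used_retirement_tools"] = clients["retirement_page_clicks"] > 0
    # Series.between is inclusive on both ends by default: 45 <= age <= 59.
    out["age_45_to_59"] = clients["age"].between(45, 59)
    out["no_retirement_account"] = ~clients["has_retirement_account"].astype(bool)
    flags = ["used_retirement_tools", "age_45_to_59", "no_retirement_account"]
    out["signal_count"] = out[flags].sum(axis=1)
    return out
```

    A rule such as `signal_count == 3` could serve as a naive baseline to compare a trained model against.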

     

    In this challenge the entire clickstream history and user account performance info is available for analysis and for building the predictor. However, the entire account history is probably not needed to accurately predict retirement planning - for example, data that is 10+ years old is probably not relevant. Performance (speed) will be one of the important properties of the final predictor, so your submission should explain how much data needs to be given as input to the predictor without affecting accuracy. Performance will also benefit if some of the historical data can be aggregated and precalculated (for example, one of your features might be “total assets at age 25”, which can be calculated once so that the predictor doesn’t need all the earlier data). We will have to build the pipeline for aggregating the data, so make sure to clarify the exact input data requirements for your algorithm.
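
    To make the trade-off between input size and accuracy measurable, features can be computed over a configurable lookback window and the window length varied during evaluation. The sketch below shows one such feature. The column names (`client_id`, `event_time`) and the 90-day default are assumptions for illustration, not values taken from the data.

```python
import pandas as pd


def recent_click_counts(events, as_of_date, lookback_days=90):
    """Count each client's clicks in the `lookback_days` up to `as_of_date`."""
    as_of = pd.Timestamp(as_of_date)
    start = as_of - pd.Timedelta(days=lookback_days)
    # Window is (start, as_of]: nothing after the prediction date is used.
    in_window = (events["event_time"] > start) & (events["event_time"] <= as_of)
    return events[in_window].groupby("client_id").size().rename("clicks_in_window")
```

    Retraining with several values of `lookback_days` and comparing accuracy gives a data-backed answer to how much history the predictor actually needs.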


    It is up to you to analyze the data and figure out the appropriate features for prediction - be creative! All the features should be clearly derived from the input data set, without using any other external data sources. All analysis should be backed up by data analysis done in Python.


    Review will be highly subjective and done by the client. No appeals will be allowed.

     

    Your submission should contain:

    • Summary document explaining the data characteristics and outlining the main findings - graphs and other visuals are highly encouraged

    • Data analysis scripts with environment setup and instructions on how to run the analysis

    • Predictor training and testing scripts with deployment/verification instructions


     


    Reliability Rating and Bonus

    For challenges that have a reliability bonus, the bonus depends on the reliability rating at the moment of registration for that project. A participant with no previous projects is considered to have no reliability rating, and therefore gets no bonus. Reliability bonus does not apply to Digital Run winnings. Since reliability rating is based on the past 15 projects, it can only have 15 discrete values.

    REVIEW STYLE:

    Final Review:

    Community Review Board

    Approval:

    User Sign-Off
