Challenge Overview

Our client is an insurance company that currently provides its potential customers with forms, which the customers fill out to describe the insurance options they want to include in their policy. The form currently lists many redundant and unstructured options, along with many free-text input fields. This makes it difficult for customers to find and select the options they need. Furthermore, the insurance company loses the opportunity to help customers find a standard or recommended list of options that most people might find useful.
In the near future, the client wants to build a simplified insurance options (benefits) configurator, where the customer can select options according to their needs in a hierarchical manner. An example of the intended hierarchy (for general illustration only, not based on the provided dataset):
- Q1 - Coverage for transplants?
  - Q1.1 - Kidney Transplant coverage?
    - Yes - 100% coverage
    - Copay - 20%
    - Copay - 10%
    - No
  - Q1.2 - Knee Transplant coverage?
    - Yes - 100% coverage
    - No
The client has provided us with anonymized historical configuration data, which contains the choices customers made on the unstructured options discussed above. Note that in this data, the customer was presented with a list of unstructured options, and NOT the structured, hierarchical question format described above.
The task of this challenge is to discover patterns in the provided data and, based on that analysis, create a report in Excel, JSON or CSV format suggesting a hierarchical set of questions and sub-questions that the client can use to build a simpler configurator based on questions and sub-questions, and to suggest good standard recommendations for future clients.
To find this hierarchy of questions and sub-questions, various patterns can be explored, such as common options that many people buy, combinations of options that are ‘commonly bought together’, and dependency structures, such as “only the people who usually buy A or K tend to buy C and E”.
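As a rough illustration of the ‘commonly bought together’ and dependency patterns, the sketch below counts how often options co-occur and estimates how often one option appears given another. It assumes the data has already been reduced to one set of normalized options per coverage code; the `baskets` data and the function names are hypothetical and not part of the provided dataset or code.

```python
from collections import Counter
from itertools import combinations


def pair_support(baskets):
    """Count how many baskets contain each option and each unordered pair."""
    item_counts = Counter()
    pair_counts = Counter()
    for basket in baskets:
        items = sorted(set(basket))
        item_counts.update(items)
        pair_counts.update(combinations(items, 2))
    return item_counts, pair_counts


def confidence(item_counts, pair_counts, antecedent, consequent):
    """Estimate P(consequent | antecedent) from the co-occurrence counts."""
    if item_counts[antecedent] == 0:
        return 0.0
    pair = tuple(sorted((antecedent, consequent)))
    return pair_counts[pair] / item_counts[antecedent]


# Hypothetical baskets: one set of normalized options per coverage code.
baskets = [
    {"A", "C", "E"},
    {"A", "C"},
    {"K", "C", "E"},
    {"B"},
    {"A", "B", "C"},
]
item_counts, pair_counts = pair_support(baskets)
print(confidence(item_counts, pair_counts, "A", "C"))
```

A high confidence from A to C together with a low one from C to A would hint that C could sit under A as a sub-question, although any threshold for that decision would need to be tuned on the real data.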
Beyond patterns like these, popular and advanced techniques such as clustering, hierarchical clustering, and classical or Deep Learning based NLP methods can also be explored.
After the final set of hierarchical questions and sub-questions is discovered, create a report linking each row in the provided raw dataset to the most relevant sub-question.

Individual Requirements

Pre-process the provided data to disambiguate it, so that minor differences between answers, such as syntactic differences or differences in the values entered into numeric placeholder fields, are filtered out. For example, answers like “Copay is 10%” and “The Copay: 20%” should be considered the same concept. Furthermore, the raw data contains rows with more than one answer in a single row. Such rows should first be split into multiple unique rows, which can then be pre-processed as usual. The overall goal of this step is to reduce the variability in the raw dataset, so that the same or similar concepts written in different natural language or other syntax are combined into a single representation and the processed data is ready for analysis.
After initial pre-processing, analyze the provided dataset to understand the configuration patterns of insurance options in past purchases. The free-form text input rows add the most ambiguity to the dataset and hence are the primary target for analysis. Rows of other input types, such as check-boxes and radio buttons, are relatively easy to consolidate, but should still be considered in the analysis to discover any latent information they could provide.
Create a tabular report in Excel, JSON or CSV format suggesting an actionable set of hierarchical questions and sub-questions, which can be later used to build a UI-based configurator.
Create a report linking each row in the provided dataset to the sub-question or sub-category that it is best associated with.
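
The variability-reduction step in the first requirement could start from something like the minimal sketch below, which masks numeric values and drops a few filler words so that “Copay is 10%” and “The Copay: 20%” map to the same key. The stopword list, the `<num>` placeholder and the `;` delimiter used to split multi-answer rows are all assumptions for illustration; the real delimiter and vocabulary have to be determined from the data.

```python
import re

# Matches amounts such as "10%", "$3000" or "3,000.50".
NUMBER = re.compile(r"\$?\d[\d,]*(?:\.\d+)?%?")
# Assumed filler words; a real list should come from inspecting the data.
STOPWORDS = {"the", "is", "a", "an", "of"}


def normalize_answer(text):
    """Map an answer to a canonical key with numeric values masked."""
    masked = NUMBER.sub(" <num> ", text.lower())
    tokens = re.findall(r"<num>|[a-z]+", masked)
    return " ".join(t for t in tokens if t not in STOPWORDS)


def split_answers(text, delimiter=";"):
    """Split a multi-answer row into separate answers (delimiter is assumed)."""
    return [part.strip() for part in text.split(delimiter) if part.strip()]


for raw in split_answers("Copay is 10%; The Copay: 20%"):
    print(raw, "->", normalize_answer(raw))
```

Masking the numbers rather than deleting them keeps the fact that a value was present, which may matter when deciding whether two answers describe the same concept.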

Data and additional code access

The data is available in the forums. In addition, the winning submissions of previous challenges that attempted to solve this problem are also included in the forums.

Important - You are free to start from scratch, or to use the provided submissions and extend/modify them to achieve the objectives of this challenge. Use of relevant techniques from NLP, Deep Learning and machine learning, such as Clustering and Hierarchical Clustering, is encouraged, but NOT mandatory.

A note to previous participants in this series - This challenge is a part of the Coverage Optimization challenge series. Note, however, that although the requirements in this challenge are very similar to those of the previous challenges, there can be some differences. Only the current requirement set should be considered, and, if required, previous winning submissions should be changed to make them suitable for this challenge before they are further extended or modified.

Data Description

The available dataset is not very large (~70MB). You will have access to the entire data set. Check the sample and metadata file available in the forums for a complete definition of all data fields.

Here is a definition of some terms that will help with understanding the provided data:

Benefit is basically an insurance option, which describes the coverage of a particular health care service by the insurance provider.
Coverage code is a unique identifier for a set of benefits/options provided as a group of services with actual start/end dates for the coverage.
Product consists of a set of coverage codes (and hence benefits) and is used internally to align coverage codes to internal rules and procedures.

Benefits data is the core data set for this challenge. Here is how this data is generated. Benefits configuration is arranged in a question/answer format on a website. Each benefit has a hard-coded question and several types of available answers to complete the question. For example, a question might be “Is this a High Deductible Health Plan?” and the user might have a choice of 2 check boxes, radio buttons, or a drop down with Yes/No toggles. The answer then becomes the combined Q&A statement, such as “No, this is not a high deductible health plan.” As another example, the question might be “The out of pocket maximum is:” and the user enters “$3000”, making the answer “The OPM is $3000.” Or finally, the prompt might be “Enter additional comments here”, in which case the user might enter free-form text and that free-form text becomes the answer. These sets of answers are rolled up to form all of the benefits for a specific coverage code.

The actual data set contains flat data records that:

List the insurance options/benefits (the 'answer' column) and question identifiers (sequence_id)
Connect benefits to coverage codes
Connect coverage codes to internal products
Give the start/end dates for the coverage

And also these useful columns:

Type_of_tag - information about the type of field presented in the software - radio button, checkbox, text input, dropdown
Value flag - this is only populated for records where type of tag is text - It denotes whether the user entry field is a text field (meaning all open free form text allowed) or it is a numeric field, meaning the answer may contain some text that is automatically generated by the software, but the user can only enter a numeric value.
Top50_flag - This just denotes that the coverage code is for a very important client

Note - All the columns can be considered in your analysis. In particular, the rows with ‘Type_of_tag = text input’ are one of our primary and most difficult targets for analysis, and it is surmised that NLP or machine/deep learning based techniques can be used to pre-process these free-text rows better and achieve better disambiguation. Before any analysis of purchase patterns is done, make sure the pre-processing is thorough, because the raw dataset contains many rows that effectively mean the same thing but have minor syntactic and numeric differences.
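
As a simple baseline for this disambiguation, before reaching for NLP or deep learning models, near-duplicate free-text answers can be grouped by plain string similarity. The sketch below uses Python's standard `difflib.SequenceMatcher` with a greedy first-match grouping; the `group_similar` helper, the 0.85 threshold and the sample answers are assumptions for illustration, not a method taken from the challenge or from previous submissions.

```python
from difflib import SequenceMatcher


def group_similar(answers, threshold=0.85):
    """Greedily put each answer into the first group whose representative is similar enough."""
    groups = []  # list of (representative, members) pairs
    for answer in answers:
        for representative, members in groups:
            ratio = SequenceMatcher(None, representative, answer).ratio()
            if ratio >= threshold:
                members.append(answer)
                break
        else:
            # No existing group matched, so this answer starts a new one.
            groups.append((answer, [answer]))
    return groups


# Hypothetical, already lower-cased free-text answers.
answers = [
    "copay applies per visit",
    "deductible waived for preventive care",
    "copay applies per visits",
]
for representative, members in group_similar(answers):
    print(representative, "->", members)
```

This comparison is quadratic in the number of distinct answers in the worst case, so it is better suited to the deduplicated, normalized answers than to the raw rows.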

Note that the quality of the results of this ‘variability reduction’ pre-processing will weigh heavily in the final subjective score, as will the quality of the generated reports and the quality of the code.

Final Submission Guidelines

What to Submit

A report detailing the techniques and algorithms used to analyze the provided data, discover purchase patterns, and derive the simplified configuration hierarchy of insurance options/benefits. The format should be PDF or Word or Markup.
A report depicting the simplified configuration hierarchy of Insurance options/benefits preferably in the questions and sub-questions format or topic and subtopic format. Any additional visual representation of the hierarchy will be useful but is not essential. The format should be CSV or Excel or JSON.
A file describing the mapping of each row in the provided raw dataset to the sub-question of the simplified configuration hierarchy that it is best associated with. If a particular row contains multiple insurance options, it should first be broken down into multiple rows in the pre-processing step, before this report is generated. The format should be CSV or Excel or JSON.
Command-line based script(s) to create the above reports and files. (Python code should target Python 3.6+ or Python 3.7+ and should be generally PEP-8 compliant, although strict compliance is not essential.)
Deployment guide with clear verification instructions.
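
For the hierarchy report listed above, one possible tabular layout is a row per question and option, with a parent identifier linking sub-questions to their question. The sketch below flattens a nested structure into CSV text; the `hierarchy` dictionary mirrors the illustrative transplant example from the overview, and the column names are assumptions rather than a required schema.

```python
import csv
import io

# Hypothetical hierarchy mirroring the illustrative example in the overview.
hierarchy = {
    "Q1": {
        "text": "Coverage for transplants?",
        "children": {
            "Q1.1": {
                "text": "Kidney Transplant coverage?",
                "options": ["Yes - 100% coverage", "Copay - 20%", "Copay - 10%", "No"],
            },
            "Q1.2": {
                "text": "Knee Transplant coverage?",
                "options": ["Yes - 100% coverage", "No"],
            },
        },
    },
}


def flatten(nodes, parent_id=""):
    """Yield one (question_id, parent_id, question_text, option) row per option."""
    for question_id, node in nodes.items():
        # Questions without their own options still get one row.
        for option in node.get("options") or [""]:
            yield question_id, parent_id, node["text"], option
        yield from flatten(node.get("children", {}), question_id)


def to_csv(nodes):
    """Render the flattened hierarchy as CSV text."""
    buffer = io.StringIO()
    writer = csv.writer(buffer, lineterminator="\n")
    writer.writerow(["question_id", "parent_id", "question_text", "option"])
    writer.writerows(flatten(nodes))
    return buffer.getvalue()


print(to_csv(hierarchy))
```

The row-to-sub-question mapping file could reuse the same `question_id` values as its key, so that the two reports can be joined.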

Hierarchical Simplification of Insurance Benefits Challenge

ID: 30094958