Coverage Optimization - Combine and Refine Tools








    The challenge is finished.

    Challenge Overview

    Challenge Objectives

    • Develop a tool to combine the results of two prior challenges

    • Create simple reports in Excel


    Project Background

    • Our client, Morel, wants to optimize their product offerings by consolidating their current product set.

    • In prior challenges, we built two tools:

      • The first merges related coverage codes and suggests a coverage code hierarchy to create a consolidated product set.

      • The second works on a different subset of the requirements: it cleans the raw benefits data, parametrizes and groups related benefits, and outputs simple Excel reports. This effectively reduces the variability of the benefits data.

    • This challenge will integrate the outputs of these two challenges, improve reporting, and create the final product set recommendations.



    Our client offers a variety of insurance products to large organizations, which opt for the one that best suits their needs. This is often achieved by customizing product details, and each resulting unique product must then be administered and managed. This has resulted in the following challenges:

    • Variation of processes across the organization

    • Increased complexity in providing customer service

    • Reduced ability to self-serve

    • An inefficient process of claims payment and pricing models

    • Difficulty in forecasting and providing optionality to customers


    The goal of Coverage Optimization is to:

    • Analyze the variability of parent products with reference to the unique child products formed due to benefits customizations. This analysis will be used to assist Morel in the definition of a target, consolidated product set that can represent a standard set of offerings with ‘configurable’ options

    • Recommend options for this envisioned consolidated product set along with justifications. This will include impact analysis based on utilization and current state product variability

    • Suggest a hierarchical and categorical benefit design with measurable reduction in variability of benefit language, resulting in standardization of offerings and optimal benefit package

    • Increase simplicity of standard benefit options while suggesting configurability options for benefits

    • Offer an applicable algorithm/process that is easily repeatable on similarly structured but different data, and that either:

      • Identifies new answers/benefits and aligns them to the suggested benefits appropriately, or

      • Signals the need for a new standard benefit option


    Data Description

    The available dataset is not very large (~70MB). You will have access to the entire data set. Check the sample and metadata file available in the forums for a complete definition of all data fields.


    Here is a definition of some terms that will help with understanding the provided data:

    • A benefit is coverage for various health care services.

    • A coverage code is a unique identifier for a set of benefits provided as a group of services, with actual start/end dates for the coverage.

    • A product consists of a set of coverage codes (and hence benefits) and is used internally to align coverage codes to internal rules and procedures.


    Benefits data is the core data set for this challenge. Here is how this data is generated. Benefits configuration is arranged in a question/answer format on a website. Each benefit has a hard-coded question and several types of available answers to complete it. For example, a question might be “Is this a High Deductible Health Plan?” and the user might have a choice of 2 check boxes, radio buttons, or a drop down with Yes/No toggles. The answer then becomes the statement combination of the Q&A, leaving “No, this is not a high deductible health plan.” In another example, the question might be “The out of pocket maximum is:” and the user enters “$3000”, leaving the answer to be “The OPM is $3000.” Finally, the prompt might be “Enter additional comments here”, in which case the user might enter free form text and that free form text becomes the answer. These sets of answers are rolled up to form all of the benefits for a specific coverage code.


    The actual data set contains flat data records that:

    • List the benefits (the answer column) and question identifiers (sequence_id)

    • Connect benefits to coverage codes

    • Connect coverage codes to internal products

    • Give the start/end dates for the coverage

    And also these useful columns:

    • Type_of_tag - information about the type of field presented in the software - radio button, checkbox, text input, dropdown

    • Value flag - this is only populated for records where type of tag is text. It denotes whether the user entry field is a text field (meaning any free form text is allowed) or a numeric field (meaning the answer may contain some text that is automatically generated by the software, but the user can only enter a numeric value).

    • Top50_flag - This just denotes that the coverage code is for a very important client
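As a rough illustration of how these columns might be used, the sketch below filters the flat records down to free-text benefits (type of tag "T" with value flag "text", the combination this challenge focuses on). The lowercase column names, the "R" tag code, and the sample rows are assumptions made for the example; check the metadata file in the forums for the real field definitions.

```python
import csv
import io

# Invented sample rows; column names and the "R" code are assumptions.
SAMPLE = """coverage_code,sequence_id,type_of_tag,value_flag,answer
C001,10,R,,"no, this is not a high deductible health plan"
C001,20,T,numeric,the opm is $3000
C001,30,T,text,waive copay for outpatient mhcd services
"""


def free_text_rows(handle):
    """Return only the records whose answer is free form text."""
    reader = csv.DictReader(handle)
    return [
        row for row in reader
        if row["type_of_tag"] == "T" and row["value_flag"] == "text"
    ]


rows = free_text_rows(io.StringIO(SAMPLE))
print([row["answer"] for row in rows])
```

With the real data, the same function could be given an open file handle instead of the in-memory sample.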


    Technology Stack


    • Python

    • Excel


    Code access


    The two winning submissions that you are to combine are attached to the Resources post in the forums.

    The winning submission of the ideation challenge is also available in the forums. It contains the details of what we’re trying to build in this project, so you should read that document to get a better idea of our overall goal.


    Individual requirements

    The value we are seeking is the ability to better break down and compare the information contained primarily in the text answers, together with the other formats (radio and dropdown), in order to inform a configuration structure as defined in the original winning submission.

    In this challenge, we will focus on consolidating the Variability Reduction and Frequency Analysis pieces. Ultimately, expectations are that future results would further break the text components into a set of simplified recommendations of statements, sub-statements, and levels or open text field entry suggestions.


    The output of this challenge is a Python tool (CLI) that includes:

    1. Merging the two tools
      Both tools take the raw benefits data file as input and produce a modified benefits file as output. The variability reduction tool modifies the benefit answer text by cleaning the data and parametrizing common text, while the frequency analysis tool modifies coverage codes by trying to merge coverage codes with similar benefits into one coverage code.
      The goal here is to create a CLI tool that has three commands:

    • Clean (cleans the benefits data)

    • Variability reduction (analyzing the answer texts)

    • Frequency analysis (merging coverage codes)

        All 3 commands should take the input and output file names as parameters, plus any additional command line parameters needed for each of the steps. The end user will chain the commands by feeding the output of one command to the next one, e.g.
        “python claimsCoverage.py -clean -input benefitsRaw.csv -output benefitsClean.csv”

        “python claimsCoverage.py -variabilityReduction -input benefitsClean.csv -output benefitsReduced.csv”

        “python claimsCoverage.py -frequencyAnalysis -input benefitsReduced.csv -output benefitsCombined.csv”
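One possible way to parse the command lines shown above is sketched here with the standard argparse module. The flag spellings are taken from the examples; the function name build_parser and the dest names are hypothetical, and any step-specific parameters would still need to be added.

```python
import argparse


def build_parser():
    """Build a parser for the three chained steps (a sketch only)."""
    parser = argparse.ArgumentParser(prog="claimsCoverage.py")
    # Exactly one of the three steps must be selected per invocation.
    step = parser.add_mutually_exclusive_group(required=True)
    for name in ("clean", "variabilityReduction", "frequencyAnalysis"):
        step.add_argument(
            "-" + name, dest="step", action="store_const", const=name
        )
    parser.add_argument("-input", dest="input_path", required=True)
    parser.add_argument("-output", dest="output_path", required=True)
    return parser


args = build_parser().parse_args(
    ["-clean", "-input", "benefitsRaw.csv", "-output", "benefitsClean.csv"]
)
print(args.step, args.input_path, args.output_path)
```

Dispatching on args.step to a separate module per step would also keep the code split into modules rather than one large file.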


    2. Improve benefit analysis for textual answer types
      The current variability reduction analysis provides good results in identifying parameters in benefits text that are added through radio buttons, checkboxes or dropdowns. That is the easy part, since all such benefits have the same structure with just placeholder values filled in from the form input. What we want to focus on in this challenge is textual answer benefits (type_of_tag=T and value_flag=text). Answer text for those benefits varies significantly, and most of the time looking for placeholder values won’t yield any useful results since the text is formulated differently, e.g.

    • using synonyms,

    • changed word order or skipped words (“additional copay information: mhcd urgent care copay is equal to the pcp copay amount:  $25” vs “additional copay information: mhcd urgent care copay is equal to the pcp copay: $25”),

    • adding multiple statements in the benefit text (e.g. “acupuncture is covered - benefit period limit: 25 visits..waive copay for outpatient mhcd services”). These clauses can even be split into multiple benefits during the data cleaning phase.

    • Some benefits have nearly identical text, but mean very different things in the insurance domain (e.g. “prescription drugs are paid at the ppo level” vs “prescription drugs are paid at the non ppo level”)
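For the multi-statement case in the list above, a minimal sketch of clause splitting during cleaning might look like the following. It assumes that a run of two or more periods separates statements, which is inferred only from the single acupuncture example; real data may need other delimiters. The function name split_clauses is hypothetical.

```python
import re


def split_clauses(answer):
    """Split an answer on runs of two or more periods."""
    parts = re.split(r"\.{2,}", answer)
    return [part.strip() for part in parts if part.strip()]


clauses = split_clauses(
    "acupuncture is covered - benefit period limit: 25 visits"
    "..waive copay for outpatient mhcd services"
)
print(clauses)
```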

        To summarize, the goal here is to update the variability reduction step with improvements to handling the textual benefits type and trying to determine answers that have different language, but are:

    • The same (same meaning, different text) - we can just use one version here

    • The same except for some placeholder value (similar meaning, different text, some placeholder value like a dollar amount or a yes/no value) - we can also use one version of the benefit here (with a “param” placeholder) and specify which values are used for the parameters. These benefits will be converted to dropdown/radio button/checkbox questions in the future.

    • Covering a similar area, but with a totally different meaning, so that a free form text field will still be required (i.e., we can’t formalize the benefit into dropdown/checkbox options)
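The three outcomes above could be approximated with a simple token-level comparison, sketched below using difflib from the standard library. This is not the method of either winning submission: the 0.9 threshold, the CRITICAL word list, the dollar-amount pattern, and the function names are all assumptions chosen for illustration, and a real solution would likely need domain-specific tuning.

```python
import re
from difflib import SequenceMatcher

# Assumed pattern for dollar amounts such as "$25" or "$1,000.50".
AMOUNT = re.compile(r"\$\s?\d+(?:,\d{3})*(?:\.\d+)?")
# Assumed list of words that flip the meaning of a benefit.
CRITICAL = {"no", "not", "non", "except", "excluding"}


def parametrize(text):
    """Replace amounts with 'param'; return word tokens and the amounts."""
    values = AMOUNT.findall(text)
    template = AMOUNT.sub(" param ", text.lower())
    return re.findall(r"[a-z0-9]+", template), values


def compare(a, b, threshold=0.9):
    """Classify two answers: same, same_except_param, or different."""
    tokens_a, values_a = parametrize(a)
    tokens_b, values_b = parametrize(b)
    # A meaning-flipping word present in only one text blocks a merge.
    if (set(tokens_a) ^ set(tokens_b)) & CRITICAL:
        return "different"
    ratio = SequenceMatcher(None, tokens_a, tokens_b).ratio()
    if ratio < threshold:
        return "different"
    return "same" if values_a == values_b else "same_except_param"


first = ("additional copay information: mhcd urgent care copay "
         "is equal to the pcp copay amount:  $25")
second = ("additional copay information: mhcd urgent care copay "
          "is equal to the pcp copay: $25")
print(compare(first, second))
print(compare("prescription drugs are paid at the ppo level",
              "prescription drugs are paid at the non ppo level"))
```

The critical-word check runs before the similarity ratio so that near-identical texts such as the ppo / non ppo pair are never merged on similarity alone.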


    The main requirement here is combining the two tools into one that contains the functionality of both. There is no objective score/metric for benefits grouping that we can use for review, so the reviews will be manual and based on the output of your algorithm.


    Create a README file with details on how to deploy and verify the tool, including the end-to-end data processing steps (raw data to frequency analysis). Unit testing is out of scope. Make sure your code follows the PEP-8 guidelines and is split into modules - don’t put everything into one giant module.

    Final Submission Guidelines

    Submit the full source code with documentation

    Submit the build/verification documentation

    Submit a short demo video and sample outputs of the tool

    Reliability Rating and Bonus

    For challenges that have a reliability bonus, the bonus depends on the reliability rating at the moment of registration for that project. A participant with no previous projects is considered to have no reliability rating, and therefore gets no bonus. Reliability bonus does not apply to Digital Run winnings. Since reliability rating is based on the past 15 projects, it can only have 15 discrete values.

