Living Progress - Data to Drops - Python Learning and Classification

Key Information

The challenge is finished.

Challenge Overview

Background

Millions of people around the world rely on water points for their daily existence. Too often the water points fail and communities are left without the water they desperately need. A lack of basic information on these failures has made keeping water flowing a major challenge for governments and aid agencies. Across the globe, many communities have come to rely on public water access points. People come to these water points and open the tap or work the handpump to fill their containers. As long as the water is flowing, they carry it home and use it for drinking, cooking, cleaning, bathing, and more. The water from these points is a critical foundation for success, health, and prosperity.

Unfortunately, these water points systematically fail due to technical breakdowns, water scarcity, vandalism, and misuse. When these water points fail, the very foundation for community wellbeing fails. People have to resort to distant water sources, dirty water, or water sold at exorbitant prices. A better understanding of the causes of failure will allow NGOs and governments to avoid these failures and ensure that water services last over time.

The recently launched Water Point Data Exchange (WPDx) has made significant progress in analyzing these failures and establishing a path forward to lasting services. WPDx consists of a data exchange standard and a central repository of compliant data. The water point data is aggregated from governments, NGOs, academia, and other sources and then standardized for integration into the central repository. This unprecedented library of information is already providing a foundation for improved research and effective policies to help keep water flowing. The major limitation of WPDx is the presence of several open text fields among the standardized attributes. These fields (such as water point status and water point type) allow for much needed flexibility, but severely curtail analysis. The solution developed in this challenge will provide secondary processing on the WPDx data to convert those open text values into meaningful categories that allow for analysis.

This challenge is part of the HPE Living Progress Challenge Blitz Program. Secure a top placement on the leaderboard to win additional cash prizes.

Requirements

In a previous challenge, the Topcoder community developed a fascinating array of algorithms to categorize a provided set of water source and water technology values. Many of the algorithms performed extremely well, and some even categorized the unseen data in our testing data set perfectly. However, when the top solutions were tested against a broader set of inputs, their accuracy fell to roughly 75-80%. This is understandable: the solutions were tailored to the data we’d provided.

In this challenge, we’re going to take the problem to the next step. The WPDx will continue to receive records from new sources, and we’d like to incorporate the learning process itself into the applications provided. Your job in this challenge is twofold. We still want you to develop and refine the classification algorithms submitted in the previous challenge. But first, we would like you to automate the “learning” process. The suggestion here is that you submit two Python scripts. The first performs the learning step (weighting, association analysis, branch analysis, etc.). The second does the actual classification. We’ll provide 2516 records from the whole dataset, which will allow you to validate your solution.
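As an illustration of what an automated learning step could look like, here is a minimal sketch of one possible weighting approach: count how often each token co-occurs with each category in labeled examples, then let tokens vote at classification time. The function names (`tokenize`, `train`, `classify`) and the tiny example set are hypothetical and are not taken from the challenge data or from any earlier submission.

```python
import re
from collections import Counter, defaultdict


def tokenize(text):
    """Lower-case the text and split it into alphabetic tokens."""
    return re.findall(r"[a-z]+", (text or "").lower())


def train(examples):
    """Learn token weights from (text, category) pairs.

    Returns a mapping: token -> Counter of categories it appeared with.
    """
    counts = defaultdict(Counter)
    for text, category in examples:
        for token in set(tokenize(text)):
            counts[token][category] += 1
    return counts


def classify(text, counts, default="Null"):
    """Pick the category with the highest summed token vote share."""
    scores = Counter()
    for token in set(tokenize(text)):
        seen = counts.get(token)
        if not seen:
            continue  # token never appeared in training data
        total = sum(seen.values())
        for category, n in seen.items():
            scores[category] += float(n) / total
    if not scores:
        return default
    # sorted() makes tie-breaking deterministic (alphabetical order).
    return max(sorted(scores), key=scores.get)


# Hypothetical labeled examples, for illustration only.
examples = [
    ("Borehole handpump", "Borehole or tubewell"),
    ("tube well afridev", "Borehole or tubewell"),
    ("rain water harvesting", "Rainwater"),
    ("river", "Surface water"),
]
model = train(examples)
```

With this toy model, `classify("Hand pump borehole", model)` returns "Borehole or tubewell", while text made up entirely of unseen tokens falls back to "Null". Because the weights come from the data rather than from hand-written rules, retraining on new records requires no code changes.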

The data can be accessed here:  
https://drive.google.com/file/d/0ByjxTGykXQjAU2w2VVpwRlpzWnc/view?usp=sharing

We’ll test the solutions with the data above plus 5000 records that are not provided to you in advance. The solutions will be evaluated for accuracy: fifty percent of a submission’s score will be based on the accuracy of its categorization. The accuracy metric is fairly simple:

Accuracy = (number of correct responses) / (number of total responses)
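For local validation, the metric can be computed with a few lines of Python. This is a sketch only; the function name `accuracy` and the choice to return 0.0 for an empty input are assumptions, not part of the challenge specification.

```python
def accuracy(predicted, expected):
    """Fraction of positions where the predicted category equals the expected one."""
    if len(predicted) != len(expected):
        raise ValueError("predicted and expected must have the same length")
    if not expected:
        return 0.0  # assumption: treat an empty evaluation set as zero accuracy
    correct = sum(1 for p, e in zip(predicted, expected) if p == e)
    # float() keeps the division exact under Python 2.7 as well.
    return float(correct) / len(expected)
```

For example, three correct answers out of four responses give an accuracy of 0.75.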

What we’re doing in this challenge is mapping the values in the #water_tech and #water_source fields to a cleaned and distinct set of water source types. The source data is quite messy and the water source info may be found in either field. The typical strategy for dealing with this is to concatenate the two fields and search the concatenated string for matching keywords. It won’t be possible to categorize every record: there are null values even in the training data, which is why Null appears in the list of types.
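The concatenate-and-search strategy can be sketched as follows. The keyword table here is hypothetical and hand-written purely for illustration; in an actual submission it would have to be produced by the training step rather than coded manually. Note that the order of entries matters, because the string "unprotected spring" contains "protected spring".

```python
# Hypothetical keyword table: (keyword, category), checked in order.
KEYWORDS = [
    ("unprotected spring", "Unprotected spring"),  # must precede "protected spring"
    ("protected spring", "Protected spring"),
    ("borehole", "Borehole or tubewell"),
    ("tubewell", "Borehole or tubewell"),
    ("rain", "Rainwater"),
    ("river", "Surface water"),
]


def categorize(water_source, water_tech):
    """Concatenate both fields and return the first matching category."""
    text = " ".join([water_source or "", water_tech or ""]).lower()
    for keyword, category in KEYWORDS:
        if keyword in text:
            return category
    return "Null"  # nothing matched, or both fields were empty
```

Because both fields are searched together, a record with an empty #water_source and "Borehole with handpump" in #water_tech is still categorized as "Borehole or tubewell".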

Here is the list of water source types you’ll be mapping from the #water_source and #water_tech fields: 

Borehole or tubewell
Null 
Piped into public tap or basin
Piped into yard/plot
Protected dug well
Protected spring
Public tap or standpipe
Rainwater
Surface water
Unprotected dug well
Unprotected spring

Additional Requirements

- You should use Python 2.7 to complete this application.  
- Please name your training Python script training.py.  
- Please name your classification Python script classification.py.  
- The classification.py script should take two command line parameters. The 1st parameter should be the file path of the input file. The 2nd is the file path of the output file.
- The output file should have four columns: Row ID, #water_source, #water_tech, and “Water Source Type”. Water Source Type is the category you are assigning.
- The training and test files are in csv format.  Your app should be able to read and write this format.
- Your training method should not require manual intervention (e.g. coding if/else statements in the classification.py file) beyond the execution of the training.py script itself.
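The command line and file format requirements above might be met by a skeleton along these lines. This is a sketch under assumptions: the input file is assumed to use the same column headers as the output ("Row ID", "#water_source", "#water_tech"), and `placeholder_classify` is a hypothetical stand-in for logic driven by whatever training.py produces.

```python
import csv
import sys

OUTPUT_COLUMNS = ["Row ID", "#water_source", "#water_tech", "Water Source Type"]


def open_csv(path, mode):
    """Open a CSV file the way the csv module expects on Python 2.7 or 3."""
    if sys.version_info[0] < 3:
        return open(path, mode + "b")
    return open(path, mode, newline="")


def placeholder_classify(text):
    """Hypothetical stand-in; a real version would use the trained model."""
    return "Null"


def classify_file(input_path, output_path, classify):
    """Read the input CSV, classify each row, and write the four output columns."""
    with open_csv(input_path, "r") as src, open_csv(output_path, "w") as dst:
        reader = csv.DictReader(src)
        writer = csv.DictWriter(dst, fieldnames=OUTPUT_COLUMNS)
        writer.writeheader()
        for row in reader:
            source = row.get("#water_source") or ""
            tech = row.get("#water_tech") or ""
            writer.writerow({
                "Row ID": row.get("Row ID") or "",
                "#water_source": source,
                "#water_tech": tech,
                "Water Source Type": classify(source + " " + tech),
            })


def main(argv):
    """argv[1] is the input file path, argv[2] the output file path."""
    if len(argv) != 3:
        sys.stderr.write("usage: python classification.py <input.csv> <output.csv>\n")
        return 2
    classify_file(argv[1], argv[2], placeholder_classify)
    return 0
```

A complete classification.py would end by calling `sys.exit(main(sys.argv))` under the usual `if __name__ == "__main__":` guard, so that it can be run as `python classification.py input.csv output.csv`.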

Final Submission Guidelines

Submission Deliverables

1. Please submit all code required by the application in your submission.zip
2. Document the build process for your code, including all dependencies (pip installs, etc.).
3. Provide instructions on how to execute your application.

Review Style

Final Review: Community Review Board
Approval: User Sign-Off

ID: 30054554