Living Progress - Data to Drops - Keyword Categorization

Key Information

The challenge is finished.

Challenge Overview

Project Overview

Millions of people around the world rely on water points for their daily existence. Too often the water points fail, and communities are left without the water they desperately need. A lack of basic information on these failures has made keeping water flowing a major challenge for governments and aid agencies. Across the globe, many communities have come to rely on public water access points. People come to these water points and open the tap or work the handpump to fill their containers. As long as the water is flowing, they carry it home and use it for drinking, cooking, cleaning, bathing, and more. The water from these points is a critical foundation for success, health, and prosperity.

Unfortunately, these water points systematically fail due to technical breakdowns, water scarcity, vandalism, and misuse. When these water points fail, the very foundation for community wellbeing fails. People have to fall back on distant water sources, dirty water, or water sold at exorbitant prices. A better understanding of the causes of failure will allow NGOs and governments to avoid these failures and ensure that water services last over time.

The recently launched Water Point Data Exchange (WPDx) has made significant progress in analyzing these failures and establishing a path forward to lasting services. WPDx consists of a data exchange standard and a central repository of compliant data. The water point data is aggregated from governments, NGOs, academia, and other sources and then standardized for integration into the central repository. This unprecedented library of information is already providing a foundation for improved research and effective policies to help keep water flowing. The major limitation of WPDx is the presence of several open text fields among the standardized attributes. These fields (such as water point status and water point type) allow for much-needed flexibility, but severely curtail analysis. This solution will provide secondary processing on the WPDx data to convert those open text values into meaningful categories that allow for analysis.

This challenge is part of the HPE Living Progress Challenge Blitz Program. Secure a top placement on the leaderboard to win additional cash prizes.

Competition Task Overview

Before diving into more sophisticated analysis and categorization mechanisms, we’re going to develop a simple keyword matching solution to clean and tag the data from three fields in the raw WPDx data set. The data is well-suited to this type of tagging, so we want to use this mechanism as a baseline. Your solution should try to deal with misspellings and should be case-insensitive. Please do not implement a Bayesian algorithm or attempt a neural network. Regular expressions are fine, though. We’ve selected a set of 4000 records from the whole dataset which will allow you to validate your solution.
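As a rough illustration of case-insensitive keyword matching that tolerates a few misspellings, here is a minimal sketch using Python's re module. The pattern, the spelling variants it accepts, and the function name mentions_borehole are all hypothetical; they are not derived from the actual data set and would need tuning against it.

```python
import re

# Hypothetical pattern (an assumption, not taken from the data): accepts
# "borehole", "bore hole", "bore-hole", "borhole", "borehol", plural forms,
# and doubled letters such as "borrehole" or "boreholle".
BOREHOLE = re.compile(r"\bbor+e?[\s-]*hol+e?s?\b", re.IGNORECASE)


def mentions_borehole(text):
    """Return True if the text appears to mention a borehole."""
    # Treat None (a null field) as an empty string.
    return bool(BOREHOLE.search(text or ""))
```

The re.IGNORECASE flag handles the case-insensitivity requirement, while the optional and repeated characters in the pattern absorb some common typing errors without any statistical model.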

 

The data can be accessed here: https://drive.google.com/file/d/0ByjxTGykXQjAa1FCRV9BNEYyQjA/view?usp=sharing

 

We’ll test the solutions with approximately 1000 records which are not provided to you in advance. Fifty percent of the score for each submission will be based on the accuracy of your categorization efforts. The accuracy metric is fairly simple:

 

Accuracy = (# of correct responses) / (# of total responses)
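For local validation against the provided records, the metric can be computed with a few lines of Python. This is only a sketch: the function name accuracy is hypothetical, and it assumes a response counts as correct only when the predicted category string equals the expected one exactly, which the challenge text does not spell out.

```python
def accuracy(predicted, expected):
    """Fraction of predictions that equal the expected category exactly."""
    if len(predicted) != len(expected):
        raise ValueError("predicted and expected must have the same length")
    if not expected:
        return 0.0  # Assumption: an empty evaluation set scores zero.
    correct = sum(1 for p, e in zip(predicted, expected) if p == e)
    # float() keeps the division exact on Python 2.7 as well as Python 3.
    return float(correct) / len(expected)
```

For example, two correct categories out of three responses gives an accuracy of 2/3.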

 

What we’re doing in this challenge is mapping the values in the #water_tech and #water_source fields to a cleaned and distinct set of water source types. The source data is quite messy, and the water source info may be found in either field. The typical strategy for dealing with this is to concatenate the fields and search the concatenated string for possible matching keywords. It won’t be possible to categorize every record: there are null values even in the training data.
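The concatenation step described above could look like the following sketch. It assumes each record is available as a dictionary keyed by column name and that the columns are literally named #water_source and #water_tech; the exact header names in the data file may differ, and combined_text is a hypothetical helper name.

```python
def combined_text(row, fields=("#water_source", "#water_tech")):
    """Join the two free-text fields into one lowercase search string."""
    # Missing keys and None values are treated as empty strings so that
    # null fields do not break the concatenation.
    parts = [(row.get(name) or "").strip() for name in fields]
    return " ".join(p for p in parts if p).lower()
```

Lowercasing here is one way to make later keyword searches case-insensitive; a record whose fields are both empty yields an empty string, which a matcher can then map to the Null category.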

 

Here is the list of water source types you’ll be mapping from the #water_source and #water_tech fields:

Borehole or tubewell

Null

Piped into public tap or basin

Piped water to yard/plot

Protected dug well

Protected spring

Public tap or standpipe

Rainwater

Surface water

Unprotected dug well

Unprotected spring
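One possible way to organize the keyword matching is an ordered list of rules, where the first matching pattern decides the category. The sketch below is deliberately incomplete and hypothetical: the keywords are illustrative guesses rather than values taken from the data, only some of the categories above are covered, and returning "Null" for unmatched text is an assumption.

```python
import re

# Order matters: "unprotected" rules come before "protected" rules so that
# the more specific label wins. All keywords are illustrative assumptions.
RULES = [
    ("Unprotected dug well", r"\bunprotected\b.*\bwell\b"),
    ("Unprotected spring", r"\bunprotected\b.*\bspring\b"),
    ("Protected dug well", r"\bprotected\b.*\bwell\b"),
    ("Protected spring", r"\bprotected\b.*\bspring\b"),
    ("Borehole or tubewell", r"\bbore\s*hole|\btube\s*well"),
    ("Rainwater", r"\brain"),
    ("Surface water", r"\b(?:river|lake|pond|stream)\b"),
]
COMPILED = [(label, re.compile(pattern, re.IGNORECASE))
            for label, pattern in RULES]


def categorize(text):
    """Return the first category whose pattern matches, else "Null"."""
    for label, pattern in COMPILED:
        if pattern.search(text or ""):
            return label
    return "Null"
```

Note that these patterns expect the qualifier before the noun ("unprotected ... well"), so text such as "well, unprotected" would fall through to "Null" unless further rules are added.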

Additional Requirements

1. You should use Python 2.7 to complete this application.
2. Please name your Python script keyword.py.
3. The script should take two command line parameters. The first parameter is the file path of the input file. The second is the file path of the output file.
4. The output file format should be exactly the same as the input file format, with one additional column: “Water Source Types”.
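A skeleton that satisfies the command line and output requirements above might look like this. It is a sketch under stated assumptions: the input is assumed to be a comma-separated file with a header row (the actual file format is not stated here), classify is a hypothetical placeholder that always returns "Null", and open_csv is a small helper added so the same code runs on Python 2.7 and Python 3.

```python
import csv
import sys

NEW_COLUMN = "Water Source Types"


def classify(row):
    """Hypothetical placeholder; a real solution applies keyword rules."""
    return "Null"


def open_csv(path, mode):
    # The csv module wants binary files on Python 2 and newline="" on 3.
    if sys.version_info[0] < 3:
        return open(path, mode + "b")
    return open(path, mode, newline="")


def process(in_path, out_path):
    """Copy every input row unchanged and append the category column."""
    with open_csv(in_path, "r") as src, open_csv(out_path, "w") as dst:
        reader = csv.reader(src)
        writer = csv.writer(dst)
        header = next(reader, None)
        if header is None:
            return  # Empty input: nothing to write.
        writer.writerow(header + [NEW_COLUMN])
        for values in reader:
            row = dict(zip(header, values))
            writer.writerow(values + [classify(row)])


if __name__ == "__main__":
    if len(sys.argv) == 3:
        process(sys.argv[1], sys.argv[2])
    else:
        sys.stderr.write("usage: python keyword.py INPUT_FILE OUTPUT_FILE\n")
```

Saved as keyword.py, it would be invoked as: python keyword.py input.csv output.csv (both file names are placeholders).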

Technology Overview

Linux
Python 2.7
Data Science


Final Submission Guidelines

Submission Deliverables

1. Please submit all code required by the application in your submission.zip.
2. Document the build process for your code, including all dependencies (pip installs, etc.).

ELIGIBLE EVENTS:

2016 TopCoder(R) Open

REVIEW STYLE:

Final Review:

Community Review Board

Approval:

User Sign-Off

ID: 30054204