EPA PM - Algorithmic Prediction Challenge Marathon Match Problem Statement







    The challenge is finished.

    Challenge Overview


    The Environmental Protection Agency (EPA) has asked TopCoder to develop an algorithm that predicts the occurrence of cyanobacterial blooms in U.S. lakes. We will do so by running a Marathon Match. Your task in this contest is to write a problem statement for the Marathon Match based on the available data sources.

    Project background

    The EPA is a U.S. federal government agency devoted to safeguarding the environment. One of the EPA's great concerns is the proliferation of cyanobacterial harmful algal blooms (cyanoHABs) in the nation's lakes. The following resources provide information on what cyanoHABs are and how they threaten the environment.

    The TopCoder project on cyanoHABs aims to develop an algorithm that will be deployed in an Android app with mapping and data visualization capabilities. The app will inform local and federal policy makers about locations where bloom events are likely to occur, allowing them to concentrate their efforts in those areas.

    Data sources

    The EPA has provided us with two sets of cyanobacterial data spanning the time period from February 2009 to April 2012. One data set is synthetic and the other is empirical.

    The synthetic data set consists of MERIS-derived estimates of cyanobacterial concentration. We will refer to these as the MERIS estimates. The data is provided as a sequence of image files covering three regions of the United States:

    • New England
    • Ohio
    • Florida

    These images were derived from satellite photographs by applying an experimental formula to estimate the concentration of cyanobacteria in each 300-by-300-meter area of the covered region. A sample image for each region is attached to this contest specification.

    The empirical data set consists of onsite measurements of cyanobacterial concentration, which we will call the field measurements. This is a time series of measurements taken at various locations within the same time span and the same regions covered by the MERIS estimates. The temporal and spatial coverage is very sparse. However, the field measurements are valuable because they are the only empirical readings that we can use to confirm the MERIS estimates.

    In addition to the cyanobacterial data, we have several sets of data covering the same regions.

    • Weather data: daily readings of temperature, air pressure, and other meteorological measurements
    • National Land Cover 2006: a one-time survey of land usage (residential, industrial, agricultural)
    • CropScape 2009-2012: annual surveys of what crops were cultivated in agricultural areas

    The weather data is quite coarse, describing cells covering an area of one degree of latitude by one degree of longitude. The National Land Cover and CropScape data sets have a high resolution equaling that of the MERIS estimates. Agricultural data is important because the runoff from fertilizer use is the principal contributor to cyanobacterial growth.
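
    To illustrate the gap between the two resolutions, the following minimal sketch maps a coordinate to a one-degree weather cell and estimates how many 300-by-300-meter pixels fall inside that cell. It assumes that weather cells are aligned to whole-degree boundaries and that one degree of latitude spans roughly 111.32 km; neither detail comes from the data sets themselves. The function names are hypothetical.

```python
import math

# Assumption: approximate length of one degree of latitude, in meters.
METERS_PER_DEGREE_LAT = 111_320.0
# Pixel size of the MERIS estimates, as stated in this specification.
PIXEL_SIZE_M = 300.0


def weather_cell(lat: float, lon: float) -> tuple[int, int]:
    """Return the (lat, lon) index of the one-degree cell containing a point.

    Assumption: cells are aligned to whole-degree boundaries, so the cell
    index is the floor of each coordinate.
    """
    return math.floor(lat), math.floor(lon)


def pixels_per_weather_cell(lat: float) -> float:
    """Estimate how many 300 m pixels fit in a one-degree cell at a latitude.

    A degree of longitude shrinks by cos(latitude), so cells hold fewer
    pixels farther from the equator.
    """
    height_m = METERS_PER_DEGREE_LAT
    width_m = METERS_PER_DEGREE_LAT * math.cos(math.radians(lat))
    return (height_m / PIXEL_SIZE_M) * (width_m / PIXEL_SIZE_M)


if __name__ == "__main__":
    # Hypothetical point in the New England region.
    print(weather_cell(41.7, -71.3))
    print(round(pixels_per_weather_cell(41.7)))
```

    Under these assumptions, a single weather reading is shared by on the order of a hundred thousand MERIS pixels, so weather features can only distinguish lakes at a regional scale.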

    Prediction goals

    The EPA has defined four levels of cyanobacterial concentration:

    • Low: 10,000 to 109,999 cyanobacterial cells per milliliter
    • Medium: 110,000 to 299,999 cells / mL
    • High: 300,000 to 999,999 cells / mL
    • Very High: 1,000,000 cells / mL or higher
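
    As a minimal sketch, the thresholds above can be expressed as a lookup from a concentration to a level name. The function name and the "Negligible" label for readings below 10,000 cells / mL are assumptions made for illustration. The sketch also assumes that each level begins exactly at its listed lower bound, so a fractional value such as 109,999.5 counts as Low.

```python
# Lower bounds (cells per mL) of the four EPA levels, highest first.
LEVEL_LOWER_BOUNDS = [
    (1_000_000, "Very High"),
    (300_000, "High"),
    (110_000, "Medium"),
    (10_000, "Low"),
]


def classify_concentration(cells_per_ml: float) -> str:
    """Map a cyanobacterial concentration to an EPA level name.

    Assumption: readings below the Low threshold are labeled "Negligible".
    """
    if cells_per_ml < 0:
        raise ValueError("concentration cannot be negative")
    for lower_bound, name in LEVEL_LOWER_BOUNDS:
        if cells_per_ml >= lower_bound:
            return name
    return "Negligible"


if __name__ == "__main__":
    for value in (5_000, 10_000, 250_000, 999_999, 2_500_000):
        print(value, classify_concentration(value))
```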

    The EPA's goals are to predict the following events at intervals of 7, 14, and 28 days into the future:

    • Areas of Low level reaching higher levels, especially Very High
    • Areas that formerly had a negligible level now reaching the Low level
    • Areas of Very High cyano concentration either persisting or declining
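
    One possible way to turn these goals into labels for a match is sketched below: given the level of an area today and its level at a future horizon, the function names the event of interest, if any. The event names, the "Negligible" label, and the function itself are hypothetical and are not part of the EPA's definitions.

```python
from typing import Optional

# Prediction horizons named in this specification, in days.
HORIZONS_DAYS = (7, 14, 28)

# Ordering of levels; "Negligible" (below Low) is an assumed label.
LEVEL_ORDER = {"Negligible": 0, "Low": 1, "Medium": 2, "High": 3, "Very High": 4}


def label_event(current: str, future: str) -> Optional[str]:
    """Name the event for one area and one horizon, or None if not of interest."""
    if current not in LEVEL_ORDER or future not in LEVEL_ORDER:
        raise ValueError("unknown level name")
    if current == "Low" and LEVEL_ORDER[future] > LEVEL_ORDER["Low"]:
        # Escalation from Low; a jump to Very High is singled out.
        return "low_to_very_high" if future == "Very High" else "low_escalation"
    if current == "Negligible" and future == "Low":
        return "onset"
    if current == "Very High":
        if future == "Very High":
            return "very_high_persisting"
        return "very_high_declining"
    return None


if __name__ == "__main__":
    # Hypothetical levels observed for one area at each horizon.
    observed = {7: "Low", 14: "Medium", 28: "Very High"}
    for days in HORIZONS_DAYS:
        print(days, label_event("Low", observed[days]))
```

    Labeling each area once per horizon keeps the 7-, 14-, and 28-day predictions independent, which may make it easier to score them separately.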

    According to the EPA's calculations, the Low and Very High readings in the MERIS estimates are accurate and the readings at the intermediate levels (Medium and High) are not.

    What to submit

    In this contest, we are looking for detailed ideas on how to conduct this Marathon Match, considered from several angles. The contest is NOT limited to writing problem statements.

    Please submit a document containing one or more Marathon Match contest ideas and problem statements. You must describe the input, output, and scoring formula to be used in the match. Also describe in detail how you would conduct the Marathon Match and what setup will be necessary beforehand. List all the components that you think need to be built and any data preparation that needs to be done to conduct a successful Marathon Match.


    If you are submitting several different ideas, please label them A, B, C, and so on.
    In addition to writing prospective problem statements, you may add a section describing the difficulties you encountered in developing your ideas and how you tried to address them. You may explain your decisions, offer alternative choices, and suggest further ways to improve the match.

    Final Submission Guidelines

    File format

    • Any widely supported document format will be accepted, such as OpenDocument, RTF, or HTML
    • If your document contains embedded images, please enclose the image files separately as well
    • Submissions will be evaluated by the client and TopCoder personnel.

    Reliability Rating and Bonus

    For challenges that have a reliability bonus, the bonus depends on the reliability rating at the moment of registration for that project. A participant with no previous projects is considered to have no reliability rating, and therefore gets no bonus. Reliability bonus does not apply to Digital Run winnings. Since reliability rating is based on the past 15 projects, it can only have 15 discrete values.


    Final Review:

    Community Review Board


    User Sign-Off