1. Project Overview
The EPA is a U.S. federal government agency devoted to safeguarding the environment. One of the EPA's great concerns is the proliferation of cyanobacterial harmful algal blooms (cyanoHABs) in the nation's lakes. The following resources provide information on what cyanoHABs are and how they threaten the environment.
The TopCoder project on cyanoHABs aims to develop an algorithm that will be deployed in an Android app with mapping and data visualization capabilities. The app will inform local and federal policy makers about locations where bloom events are likely to occur, allowing them to concentrate their efforts in those areas.
2. Contest Overview
The algorithm mentioned above will be developed in a marathon match. A major part of setting up this match is preparing the data that will be used to develop and test the prediction models.
We recently ran a couple of long data collection contests to gather data from different sources. This data includes various features that are expected to affect cyanobacterial bloom formation.
Our next step is to unify all the available data into a single structure so that it can be presented to marathon match members in an easy-to-use form and fed directly to the algorithm as input.
The major issue is that we have a lot of data that requires unification, and after the project finishes the clients will continue to receive data that requires the same kind of unification. Hence we would like to automate the process by developing a data standardization component.
The purpose of this component contest is to develop a highly efficient component that reads data arranged in different structures and granularities and unifies it into a single structure.
Details of input and output:
We have two types of input data:
- Image Data: GeoTiff images that carry geo-encoded information.
- Text Data: Excel files containing data.
The data volume is very large. For this contest we will provide sample files of each dataset in the forum, and your job is to build a component that automatically converts them into a data set with a single structure. The reviewers will then use the full data set to test your component.
Please Note: The images will always be GeoTiff. Various libraries are available to read data from GeoTiff images; one such reference is the GDAL library. You can use any other library of your choice, but please get confirmation in the forums first. The library must also have an open source license and be free to use in this component and for future purposes.
The output file format can be any of .csv, .ods or .xls.
We want the output data format to be:
year (int), dayOfYear (int), latitude (double), longitude (double), [daytime in hours (double), optional], other data (list of doubles, separated by "," and with "." as the decimal separator)
The other data parameter in the above format can span more than one column, depending on the available data. It must contain all the features present in the data file.
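The row layout above can be sketched in Python as follows. This is only an illustration under stated assumptions, not the required implementation: the names `format_value`, `make_row` and `write_rows` are hypothetical, CSV is picked as one of the allowed output formats, and the optional daytime column is assumed to be either present for every row of a dataset or absent for all of them.

```python
import csv
import io
import math
from datetime import date

NA = "n.a."  # placeholder the specification asks for when a value is unavailable


def format_value(value):
    """Render a double with '.' as the decimal separator, or NA if missing."""
    if value is None or math.isnan(value):
        return NA
    # repr() of a float is locale-independent and always uses '.'
    return repr(float(value))


def make_row(when, latitude, longitude, other, daytime_hours=None):
    """Build one output row: year, dayOfYear, lat, lon, [daytime], other data."""
    row = [
        str(when.year),
        str(when.timetuple().tm_yday),  # day of year, 1-based
        format_value(latitude),
        format_value(longitude),
    ]
    if daytime_hours is not None:  # assumption: decided per dataset, not per row
        row.append(format_value(daytime_hours))
    row.extend(format_value(v) for v in other)
    return row


def write_rows(rows, stream):
    """Write rows as comma-separated text to an open text stream."""
    writer = csv.writer(stream, lineterminator="\n")
    writer.writerows(rows)


if __name__ == "__main__":
    out = io.StringIO()
    write_rows([make_row(date(2012, 3, 1), 41.5, -83.0, [12.25, None])], out)
    print(out.getvalue(), end="")
```

The order and base unit of the values passed in `other` would still need to be documented separately for each dataset, as the notes in this section require.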
Please note the following points:
- An important part is that the authors of the conversion libraries document in which order and in which base unit the other data is encoded.
- In general, all data should be saved in SI-derived units. (Please give cyano cell counts per mL instead of per liter, as the latter would produce very large numbers.)
- For the image files, all points that don't have usable data (e.g. clouds in the images) should be left out.
- All other unavailable data should have "n.a." in the place where the double value would normally be located.
- In the image files, always use the center of the pixel for position data.
- Every non-numeric value that may be needed, such as the crops grown at a certain place, should be replaced by an integer representing the value. For each variable encoded in this way, a table describing the one-to-one correspondence between the values and their replacements should be given.
- Most strings appearing in the data, such as a position description (e.g. "Lake x near village y"), can be ignored and left out as long as latitude and longitude are available.
- If any data is unclear, please ask in the forums.
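Three of the rules above (pixel centers, skipping unusable pixels, and integer codes for non-numeric values) can be sketched as shown below. This is a minimal illustration under assumptions, not the design document's method. The six-element geotransform follows the affine convention used by GDAL (origin x, pixel width, row rotation, origin y, column rotation, pixel height). The raster is assumed to be a plain list of rows already read from the image, in a geographic coordinate system so that x is longitude and y is latitude, with unusable pixels marked by a single no-data value. The names `pixel_center`, `iter_valid_pixels` and `CategoryEncoder` are hypothetical, as is the choice to start codes at 0.

```python
def pixel_center(geotransform, col, row):
    """Return (x, y) of the center of pixel (col, row) via the affine transform."""
    x0, dx, rx, y0, ry, dy = geotransform
    px = col + 0.5  # shift from the pixel's top-left corner to its center
    py = row + 0.5
    x = x0 + px * dx + py * rx
    y = y0 + px * ry + py * dy
    return x, y


def iter_valid_pixels(grid, geotransform, nodata):
    """Yield (latitude, longitude, value) for every pixel with usable data."""
    for row_index, row_values in enumerate(grid):
        for col_index, value in enumerate(row_values):
            if value == nodata:
                continue  # e.g. cloud-covered pixels are left out entirely
            lon, lat = pixel_center(geotransform, col_index, row_index)
            yield lat, lon, value


class CategoryEncoder:
    """Map non-numeric values to integers and keep the lookup table."""

    def __init__(self):
        self.codes = {}

    def encode(self, label):
        key = label.strip().lower()  # assumption: case and padding are not significant
        if key not in self.codes:
            self.codes[key] = len(self.codes)
        return self.codes[key]

    def table(self):
        """Return (value, code) pairs ordered by code, for the documentation."""
        return sorted(self.codes.items(), key=lambda item: item[1])


if __name__ == "__main__":
    gt = (-84.0, 0.01, 0.0, 42.0, 0.0, -0.01)
    grid = [[1.0, -9999.0], [-9999.0, 4.0]]
    for point in iter_valid_pixels(grid, gt, nodata=-9999.0):
        print(point)
    crops = CategoryEncoder()
    print([crops.encode(name) for name in ["Corn", "Wheat", "corn "]])
    print(crops.table())
```

If an image uses a projected coordinate system, the pixel centers would additionally need to be reprojected to latitude and longitude, which this sketch does not cover.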
3. Technology Overview
You are allowed to use any language to develop this component. As you can see, it will process a large amount of data, so we would like it to be as fast, efficient and accurate as possible. Please use all your coding skills to optimize this component.
4. Documentation Provided
The following documents have been provided in the forums. You will be able to access them after registration:
Design Document (The design document is fully updated -- please follow it if there is any difference with the above requirements)
Sample Image data files (Can be found in design document thread - data samples)
Sample Text data files (Can be found in design document thread - data samples)