Welcome to the follow-up round of the EPA ToxCast LEL Prediction Challenge Marathon Match. After a tight competition in the Marathon Match (MM), we invite you to showcase the scientific significance of the solutions and algorithms that you designed and implemented. This part of your solution will help us understand how the data contributes to the computation and prediction of LEL, so it is very important to us and to our clients, and we have three big prizes to be won.
1. ToxCast Project Background
In 2005, the federal government launched Tox21, an initiative to use in vitro high-throughput screening (HTS) to identify which proteins, pathways, and cellular processes chemicals interact with, and at what concentrations. The goal is to use the screening data to prioritize, more cost-effectively and efficiently, the thousands of chemicals that need toxicity testing and, in the future, to predict the potential human health effects of chemicals. Tox21 currently pools the resources and expertise of the EPA, the National Institute of Environmental Health Sciences/National Toxicology Program, the National Institutes of Health/National Center for Advancing Translational Sciences, and the Food and Drug Administration to screen almost 10,000 chemicals.
One of EPA’s main contributions to Tox21 is the Toxicity Forecaster, or ToxCast for short. The first phase of ToxCast was designed as a "proof-of-concept" and was completed in 2009. It evaluated approximately 300 chemicals in over 500 biochemical and cell-based in vitro HTS assays. The 300 chemicals selected for the first phase were primarily data-rich pesticides for which a large battery of in vivo toxicity studies had been performed. Data collection for the second phase of ToxCast was completed in 2013. The second phase evaluated approximately 1,800 chemicals in an expanded set of over 700 biochemical and cell-based in vitro HTS assays. The 1,800 chemicals came from a broad range of sources, including industrial and consumer products, food additives, and potentially "green" substances that could be safer alternatives to existing chemicals. These chemicals were not as data-rich as those selected for the first phase, and many do not have in vivo toxicity studies. The in vitro data are accessible through the interactive Chemical Safety for Sustainability (iCSS) Dashboard, and raw data files are also posted to the Dashboard web page.
In addition to the in vitro HTS data, the EPA has created a complementary Toxicity Reference Database (ToxRefDB), which comprehensively captures results from in vivo animal toxicity studies. ToxRefDB provides detailed chemical toxicity data from over 30 years and $2 billion in animal testing, in a publicly accessible and computable format.
2. Contest Overview
The purpose of this contest is to understand how well your model captures and uses the biological and scientific information contained in the data. We want to understand how the scientific content of the data contributed to the predictions of the toxicity level. This will help us answer larger questions, such as how well these machine learning models would serve, and how reliable their predictions can be considered, in real-world scenarios that deal with biological topics like toxicity level.
Hence, in this contest, we would like you to submit a document that explains the various aspects of your solution and addresses questions such as:
- Was your model completely numerical-driven? Or did it consciously use some scientifically relevant information?
- What kind of feature combinations did you use? Can you explain the reason for your choices and how they fared better than other features?
- Can you propose a way to evaluate the impact of these meaningful feature combinations separately by using your code?
- What kind of correlations were observed between data available in different files?
- If your model was completely numerical-driven (that is, it gave good predictions based on the values alone), why was it difficult to find correlations between the semantics of the various columns?
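As one illustration of the question about evaluating feature combinations separately, the impact of a group of features can be estimated by removing that group, re-training, and comparing the error against the full model. The sketch below is only a minimal example under stated assumptions: the group names, the toy data, and the helper functions (`drop_columns`, `group_ablation`, `one_nn`) are hypothetical and are not part of the contest data or of any submitted solution. `one_nn` is a stand-in for your own training and prediction code.

```python
from typing import Callable, Dict, List

Matrix = List[List[float]]
Vector = List[float]
# Hypothetical signature: train on (X_train, y_train), return predictions for X_test.
FitPredict = Callable[[Matrix, Vector, Matrix], Vector]


def rmse(y_true: Vector, y_pred: Vector) -> float:
    """Root-mean-square error between true and predicted values."""
    return (sum((a - b) ** 2 for a, b in zip(y_true, y_pred)) / len(y_true)) ** 0.5


def drop_columns(X: Matrix, cols: List[int]) -> Matrix:
    """Return a copy of X without the given column indices."""
    skip = set(cols)
    return [[v for j, v in enumerate(row) if j not in skip] for row in X]


def group_ablation(fit_predict: FitPredict, X_train: Matrix, y_train: Vector,
                   X_test: Matrix, y_test: Vector,
                   groups: Dict[str, List[int]]) -> Dict[str, float]:
    """For each feature group, report (RMSE without the group) - (RMSE with all features)."""
    baseline = rmse(y_test, fit_predict(X_train, y_train, X_test))
    impact = {}
    for name, cols in groups.items():
        preds = fit_predict(drop_columns(X_train, cols), y_train,
                            drop_columns(X_test, cols))
        impact[name] = rmse(y_test, preds) - baseline
    return impact


def one_nn(X_train: Matrix, y_train: Vector, X_test: Matrix) -> Vector:
    """Toy stand-in model: predict the target of the nearest training row."""
    preds = []
    for x in X_test:
        dists = [sum((a - b) ** 2 for a, b in zip(x, t)) for t in X_train]
        preds.append(y_train[dists.index(min(dists))])
    return preds


# Toy data (made up): the target depends on column 0 only; column 1 is noise.
X_train = [[0.0, 5.0], [1.0, 1.0], [2.0, 4.0], [3.0, 0.0]]
y_train = [0.0, 1.0, 2.0, 3.0]
X_test = [[0.1, 0.9], [2.9, 4.8]]
y_test = [0.0, 3.0]

# Hypothetical group names mapped to column indices.
groups = {"informative": [0], "noise": [1]}

impact = group_ablation(one_nn, X_train, y_train, X_test, y_test, groups)
for name, delta in impact.items():
    print(f"{name}: change in RMSE when removed = {delta:+.3f}")
```

A positive value means the error grew when the group was removed, which suggests that the model relied on it. A value near zero or below zero suggests that the group contributed little or added noise. With real data, the groups could be chosen along scientifically meaningful lines, so that the effect of each combination can be reported separately.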
In addition to the above answers, we would also like you to submit:
- Algorithm Documentation: Please describe in detail the algorithm used for prediction of LEL values.
- The complete source code that was used locally to compute the predictions.
- A separate file with the final predicted values for all 1854 chemicals (this will be the same as your final submission, on which the system test was scored).
3. Submission Scope
1.) Explain completely your final submission (the one that produced your final provisional and system test scores).
2.) Explain any other provisional submission that you feel could have performed better and might have achieved a higher final score than the current one.
Please Note: As a part of the submission, please submit a short “abstract” for the submissions, just listing the significant and meaningful “predictors” in a “twitter”-format bullet list. These abstracts will be used to narrow down the list of submissions for client review if there is a large number of submissions, so please treat this as important.
4. Review and Prizes
- The submissions will be reviewed by the clients, and the results will be based on how well they feel your solution captures the scientific aspect of the data.
Please Note: The prizes in this contest will be awarded independent of your position in the match. Hence, even if you did not achieve a top-4 score in the MM, your submission has a great chance to win a prize in this contest if it is deemed more scientifically relevant.
The top three submissions in this contest will be awarded the following prizes, based on how much scientific relevance the solutions have.
- 1st place: $2,200
- 2nd place: $1,300
- 3rd place: $900
* The contest page shows only two prizes, but we will award three prizes as described above.
* The winners of this contest will transfer the rights to their solutions to TopCoder so that they can be shared with clients.