Challenge Overview
Problem Statement  
Prize Distribution
Requirements to Win a PrizeIn order to receive a prize, you must do all the following:
If you place in the top 5 but fail to do any of the above, then the corresponding prize money will be given to the next ranking contestant who completes the above. Anyone from Genospace & Compass is not eligible for any prize, but they are not banned from competing. BackgroundMultiple myeloma, a cancer of plasma cells, affects hundreds of thousands of people each year. The goal of this contest is to devise an algorithm capable of predicting progression outcomes for multiple myeloma patients, based upon their genetic data. The results of the contest will provide researchers with new understanding and insights into the disease, and ultimately help researchers develop better treatments for myeloma patients. ObjectiveThe objective of this challenge is to rank order a given set of patients based on the predicted time that will elapse before disease progression based on the measured expression of 18,898 genes for each patient. FunctionsYou must implement two functions in your submission: trainingData and testingData. trainingData receives input arguments corresponding to the training set for the given test case. The function receives four String arrays. Each of these String arrays contains a single entry for each patient. Each entry in these arrays consists of a String representing a list of comma separated values. For the first three parameters (expr_avg, expr_diff, and mutation), each entry has 18,898 comma separated values (one for each gene). Each entry of the ground truth parameter (prog_obs_time) has 2 comma separated values: progression time and observation time. Input parameters:
Some values in the comma separated lists may be missing in each of these arrays. Missing values will be left blank. testingData receives the same input arguments as trainingData except for the absence of the test type (test_type) and the progression and observation times (prog_obs_time) which serves as the ground truth. Your implementation of testingData should return an integer array of length N containing a single value for each patient indicating the rankorder from 1 to N before disease progression. These returned values are used in conjunction with the ground truth in order to score your solution. Your rankorder prediction array of length N should contain one of each value from 1 to N inclusive where 1 indicates the patient most likely to progress and N indicates the patient least likely to progress. Data DescriptionExample data is only available through MMRF for download here. You will need credentials, which must be requested (response will take ~24hr), to download the example data. Credentials received before the start of the contest will be activated at the start of the contest. For each patient, 3 values are specified for each of the 18,898 genes:
Two ground truth values are also specified for each patient (prog_obs_time):
The example data is contained in 4 csv files: expressions_example.csv, copynumber_example.csv, mutations_example.csv, and groundtruth_example.csv, corresponding to expr_avg, expr_diff, mutation, and prog_obs_time, respectively. Details about file formats: what the rows and columns represent in all files
Patient data may be missing in some data sets, including the ground truth. Missing data will remain blank. Data PartitioningA total of 640 patients in the full data set are partitioned randomly into 3 subsets: approximately 50% for example, 25% for provisional, and 25% for system. Each subset is partitioned randomly into training and testing groups for each test case. For each provisional test, 65% of the provisional data and all example data is used for training while the remaining 35% of the provisional data is used for testing and scoring. For each system test, 65% of the system data and all provisional and example data is used for training while the remaining 35% of the system data is used for testing and final scoring. The minimum number of provisional and system test cases are 50 and 100 respectively. Additional test cases will be added if deemed necessary. Breakdown of missing data in the ground truth sets:
ScoringAssume an algorithm that returns a rankordered list of patients in decreasing order of likelihood of poor prognosis.
The score is based on the relative progression likelihood for all possible patient pairs. For a given test case, the test case score is given by: s = N * sum_{ij} R_{ij} w_{ij} where N = 1 / sum_{ij} w_{ij} and i, j run over all patient IDs in the test data set. The matrix R has matrix elements given by R_{ij} = sgn(r_i  r_j) which take on the values {+1,0,1} depending on the relative ranking of the patient pairs, as returned by the algorithm. For each patient pair, the weight matrix elements are given by w_{ij} = 2*P_{ij}1 , where P_{ij} represents the likelihood of the given ground truth patient pair ordering (noting that the patient progression times are drawn from an a priori unknown patient progression distribution, and thus their ordering is probabilistic). The probability matrix P_{ij} can be modeled based on reasonable assumptions about the patient progression distribution. Assuming patient risk is described by an exponential distribution function with a decay time set by the patient progression time, the weights will take a simple form, based on four possible cases:
The algorithm score is given by S = 1,000,000 * max(0, s_avg) where s_avg is the average of s over all test cases considered (assume that test cases contain at least two progressing patients). The score, as defined above, ranges between 0 (i.e., random guess solutions) and 1,000,000 (i.e., perfect prediction of ground truth test data). General Notes
Useful gene referencesThe following links may be useful for this challenge:
ReportYour report must be at least 2 pages long, contain at least the following sections, and use the section and bullet names below. Your Information This section must contain at least the following:
Approach Used Please describe your algorithm so that we know what you did even before seeing your code. Use line references to refer to specific portions of your code. This section must contain at least the following:
 
Definition  
 
Examples  
0)  
 
1)  
 
2)  
 
3)  
 
4)  
 
5)  
 
6)  
 
7)  
 
8)  
 
9)  

This problem statement is the exclusive and proprietary property of TopCoder, Inc. Any unauthorized use or reproduction of this information without the prior written consent of TopCoder, Inc. is strictly prohibited. (c)2020, TopCoder, Inc. All rights reserved.