Challenge Overview
Problem Statement  
Overall Prizes
Subcontest Prizes (per subcontest, 4 total)
Background
The Healthy Birth, Growth, and Development (HBGD) program addresses the dual problem of growth faltering/stunting and poor neurocognitive development, including contributing factors such as fetal growth restriction and preterm birth. The HBGD program is creating a unified strategy for integrated interventions to solve complex questions about (1) life cycle, (2) pathophysiology, (3) interventions, and (4) scaling intervention delivery. The HBGD Open Innovation platform was developed to mobilize the global "unusual suspects" data science community to better understand how to improve neurocognitive and physical health for children worldwide. The data science contests are aimed at developing predictive models and tools that quantify geographic, regional, cultural, socioeconomic, and nutritional trends that contribute to poor neurocognitive and physical growth outcomes in children. The solutions developed by this challenge will support the efforts of the HBGD Open Innovation initiative.
Objective
The goal of this contest is to develop flexible methods that can adaptively fill in, back-fill, and predict time series using a large number of heterogeneous training datasets. The data is a set of thousands of aggressively obfuscated, multivariate time-series measurements. There are multiple output variables and multiple input variables. Each time series has parts missing: either individual measurements or entire sections. Each time series has a different number of known and missing measurements; the goal is to fill in the missing output variables with the best accuracy possible. How the missing input variables are treated is an open question, and is one of the key challenges to solve. This problem, unlike many data science contest problems, is not easy to fit into the standard machine learning framework. Some reasons that this is the case:
The goal of this contest is twofold: we wish not only to obtain a very good, flexible solution, but also to encourage competitors to try a diverse set of approaches and to document which ones worked and why. Ideally, we would like competitors to try methodologies outside the standard scope of contest algorithms. While many contests use Random Forests and Gradient Boosted Regression Trees, we would like competitors to branch out and try recurrent neural networks, Gaussian Process Regression, polynomial regression, VARMAX processes, and other approaches.
To facilitate this goal, several subcontests will run in parallel with the main contest. Each subcontest will focus on a particular type of solution approach and will have its own prizes, in addition to the primary prize pool for the overall contest. Competitors should include a brief explanation of how their solution fits into the framework of one of the subcontests, where applicable:
Mixed Effects Models
Linear and nonlinear mixed effects models are appropriate for our data because we have discrete subjects for which we would like to have discernible models. Further, it would be valuable to have interpretable estimates for the effects of each nominal variable. For solid implementations, see the lme4 and nlme packages in R. Key challenge: linear models are insufficiently expressive for human growth data, but it is tricky to extend nonlinear mixed effects models to the multivariate case.
Neural Networks
Deep neural networks and recurrent neural networks (RNNs) are of great interest because of the amount of empirically successful research that has recently emerged, suggesting that these types of models have the potential to revolutionize many other computational fields. RNNs are of particular interest because they can model variable-length inputs and deal natively with multiple outputs.
For interesting recent work see: https://arxiv.org/abs/1606.04130
There are many good RNN implementations for Python in Theano, TensorFlow, Keras, and other packages. Key challenge: RNNs implicitly assume that the inputs are regularly sampled, but our data is both sparsely and irregularly sampled.
Tree-Based Models
Random Forests and Gradient Boosted Decision Trees are some of the most popular machine learning models. Even though our data is not a precise fit to the independent and identically distributed vectors-of-features model underlying classic supervised learning, predictive models can still be fit and evaluated to good effect. The xgboost package and scikit-learn have great implementations of these types of models. Key challenge: tree-based models do not perform well when dealing with high-cardinality nominal variables, but these variables (such as subject id) provide key information that is necessary for good predictions.
Matrix Completion Models
Although not an obvious approach for time-series data, matrix completion methods can be used to address the sparsely sampled nature of our data. For example, instead of users and items for rows and columns, consider using subjects and time (days) for rows and columns. LightFM and libFM are two good packages to consider here. Key challenge: integrating the side information for each row and each column.
Note that competitors are free to submit different solutions in multiple subcontests and/or the main contest, and can win prizes in more than one.
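As a rough illustration of the matrix completion framing above, the following sketch arranges long-format rows into a subjects-by-time grid with explicit gaps. It is only a sketch under stated assumptions: the helper name build_subject_time_matrix is hypothetical, each input row is assumed to be a dict keyed by the column names from the Data Description, TIMEVAR1 is assumed to serve as the time axis, and missing values are assumed to be represented as None. It uses only the Python standard library and does not call LightFM or libFM.

```python
def build_subject_time_matrix(rows, value_key="y1"):
    """Arrange long-format rows into a (subject x time) grid.

    Assumptions (not from the problem statement): each row is a dict with
    STUDYID, SUBJID, TIMEVAR1 and value_key; a missing value is None.
    """
    rows = list(rows)  # allow any iterable, since we pass over it several times

    # A subject is identified by the (STUDYID, SUBJID) pair.
    subjects = sorted({(r["STUDYID"], r["SUBJID"]) for r in rows})
    # One column per distinct observed time value.
    times = sorted({r["TIMEVAR1"] for r in rows})

    row_index = {s: i for i, s in enumerate(subjects)}
    col_index = {t: j for j, t in enumerate(times)}

    # None marks cells that are unobserved and would need to be completed.
    matrix = [[None] * len(times) for _ in subjects]
    for r in rows:
        i = row_index[(r["STUDYID"], r["SUBJID"])]
        j = col_index[r["TIMEVAR1"]]
        matrix[i][j] = r[value_key]
    return subjects, times, matrix


if __name__ == "__main__":
    sample = [
        {"STUDYID": 1, "SUBJID": 10, "TIMEVAR1": 0.0, "y1": 0.5},
        {"STUDYID": 1, "SUBJID": 10, "TIMEVAR1": 2.0, "y1": None},
        {"STUDYID": 1, "SUBJID": 11, "TIMEVAR1": 1.0, "y1": 0.7},
    ]
    subjects, times, matrix = build_subject_time_matrix(sample)
    for subject, values in zip(subjects, matrix):
        print(subject, values)
```

A grid like this is very sparse when subjects are measured at irregular times, which is where the side-information challenge mentioned above comes in: the covariates for each subject and each time point are not captured by the grid itself.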
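Data Description
The training and test data contain several columns:
+-----------+--------------------+------------+-------------------------------------------------------+
| Column #s | Column Name(s)     | Data Type  | Description                                           |
+-----------+--------------------+------------+-------------------------------------------------------+
| 1-3       | y1, y2, y3         | Float      | The three dependent variables to be predicted in test |
+-----------+--------------------+------------+-------------------------------------------------------+
| 4         | STUDYID            | Integer    |                                                       |
+-----------+--------------------+------------+-------------------------------------------------------+
| 5         | SITEID             | Integer    |                                                       |
+-----------+--------------------+------------+-------------------------------------------------------+
| 6         | COUNTRY            | Integer    |                                                       |
+-----------+--------------------+------------+-------------------------------------------------------+
| 7         | SUBJID             | Integer    |                                                       |
+-----------+--------------------+------------+-------------------------------------------------------+
| 8         | TIMEVAR1           | Float      |                                                       |
+-----------+--------------------+------------+-------------------------------------------------------+
| 9         | TIMEVAR2           | Float      |                                                       |
+-----------+--------------------+------------+-------------------------------------------------------+
| 10-39     | COVAR_CONTINUOUS_n | Float      | (30 fields)                                           |
+-----------+--------------------+------------+-------------------------------------------------------+
| 40-47     | COVAR_ORDINAL_n    | Integer    | (8 fields)                                            |
+-----------+--------------------+------------+-------------------------------------------------------+
| 48-55     | COVAR_NOMINAL_n    | Char       | (8 fields)                                            |
+-----------+--------------------+------------+-------------------------------------------------------+
| 56-58     | y1, y2, y3 missing | True/False | (3 fields) does the value exist in ground truth       |
+-----------+--------------------+------------+-------------------------------------------------------+
The combination of STUDYID and SUBJID is sufficient to uniquely identify a specific individual. Adding TIMEVAR1 is sufficient to uniquely identify each row.
The validation and test data file contains the same fields as the training data, with one primary difference: y1, y2, and y3 are not given and are left empty. The last three columns contain the values "True" or "False", indicating whether y1, y2, or y3 is missing from the ground truth data. This test data file contains the tests for both provisional and system testing; however, you will not know which set each row belongs to. Any given subject/study pair belongs entirely to one or the other.
The submitted predictions should contain one row for each row in the test data set. Each row should contain three comma-separated values: the predicted values for y1, y2, y3. The rows should be in the same order as given in the test data. Note again that wherever the test data indicates there is no ground truth for one or more of the three values, you may use a 0 for the prediction of that value, as it will be ignored and will not contribute towards scoring.
Scoring
Your score for each individual prediction p, compared against the actual ground-truth value t, will be |p - t|. The score for each row, r, will then be the mean of the scores for the individual predictions on that row (possibly 1, 2, or 3 values).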
Over the full n rows, your final score will be calculated as 10 * (1 - Sum(r) / n). Thus a score of 10.00 represents perfect predictions with no error at all. All scores will be rounded down to two significant digits to help prevent overfitting. (Note that we may internally evaluate at higher precision in the event tie-breaking is needed for prize awards.) Submissions that are malformed in any way, such as having the wrong number of rows or values that do not parse as numeric, will score a 0.
To submit your entry, your code only needs to implement a single method, getURL(), which takes no parameters and returns the URL at which your predictions CSV file can be downloaded.
Example Testing
All example testing for this contest should be done offline, using the provided data. Note, however, that you may do a "Test Examples" submission using your predictions file for the provided test data. This will not provide any provisional scoring, but will confirm that the predictions file works correctly (the URL is accessible, the number of rows and columns is correct, and the numerical values parse correctly). This is not required, but can be used as a basic sanity check before making a full submission.
Baseline
A baseline score of 9.88 has been achieved using a combination of methodologies. Competitors in the main challenge will need to reach a score of 9.90 to be eligible for a prize, and subcontest entries will need a score of at least 9.80.
General Notes
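For offline checking against held-out training rows, the scoring rule from the Scoring section can be sketched roughly as follows. This is not the official scorer. It assumes that the per-prediction score is the absolute error |p - t|, that a missing ground-truth value is represented as None, and that rows with no ground truth at all are skipped and not counted in n (the statement does not say how such rows are handled). The function names are hypothetical, and the final rounding step is left out.

```python
from typing import List, Optional, Sequence


def row_error(pred: Sequence[float],
              truth: Sequence[Optional[float]]) -> Optional[float]:
    """Mean absolute error over the outputs of one row that have ground truth.

    Returns None when the row has no ground truth at all (assumption: such
    rows are skipped by the scorer).
    """
    errors = [abs(p - t) for p, t in zip(pred, truth) if t is not None]
    if not errors:
        return None
    return sum(errors) / len(errors)


def contest_score(preds: List[Sequence[float]],
                  truths: List[Sequence[Optional[float]]]) -> float:
    """Unrounded score: 10 * (1 - Sum(r) / n) over the scored rows."""
    if len(preds) != len(truths):
        return 0.0  # the statement says a wrong number of rows scores 0
    scored = [e for e in (row_error(p, t) for p, t in zip(preds, truths))
              if e is not None]
    if not scored:
        raise ValueError("no rows with ground truth to score")
    return 10.0 * (1.0 - sum(scored) / len(scored))


if __name__ == "__main__":
    preds = [(1.0, 2.0, 3.0), (0.5, 0.0, 0.0)]
    truths = [(1.0, 2.0, 3.0), (0.4, None, None)]
    print(contest_score(preds, truths))
```

In the usage example, the second row has ground truth only for y1, so the 0 placeholders for y2 and y3 do not affect the result.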
Requirements to Win a Prize
If you place in the overall top 5, or in the top 3 of one of the four subcontests, but fail to do any of the above, you will not receive a prize, and it will be awarded to the contestant with the next best performance who did all of the above.
This problem statement is the exclusive and proprietary property of TopCoder, Inc. Any unauthorized use or reproduction of this information without the prior written consent of TopCoder, Inc. is strictly prohibited. (c)2020, TopCoder, Inc. All rights reserved.