Stunting (shortness for age) affects more than one in four children worldwide. Wasting (being underweight) and stunting in early childhood are associated with lethargy, reduced levels of play, an increased risk of early death, a higher burden of disease, compromised physical capacities, and diminished cognitive development. Stunting and wasting in the first two years of life have been shown to be associated with lower school attainment and reduced economic productivity, which can reduce the productivity of an entire generation. Furthermore, stunting between 12 and 36 months is linked to poor cognitive performance and/or lower school grades in middle childhood, and both height and head circumference at 2 years were shown to be inversely associated with educational attainment.
The ability to predict, at birth and early in childhood, whether a child is on an appropriate growth trajectory will help initiate preventive or therapeutic interventions leading to good cognitive growth and development outcomes as determined by school performance and a thriving child- and adulthood.
Our goal is to determine a combination of early measures that would be a good predictor for recumbent length (length of the child measured while the child is lying down, cm), weight (kg), and head circumference (cm). In pursuit of this goal, we have collected time series measurements of child growth, and family trait data (mother's age, mother's height, number of previous pregnancies, breast-feeding practices, and father's height). We would like you to use this data to predict a child's weight, recumbent length, and head circumference in the attached dataset where values have been censored.
You may download the learning data set from here. The format for the data in the data set is a csv with details provided below:
Col  Variable  Label                                   Notes
1    SUBJID    Subject ID                              1 to N
2    AGEDAYS   Age since birth at examination (days)   Day 1 = day of birth
3    MUACCM    Mid upper-arm circumference (cm)
4    TSFTMM    Triceps skinfold thickness (mm)
5    MUAZ      MUAC for age z-score                    Per WHO algorithm
6    TSFTAZ    Tricep SFT for age z-score              Per WHO algorithm
7    BFEDFL    Child breast fed on this day            1 = Yes, 0 = No
8    WEANFL    Child being weaned on this day          1 = Yes, 0 = No
9    SITEID    Investigational Site ID                 1, 2, 3 or 4
10   SEXN      Sex                                     1 = Male, 2 = Female
11   GAGEBRTH  Gestational age at birth in days
12   BRTHWEEK  Week of Birth                           Jan 1st-7th = 1, Jan 8th-14th = 2, etc.
13   BWTREPT   Reported birth weight (gm)
14   BIRTHLEN  Birth length (cm)
15   BIRTHHC   Birth head circumference (cm)
16   MAGE      Maternal age at birth of child (yrs)
17   MHTCM     Maternal height (cm)
18   FHTCM     Father's height (cm)
19   PARITY    Maternal parity                         # of previous live births at time of this child's birth
20   WTKG      Weight (kg)
21   HTCM      Standing height (cm)
22   HCIRCM    Head circumference (cm)
Each child is designated by a Subject ID [column 1] and is measured at several time points during early childhood growth (with the time variable provided as age since birth in days [column 2] and gestational age at birth in days [column 11]). The value "." in any cell means that the value was not measured and is therefore not available.
An example of measurements for a single child is given below:
SUBJID AGEDAYS MUACCM TSFTMM MUAZ TSFTAZ BFEDFL WEANFL SITEID SEXN GAGEBRTH BRTHWEEK BWTREPT BIRTHLEN BIRTHHC MAGE MHTCM FHTCM PARITY WTKG HTCM HCIRCM
100 45 1 0 4 2 291 4 -0.308707977 0.498588401 -0.509778147 0.287446485 -0.069090909 4 -1.423777038 -1.425738199
100 114 1.469656593 0.807642227 1.6 0.32 1 0 4 2 291 4 -0.308707977 0.498588401 -0.509778147 0.287446485 -0.069090909 4 -0.357199356 -0.667208544 -0.892947184
100 130 1 0 4 2 291 4 -0.308707977 0.498588401 -0.509778147 0.287446485 -0.069090909 4 -0.114118861 -0.656525027
100 149 1 1 4 2 291 4 -0.308707977 0.498588401 -0.509778147 0.287446485 -0.069090909 4 0.014862626 -0.239867893
100 170 1.763972954 2.487053578 1.53 2.18 4 2 291 4 -0.308707977 0.498588401 -0.509778147 0.287446485 -0.069090909 4 0.054549237 -0.186450312 -0.474161119
100 310 1.911131134 2.929003933 1.33 2.87 1 1 4 2 291 4 -0.308707977 0.498588401 -0.509778147 0.287446485 -0.069090909 4 0.659770061 0.443877148 0.18393127
100 359 1 1 4 2 291 4 -0.308707977 0.498588401 -0.509778147 0.287446485 -0.069090909 4 0.441493698 0.636180441
100 366 1.322498413 2.487053578 0.65 2.65 1 1 4 2 291 4 -0.308707977 0.498588401 -0.509778147 0.287446485 -0.069090909 4 0.560553533 0.657547473 0.423237593
For each prediction $(w_i, l_i, c_i)$ where at least one of the DV values is missing, the error relative to the true weight, recumbent length and head circumference will be measured as the squared Mahalanobis distance,
$$e_i = (p_i - t_i)^T S^{-1} (p_i - t_i),$$
where $p_i = (w_i, l_i, c_i)$ is the predicted vector, $t_i$ is the vector of true values, and $S^{-1}$ is the inverse of the sample covariance matrix calculated on all data points in the complete dataset for the current month of the prediction. Current month = AGEDAYS / 30 (rounded down), so days 0-29 = Month 0, days 30-59 = Month 1, etc.
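The per-prediction error described above can be sketched in a few lines of plain Python. This is only an illustration: the function names `month_of` and `squared_mahalanobis` are made up for this sketch, and the identity matrix below is a stand-in for a real monthly inverse covariance matrix, which has to come from the published inverseS values.

```python
def month_of(agedays):
    """Month index used for scoring: days 0-29 -> 0, days 30-59 -> 1, etc."""
    return int(agedays) // 30


def squared_mahalanobis(pred, truth, inv_s):
    """Return d^T * inv_s * d, where d = pred - truth.

    pred and truth are (weight, length, head circumference) triples and
    inv_s is a 3x3 inverse covariance matrix given as nested lists.
    """
    d = [p - t for p, t in zip(pred, truth)]
    return sum(d[r] * inv_s[r][c] * d[c] for r in range(3) for c in range(3))


# Hypothetical stand-in for a real inverse covariance matrix (assumption).
identity = [[1.0, 0.0, 0.0], [0.0, 1.0, 0.0], [0.0, 0.0, 1.0]]
error = squared_mahalanobis((0.5, 0.0, -1.0), (0.0, 0.0, 0.0), identity)
```

With the identity matrix the result reduces to the ordinary squared Euclidean distance, which makes the sketch easy to check by hand.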
Scores will be calculated as a generalized R^2 measure of fit. The total error for the submission is the sum of the per-prediction errors, SSE = SUM(e_i).
A baseline sum of squared errors will be calculated in the same way, over the same measurements (those where at least one of the DV values is missing), by predicting the sample means of w, l and c in the current training set for the corresponding month (as explained above):
SSE0 = SUM(e0_i)
The submission score will then be Score = 1000000 * MAX(1 - SSE/SSE0, 0).
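As a rough illustration of the scoring arithmetic, the sketch below computes per-month baseline means and the final score from two lists of per-prediction errors. The function names are hypothetical, and two points are assumptions rather than facts from the statement: that the baseline means are taken over records with all three DV values present, and that SSE0 is strictly positive.

```python
from collections import defaultdict


def monthly_means(records):
    """Mean (w, l, c) per month.

    records: iterable of (agedays, w, l, c) tuples with no missing values
    (assumption: only complete records contribute to the means).
    """
    sums = defaultdict(lambda: [0.0, 0.0, 0.0, 0])
    for agedays, w, l, c in records:
        acc = sums[int(agedays) // 30]
        acc[0] += w
        acc[1] += l
        acc[2] += c
        acc[3] += 1
    return {m: (a[0] / a[3], a[1] / a[3], a[2] / a[3]) for m, a in sums.items()}


def submission_score(errors, baseline_errors):
    """Score = 1000000 * MAX(1 - SSE/SSE0, 0); assumes SSE0 > 0."""
    sse = sum(errors)
    sse0 = sum(baseline_errors)
    return 1000000 * max(1 - sse / sse0, 0)
```

A submission that halves the baseline error scores 500000, and one that is worse than predicting the monthly means scores 0.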
In trainingData, each string is one measurement record with 22 comma-separated tokens, in the same order as in the table above. As before, missing values for non-DV variables are presented as "." strings. In trainingData, not all DV values are present. The format of testingData is almost the same as that of trainingData. The only difference is that some of the DV values are also replaced by "." strings, and your task is to predict them. Replacement is done in the following way:
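Reading one record could look like the sketch below, which maps the 22 tokens to the column names from the table and turns "." into `None`. The helper name `parse_record` and the sample line are made up for illustration; the sample is not taken from the real data set.

```python
COLUMNS = [
    "SUBJID", "AGEDAYS", "MUACCM", "TSFTMM", "MUAZ", "TSFTAZ", "BFEDFL",
    "WEANFL", "SITEID", "SEXN", "GAGEBRTH", "BRTHWEEK", "BWTREPT",
    "BIRTHLEN", "BIRTHHC", "MAGE", "MHTCM", "FHTCM", "PARITY",
    "WTKG", "HTCM", "HCIRCM",
]


def parse_record(line):
    """Split one comma-separated record into a dict; "." becomes None."""
    tokens = [tok.strip() for tok in line.split(",")]
    if len(tokens) != len(COLUMNS):
        raise ValueError("expected %d tokens, got %d" % (len(COLUMNS), len(tokens)))
    return {
        name: (None if tok == "." else float(tok))
        for name, tok in zip(COLUMNS, tokens)
    }


# Hypothetical record, invented for this example only.
example = "100,45,.,.,.,.,1,0,4,2,291,4,-0.31,0.50,-0.51,0.29,-0.07,.,4,-1.42,-1.43,."
record = parse_record(example)
```

Here `record["HCIRCM"]` is `None`, which marks head circumference as a value to predict (or simply unmeasured, in the training data).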
N = number of time points for an ID
X = random between 0 and N/2 inclusive
Y = random between X and N inclusive
foreach time point W (1..N) for an ID
    if W <= X or AGEDAYS < 90, then all three DV values are present
    else if W <= Y, then 'c' is replaced by "."
    else all three DV values are replaced by "."
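To build a local validation set, the replacement procedure above can be imitated on training data. The sketch below is one possible reading of it, not the organizers' generator: it assumes N/2 means integer division and that X and Y are drawn uniformly, and the function name and status labels are invented for this example.

```python
import random


def censor_child(agedays_list, rng):
    """Return one status per time point of a single child.

    agedays_list: AGEDAYS values ordered by time point.
    rng: a random.Random instance, so that results are reproducible.
    """
    n = len(agedays_list)
    x = rng.randint(0, n // 2)  # assumption: N/2 is rounded down
    y = rng.randint(x, n)       # randint includes both endpoints
    statuses = []
    for w, agedays in enumerate(agedays_list, start=1):
        if w <= x or agedays < 90:
            statuses.append("all_present")
        elif w <= y:
            statuses.append("c_missing")
        else:
            statuses.append("all_missing")
    return statuses


statuses = censor_child([45, 114, 130, 149, 170, 310, 359, 366], random.Random(0))
```

Because time points are ordered by age, a child's statuses can only move from fully present, to head circumference missing, to everything missing.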
Records with the same ID are consecutive and ordered by AGEDAYS (time point). Each element of the returned array should contain the predictions for weight, recumbent length and head circumference of the child, in this particular order, comma-separated, for the corresponding time point, in the same order as in testingData. The length of the returned array equals the number of measurements.
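The expected output shape can be illustrated with a deliberately naive placeholder predictor that carries the last observed value forward for one child. This is not a recommended model, only a way to show the "weight,length,head circumference" line format; the function name and the 0.0 fallback for values never observed are assumptions made for this sketch.

```python
def format_predictions(rows):
    """Build one output line per time point for a single child.

    rows: list of (weight, length, head_circ) tuples ordered by AGEDAYS,
    with None marking a censored value.
    """
    last = [0.0, 0.0, 0.0]  # assumed fallback when nothing was observed yet
    lines = []
    for row in rows:
        filled = []
        for k, value in enumerate(row):
            if value is not None:
                last[k] = value  # remember the most recent observation
            filled.append(last[k])
        lines.append(",".join(repr(v) for v in filled))
    return lines


lines = format_predictions([(1.0, 2.0, 3.0), (1.5, 2.5, None), (None, None, None)])
```

The result has one line per input time point, including time points where nothing was censored.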
NOTE: All data values are normalized as part of data obfuscation requirements.
Notes on Data Set Generation
Notes on Time Limits
Because different test types deal with different volumes of data, the time limits will also differ. Example tests are limited to 360s (6 minutes), provisional tests to 540s (9 minutes) and system tests to 900s (15 minutes). The testType parameter will be 0, 1, or 2, to indicate Example, Provisional, or System test, respectively, so that your code can take timing into account.
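One way to take the limits into account is to derive a working budget from testType and check it while training. The sketch below is an assumption-laden illustration: the 30-second safety margin and the names `time_budget` and `out_of_time` are not part of the statement.

```python
import time

# Limits in seconds, keyed by testType (0 = Example, 1 = Provisional, 2 = System).
TIME_LIMITS_SECONDS = {0: 360, 1: 540, 2: 900}


def time_budget(test_type, margin_seconds=30):
    """Seconds available for computation, leaving a hypothetical safety margin."""
    return TIME_LIMITS_SECONDS[test_type] - margin_seconds


def out_of_time(start, test_type, now=None):
    """True once the elapsed time since `start` reaches the budget."""
    if now is None:
        now = time.monotonic()
    return now - start >= time_budget(test_type)
```

A training loop could record `start = time.monotonic()` once and stop refining the model as soon as `out_of_time(start, testType)` returns True.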
Scoring and Inverse Covariance
Example scoring code (with comments) here.
Full list of inverseS values for each month here.
This problem statement is the exclusive and proprietary property of TopCoder, Inc. Any unauthorized use or reproduction of this information without the prior written consent of TopCoder, Inc. is strictly prohibited. (c)2020, TopCoder, Inc. All rights reserved.