Challenge Overview

 

Overview

Every day, work-related injury records are generated. To reduce the human effort spent coding these records, the Centers for Disease Control and Prevention (CDC) National Institute for Occupational Safety and Health (NIOSH), in close partnership with the Laboratory for Innovation Science at Harvard (LISH), is interested in improving its NLP/ML model that automatically reads injury records and classifies them according to the Occupational Injury and Illness Classification System (OIICS).

Objective

The task is a well-defined classification problem. The programming languages are strictly limited to Python and R.

 

The input training file is a CSV spreadsheet with 4 columns (text, sex, age, and event) and a header row.

  1. text. The raw injury description text.

  2. sex. A categorical variable describing the sex of the person involved.

  3. age. A positive integer giving the age of the person involved.

  4. event. The target variable: the OIICS label to be predicted. There are 48 unique labels in total.

 

You are asked to build a model based on the above training data. Your model will then need to make predictions for the following test file.

 

The test file is a CSV spreadsheet with only 3 columns (text, sex, and age) and a header row. The format is the same as the training file, but the event column is missing. Once your model is trained, it should be able to consume the test file and produce the prediction file by filling in the event column. Specifically, your output must be a CSV file with all 4 columns, keeping the rows in the same order as the test file.
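
To make this input/output contract concrete, here is a minimal sketch in Python using pandas and scikit-learn. The column names come from the statement above; the TF-IDF plus logistic regression model and the file names train.csv, test.csv, and solution.csv are illustrative assumptions, not a prescribed approach.

# A minimal sketch, assuming the CSV files follow the column layout
# described above (text, sex, age[, event]); file names are hypothetical.
import pandas as pd
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression

train = pd.read_csv("train.csv")   # columns: text, sex, age, event
test = pd.read_csv("test.csv")     # columns: text, sex, age

# Illustrative model: TF-IDF features over the free-text field only.
vectorizer = TfidfVectorizer(ngram_range=(1, 2), min_df=2)
clf = LogisticRegression(max_iter=1000)
clf.fit(vectorizer.fit_transform(train["text"]), train["event"])

# Predict and write the 4-column output, preserving the test-file row order.
test["event"] = clf.predict(vectorizer.transform(test["text"]))
test[["text", "sex", "age", "event"]].to_csv("solution.csv", index=False)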

Datasets

There are 229,820 records in total. 


We have randomly sampled 153,956 of the records, with the event column included, as the training set. You are asked to use this training set to develop your model locally. The remaining 75,864 records are held out for testing. You can download the 3-column testing set here (the event column is missing). In your submission, you will need to include predictions for all records in the testing set, in the same order.
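
Because the ground truth for the testing set is not available locally, a stratified hold-out split of the training set is one simple way to estimate performance before submitting. A minimal sketch, where the 10% validation fraction and the file name are assumptions:

# A minimal sketch of a local, stratified validation split.
import pandas as pd
from sklearn.model_selection import train_test_split

train = pd.read_csv("train.csv")
dev_train, dev_valid = train_test_split(
    train,
    test_size=0.1,
    stratify=train["event"],  # keep the 48 labels in similar proportions
    random_state=42,
)
print(len(dev_train), "training rows,", len(dev_valid), "validation rows")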

 

Baseline

The client has built a baseline model using BERT. It achieves an accuracy of around 87% in their local testing. We will post the baseline codebase in the forum; it may help you identify potential avenues for improving the solution.

Submission format

This match uses a combination of the "submit data" and "submit code" submission styles. Your submission must be a single ZIP file with the following content:

 

/solution

    solution.csv

/code

    dockerfile

    <your code>

 

where

  • /solution/solution.csv is the output your algorithm generates on the provisional test set. The format of this file is described above in the Objective section.

  • /code contains a dockerized version of your system that will be used to reproduce your results in a well-defined, standardized way. This folder must contain a dockerfile that will be used to build a docker container that will host your system during final testing. How you organize the rest of the contents of the /code folder is up to you, as long as it satisfies the requirements listed below in the Final testing section. (A minimal packaging sketch follows this list.)
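
For illustration only, the required ZIP layout could be produced with a few lines of Python; the local file names solution.csv, dockerfile, and predict.py are assumptions about your own working directory.

# A minimal packaging sketch; local file names are assumptions.
import zipfile

with zipfile.ZipFile("submission.zip", "w", zipfile.ZIP_DEFLATED) as zf:
    zf.write("solution.csv", arcname="solution/solution.csv")
    zf.write("dockerfile", arcname="code/dockerfile")
    zf.write("predict.py", arcname="code/predict.py")  # the rest of /code is up to you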

 

Notes:

  • During provisional testing, only your solution.csv file will be used for scoring; however, the tester tool will verify that your submission file conforms to the required format. This means that at least the /code/dockerfile must be present from day 1, even if it doesn't describe any meaningful system to be built. However, we recommend that you keep working on the dockerized version of your code as the challenge progresses, especially if you are at or close to a prize-winning rank on the provisional leaderboard.

  • You must not submit more often than once every 4 hours. The submission platform does not enforce this limit; it is your responsibility to comply with it. Not observing this rule may lead to disqualification.

  • During final testing, your last submission file will be used to build your docker container.

  • Make sure that the contents of the /solution and /code folders are in sync, i.e. your solution.csv file contains the exact output of the current version of your code. 

  • To speed up the final testing process, the contest administrators may decide not to build and run the dockerized version of each contestant's submission. It is guaranteed, however, that if there are N main prizes, then at least the top 2*N ranked submissions (based on the provisional leaderboard at the end of the submission phase) will go through final testing.

Scoring

During scoring, your solution.csv file (as contained in your submission file during provisional testing, or generated by your docker container during final testing) will be matched against the expected ground truth data using the following criteria.

 

If your solution is invalid (e.g., if the tester tool cannot parse its content, or if the number of lines does not match), you will receive a score of 0.

 

During the challenge, there will be a live leaderboard based on a fixed random 50% subset of the testing set. Once the challenge is finished, your last submission will be re-evaluated against the other 50% of the testing set, and the final ranking is decided based on that held-out half. The purpose of this split is to discourage overfitting to the leaderboard.

 

We use the weighted F1 score to evaluate your submission. Since this is a multi-class classification problem, we compute an F1 score for each class label using a one-vs-rest transformation.

 

Focusing on the i-th OIICS label, we transform the ground truth labels and the predicted labels into binary labels: 1 means the record has the i-th OIICS label, while 0 means it does not. In this way, we can build a 2x2 confusion matrix and compute its F1 score, F1_i.

 

There are 48 unique OIICS labels in total. To combine the per-label F1 scores, we use the label frequencies in the ground truth as weights. Suppose the i-th OIICS label appears Freq_i times in the ground truth labels. The final weighted F1 score is then:

 

weighted F1 = (Σ_i Freq_i * F1_i) / (Σ_i Freq_i)

 

Since each F1_i lies between 0 and 1, this weighted average is also always between 0 and 1.

 

Final normalized score:

 

Score = 100 * weighted F1
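
For reference, the metric described above corresponds to scikit-learn's weighted-average F1. A minimal sketch, assuming y_true and y_pred are equal-length sequences of event labels:

# A minimal sketch of the scoring metric described above.
from sklearn.metrics import f1_score

def score(y_true, y_pred):
    # average="weighted" weights each per-label F1 by that label's
    # frequency in the ground truth, matching the formula above.
    return 100.0 * f1_score(y_true, y_pred, average="weighted")

# Toy call with placeholder labels, for illustration only.
print(score(["a", "b", "a", "c"], ["a", "b", "b", "c"]))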

 

We will provide the tester tool in the forum. If you identify any issues, please reply in that thread.

 

Final testing

This section describes the final testing workflow and the requirements for the /code folder of your submission. You may ignore this section until you start to prepare your system for final testing.

 

To successfully submit your system for final testing, some familiarity with Docker is required. If you have not used this technology before, you may first check this page and other learning material linked from there. To install Docker, follow these instructions.

Contents of the /code folder

The /code folder of your submission must contain:

  • All your code (training and inference) that is needed to reproduce your results.

  • A Dockerfile (named dockerfile, without extension) that will be used to build your system.

  • All data files that are needed during training and inference, with the exception of

    • the contest’s own training and testing data. You may assume that the contents of the /training and /testing folders (as described in the Input files section) will be available on the machine where your docker container runs, with the zip files already unpacked,

    • large data files that can be downloaded automatically either during building or running your docker script.

  • Your trained model file(s). Alternatively, your build process may download your model files from the network. Either way, it must be possible to run inference without executing training first.

 

The tester tool will unpack your submission, and the 

docker build -t <id> .

command will be used to build your docker image (the final ‘.’ is significant), where <id> is your TopCoder handle.

 

The build process must run out of the box, i.e. it should download and install all necessary 3rd party dependencies, and it should obtain all necessary external data files, model files, etc., either by downloading them from the Internet or by copying them from the unpacked submission.

 

Your container will be started by the 

docker run -v <local_data_path>:/data:ro -v <local_writable_area_path>:/wdata -it <id>

command (single line), where the -v parameter mounts the contest’s data to the container’s /data folder. This means that all the raw contest data will be available to your container within the /data folder. Note that your container will have read-only access to the /data folder. You can store large temporary files in the /wdata folder.

Training and test scripts

Your container must contain a train script and a test (a.k.a. inference) script with the following specifications:

  • train.sh <train-csv-file> should create any data files that your algorithm needs for running test.sh later. The supplied <train-csv-file> parameter points to the training CSV file, as described above. The allowed time limit for the train.sh script is 2 days.

  • As its first step, train.sh must delete your home-made models shipped with your submission. 

  • Some algorithms may not need any training at all. It is a valid option to leave train.sh empty, but the file must exist nevertheless.

  • It must be possible to perform training using only the provided data and publicly available external data. This means that this script should do all the preprocessing and training steps that are necessary to reproduce your complete training workflow.

  • A sample call to your training script (single line):
    ./train.sh /data/train.csv

  • test.sh <test-csv-file> <output_path> should run your inference code on new, unlabeled data and generate an output CSV file as specified by the problem statement. The allowed time limit for the test.sh script is 12 hours. The testing CSV file follows the same format as described before. The final testing data will be similar in size and in content to the provisional testing data. You may assume that the data folder path will be under /data. (A minimal sketch of such an inference entry point appears after this list.)

  • It must be possible to run inference without running training first, i.e. using only your home-made model files.

  • It should be possible to execute your inference script multiple times on the same input data or on different input data. You must make sure that these executions don't interfere, i.e., each execution leaves your system in a state in which further executions are possible.

  • A sample call to your testing script (single line):
    ./test.sh /data/test.csv solution.csv
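
As an illustration of the expected contract, test.sh could simply wrap a Python entry point such as the hypothetical predict.py sketched below, which reads the test CSV given as the first argument and writes the 4-column prediction file to the second argument. The model.pkl file name and the vectorizer/classifier pair are assumptions about your own artifacts.

# predict.py -- a hypothetical inference entry point that test.sh might call:
#   python3 predict.py <test-csv-file> <output_path>
import pickle
import sys

import pandas as pd

def main(test_csv, output_path):
    test = pd.read_csv(test_csv)        # columns: text, sex, age
    with open("model.pkl", "rb") as f:  # shipped with the submission or produced by train.sh
        vectorizer, clf = pickle.load(f)
    test["event"] = clf.predict(vectorizer.transform(test["text"]))
    # Write all 4 columns, preserving the input row order.
    test[["text", "sex", "age", "event"]].to_csv(output_path, index=False)

if __name__ == "__main__":
    main(sys.argv[1], sys.argv[2])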

 

Code requirements

  • Your training and inference scripts must output progress information. This may be as detailed as you wish, but at the minimum it should contain the number of epochs processed so far (a minimal example follows this list).

  • Your testing code must process the provisional test data and the final validation data in the same way; that is, it must not contain any conditional logic based on whether it is working on data you have already seen or on unseen data.
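
A minimal sketch of the kind of progress output expected from a training script; the number of epochs and the inner training step are placeholders.

# A minimal progress-reporting sketch; NUM_EPOCHS is an illustrative value.
NUM_EPOCHS = 10

for epoch in range(1, NUM_EPOCHS + 1):
    # ... run one training epoch here ...
    print(f"epoch {epoch}/{NUM_EPOCHS} finished", flush=True)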

Verification workflow

  1. First, test.sh is run on the provisional test set to verify that the results of your latest online submission can be reproduced. This test run uses your home-made models.

  2. Then test.sh is run on the final validation dataset, again using your home-made models. Your final score is the one that your system achieves in this step.

  3. Next, train.sh is run on the full training dataset to verify that your training process is reproducible. After the training process finishes, further execution of the test script must use the models generated in this step.

  4. Finally, test.sh is run on the final validation dataset (or on a subset of that), using the models generated in the previous step, to verify that the results achieved in step #2 above can be reproduced.

 

A note on reproducibility: we are aware that it is not always possible to reproduce the exact same results. For example, if you do online training, differences in the training environments may result in a different number of iterations, and hence in different models. Also, you may have no control over random number generation in certain 3rd party libraries. In any case, the results must be statistically similar, and in case of differences you must have a convincing explanation of why the same result cannot be reproduced.

Hardware specification

Your docker image will be built and run on a Linux AWS instance, having this configuration: 

  • m4.2xlarge

Please see here for the details of this instance type.

General Notes

  • This match is rated.

  • Use the match forum to ask general questions or report problems, but please do not post comments and questions that reveal information about the problem itself or possible solution techniques.

  • Teaming is not allowed. You must develop your solution on your own. Any communication between members beyond what is allowed by the forum rules is strictly forbidden and may lead to disqualification. 

  • In this match you can only use Python and R.

  • You may use open-source libraries provided they are equally free for your use, use by another competitor, or use by the client.

Final prizes

In order to receive a final prize, you must do all the following:

  • Achieve a score in the top five according to final system test results. See the "Final testing" section above.

  • Your provisional score must be higher than that of the published baseline.

  • Once the final scores are posted and winners are announced, the prize winner candidates have 7 days to submit a report outlining their final algorithm, explaining the logic behind their approach and the steps needed to reproduce it. You will receive a template that helps in creating your final report.

  • If you place in a prize-winning rank but fail to do any of the above, then you will not receive a prize, and it will be awarded to the contestant with the next best performance who did all of the above.