Genetic Data Classification Practice Challenge

Key Information

Timezone: Etc/UTC

  • Registration: Nov 14, 2022 14:06 to Dec 14, 2022 14:06
  • Submission: Nov 14, 2022 14:12 to Jan 02, 2023 14:11
  • Review: Jan 02, 2023 14:11 to Jan 12, 2023 14:11
  • Appeals: Jan 12, 2023 14:11 to Jan 13, 2023 14:11
  • Appeals Response: Jan 13, 2023 14:11 to Jan 14, 2023 02:11
  • Winners Announced: Jan 14, 2023 02:11

Challenge Overview

Project Background

This challenge is part of Topcoder's Practice challenge series. These challenges are NOT aimed at solving a problem for a client; they are meant to help members practice and gain experience, particularly members who don't have much experience with Topcoder.

This particular series will focus on Data Science and Machine Learning Practice.

Challenge Objectives

Against the background described above, this challenge introduces a classification practice problem from the field of genomics.

The dataset has been shared in the challenge forum.

Data Description

Features: There are 20531 features in the dataset, and one class label (which is in the final column of the train.csv file).

Labels: There are 5 possible classes: 0, 1, 2, 3 and 4.

Size: There are 560 rows in train.csv and 241 rows in test_x.csv.

Problem Domain: The dataset is related to genomics, which is why it has far more columns than rows; this is common in genetic/genomic datasets. There might be similar genomic/genetic dataset challenges in the near future, and this project should help you get familiar with datasets like these.
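As a minimal sketch of the layout described above, the following splits CSV content into features and labels by treating the last column as the class label. The function name and the has_header flag are hypothetical; whether train.csv actually carries a header row is an assumption to verify against the file in the forum.

```python
import csv
import io


def split_features_and_labels(csv_text, has_header=True):
    """Parse CSV text whose last column is the class label."""
    rows = [row for row in csv.reader(io.StringIO(csv_text)) if row]
    if has_header:
        # Assumption: the first row holds column names, not data.
        rows = rows[1:]
    features = [[float(value) for value in row[:-1]] for row in rows]
    labels = [int(float(row[-1])) for row in rows]
    return features, labels


# Tiny stand-in for train.csv: 3 features plus the label column.
sample = "g1,g2,g3,label\n0.5,1.2,3.3,0\n2.1,0.0,4.8,4\n"
X, y = split_features_and_labels(sample)
print(len(X), len(X[0]), y)
```

On the real file, the same call would be expected to yield 560 rows of 20531 features each, which is a quick sanity check before training.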

Challenge Requirements

The targets for this challenge are:

  • Train a model, using the file train.csv - The train.csv file can be found in the challenge dataset folder in the forum. That file should be used to train a machine learning model. The last (right-most) column in the file is the label, and all other columns are features.
  • Using the trained model, take test_x.csv as feature input and predict the probabilities of each of the 5 classes - The test_x.csv file contains the test dataset. It does NOT contain the labels. Pass these features to your model, and generate a prediction with 241 rows (same as test_x.csv) and 5 columns (corresponding to the 5 classes), with each column containing a decimal value between 0 and 1.
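The two targets above can be sketched with a deliberately simple baseline. This is not a method prescribed by the challenge: it swaps in a nearest-centroid classifier and turns distances into probabilities with a softmax, purely to show the expected output shape (one row per test sample, one column per class). The names fit_centroids and predict_proba are hypothetical, and the inputs are assumed to be already loaded as lists of floats and integer labels.

```python
import math


def fit_centroids(X, y):
    """Return {label: mean feature vector} for every class seen in y."""
    sums, counts = {}, {}
    for row, label in zip(X, y):
        if label not in sums:
            sums[label] = [0.0] * len(row)
            counts[label] = 0
        sums[label] = [s + v for s, v in zip(sums[label], row)]
        counts[label] += 1
    return {c: [s / counts[c] for s in sums[c]] for c in sums}


def predict_proba(centroids, X, classes=(0, 1, 2, 3, 4)):
    """Softmax over negative Euclidean distances, one column per class."""
    probabilities = []
    for row in X:
        # A class with no training samples gets infinite distance, hence probability 0.
        dists = [
            math.dist(row, centroids[c]) if c in centroids else math.inf
            for c in classes
        ]
        nearest = min(dists)
        # Subtracting the smallest distance keeps exp() numerically stable.
        weights = [math.exp(-(d - nearest)) for d in dists]
        total = sum(weights)
        probabilities.append([w / total for w in weights])
    return probabilities


# Toy data standing in for train.csv and test_x.csv.
train_X = [[0.0, 0.0], [0.2, 0.1], [5.0, 5.0], [5.1, 4.9]]
train_y = [0, 0, 1, 1]
model = fit_centroids(train_X, train_y)
for row in predict_proba(model, [[0.1, 0.0], [5.0, 5.1]]):
    print([round(p, 3) for p in row])
```

Each output row has 5 values between 0 and 1 that sum to 1. With 20531 features and only 560 training rows, a real submission would likely benefit from feature scaling or dimensionality reduction, but those choices are left to the participant.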

Submission Format

Create a zip file that, when unzipped, has these two folders at its root:

  • solution - This folder should contain the output prediction file solution.csv. This should be all the features of the test_x.csv + 5 more columns containing the probabilities. So in total, there should be 20536 columns in it (20531 simply copied from test_x.csv and 5 from your model's probability predictions).

  • code - This folder should contain all the code that was used to train the model as well as to test the model. It would be best to create separate train.py and test.py files (or any other split, as long as invoking the testing code does not directly invoke the training code). Note - If we generate the output using the inference code in the code folder, the generated output should match this solution.csv. Documentation - This folder should also contain your documentation.
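The solution.csv layout described above (test features followed by 5 probability columns) can be assembled roughly as follows. This is a sketch under assumptions: the challenge text does not say whether a header row is expected or what the probability columns should be called, so the prob_0 to prob_4 names and both helper functions are hypothetical.

```python
import csv
import io

# Hypothetical column names; the challenge does not specify them.
PROB_COLUMNS = [f"prob_{c}" for c in range(5)]


def build_solution_rows(test_rows, probabilities):
    """Append each row's class probabilities to its original feature values."""
    if len(test_rows) != len(probabilities):
        raise ValueError("predictions must have one row per test_x.csv row")
    return [list(feats) + list(probs) for feats, probs in zip(test_rows, probabilities)]


def write_solution_csv(handle, rows, header=None):
    """Write rows to an open text handle, with an optional header row."""
    writer = csv.writer(handle)
    if header is not None:
        writer.writerow(header)
    writer.writerows(rows)


# Toy example with 2 features instead of 20531.
rows = build_solution_rows([[1.0, 2.0]], [[0.1, 0.2, 0.3, 0.2, 0.2]])
buffer = io.StringIO()
write_solution_csv(buffer, rows, header=["g1", "g2"] + PROB_COLUMNS)
print(buffer.getvalue())
```

For the real data, each row should end up with 20531 + 5 = 20536 values, and there should be 241 data rows.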

Important - Please note that we will be using an automated method for testing, where after downloading the submission, the testing script will automatically unzip the submission and check if there are two folders with the above names. So please make sure that right after the zip file is unzipped, there are two folders available, and they are NOT inside another folder.

Note about Lowercase/uppercase - Please make sure that the folders are in lowercase (small-letters) i.e. 'solution' and 'code'.
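Because the layout is checked automatically, it can help to verify the archive before uploading. The sketch below builds a zip in memory and checks that lowercase solution and code folders sit at the root and that solution/solution.csv exists. The helper names are hypothetical, and the check only approximates what the actual testing script might do, since that script is not published here.

```python
import io
import zipfile

REQUIRED_FOLDERS = {"solution", "code"}


def build_submission_zip(files):
    """Create zip bytes from a {archive_path: text} mapping."""
    buffer = io.BytesIO()
    with zipfile.ZipFile(buffer, "w") as archive:
        for path, text in files.items():
            archive.writestr(path, text)
    return buffer.getvalue()


def has_required_layout(zip_bytes):
    """True if 'solution' and 'code' are root folders and solution.csv is present."""
    with zipfile.ZipFile(io.BytesIO(zip_bytes)) as archive:
        names = archive.namelist()
    # The first path component of every entry is its root-level name.
    roots = {name.split("/", 1)[0] for name in names}
    return REQUIRED_FOLDERS <= roots and "solution/solution.csv" in names


good = build_submission_zip({
    "solution/solution.csv": "placeholder",
    "code/train.py": "# training code",
    "code/test.py": "# inference code",
})
nested = build_submission_zip({
    "submission/solution/solution.csv": "placeholder",
    "submission/code/train.py": "# training code",
})
print(has_required_layout(good), has_required_layout(nested))
```

The comparison is case-sensitive, so a folder named Solution or Code fails the check, matching the lowercase requirement above.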

Metric

This is a classification problem, and we'll be using the classification metric ROC AUC Score as the scoring metric.
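For local validation, ROC AUC can be estimated on a held-out part of train.csv, since test labels are not provided. The challenge does not state how the score is averaged across the 5 classes; the sketch below assumes a one-vs-rest, macro-averaged variant, which may differ from the official scoring script. Both function names are hypothetical.

```python
def binary_auc(scores, is_positive):
    """AUC as the fraction of (positive, negative) pairs ranked correctly; ties count half."""
    pos = [s for s, flag in zip(scores, is_positive) if flag]
    neg = [s for s, flag in zip(scores, is_positive) if not flag]
    if not pos or not neg:
        raise ValueError("AUC needs at least one positive and one negative sample")
    wins = sum((p > n) + 0.5 * (p == n) for p in pos for n in neg)
    return wins / (len(pos) * len(neg))


def macro_ovr_auc(y_true, probabilities, classes=(0, 1, 2, 3, 4)):
    """Average the one-vs-rest AUC of each class, weighting classes equally."""
    per_class = []
    for column, label in enumerate(classes):
        scores = [row[column] for row in probabilities]
        per_class.append(binary_auc(scores, [t == label for t in y_true]))
    return sum(per_class) / len(per_class)


# Toy check with 2 classes: every sample gets the highest score for its true class.
y_true = [0, 0, 1, 1]
probs = [[0.9, 0.1], [0.7, 0.3], [0.2, 0.8], [0.4, 0.6]]
print(macro_ovr_auc(y_true, probs, classes=(0, 1)))
```

The pairwise comparison is quadratic in the number of samples, which is acceptable for a validation split drawn from 560 rows.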

Review Criteria

The submission will be manually reviewed (though the score calculation will be done using an automated script as much as possible). Hence it is important that your submission is in the correct format, as discussed in 'Submission format' above.

In addition, there can also be a subjective review of the submission to ensure that the code quality is up to the mark. Here, the code should be clear and comments should be used wherever appropriate.

In general, the ranking will be predominantly done on the basis of final ROC AUC Score, but points can be deducted in case the subjective aspects are not up to the mark.

What To Submit

To reiterate, the following folders should be available immediately after unzipping (as discussed in 'Submission format' above).

  • code - Should contain all the code + documentation (this should include a general introduction to your approach and a clear list of steps required to deploy your submission)
  • solution - Should contain ONLY the output solution.csv file and nothing else.

Eligible Events: 2023 Topcoder(R) Open

Review Style:

  • Final Review: Community Review Board
  • Approval: User Sign-Off

ID: 30314105