Challenge Overview

Prize Distribution

  Prize                                 USD
  1st                                   $8,500
  2nd                                   $6,500
  3rd                                   $5,000
  4th                                   $4,000
  5th                                   $3,000

Performance Bonus
  Top submission above 0.7 F-Score      $3,000

Total Prizes                            $30,000

Why this challenge matters

Treatment of patients with pancreatic cancer continues to pose a formidable challenge. Surgical resection remains the best treatment option. However, the definition of which tumors are considered resectable versus unresectable is quite subjective and depends on many factors, such as the surgeon’s expertise and aggressiveness, institutional guidelines, and, most importantly, the amount of tumor contact with adjacent blood vessels. Several groups have tried to develop ways of classifying pancreatic tumors with regard to the chance of successful surgical resection, commonly based on computed tomography (CT) imaging using specific pancreas protocols. Yet there is a lack of consensus, and no single criterion has been adopted as the gold standard. A clinical tool that could both expedite the identification of pancreatic tumors on CT images and provide an objective estimate of the probability of successful tumor resection would dramatically aid management decisions for such patients. We believe that a crowdsourced innovation solution, structured as an online coding competition, is an effective and cost-efficient way to achieve these goals.

Objective

In general terms, the aim of the challenge is to develop algorithms capable of automatically identifying pancreatic tumors and major peripancreatic vessels on diagnostic high-resolution CT images.

To accomplish this goal, proposed algorithms must first be able to accurately auto-segment the boundaries of pancreatic tumors and 3 major peripancreatic blood vessels [superior mesenteric artery (SMA), celiac axis/common hepatic artery continuum (CA/CHA), and superior mesenteric vein/portal vein continuum (SMV/PV)] (Figure 1), trained on a set of fully anonymized contrast-enhanced high-resolution CT scans in which the tumors and these vessels were previously segmented in detail by a physician with expertise in the field (Figure 2).


In more technical terms, your task is to extract polygonal areas that represent tumor regions and blood vessels from CT scan images. The polygons your algorithm returns will be compared to ground truth data, and the quality of your solution will be judged by how well it overlaps with the expected results; see Scoring for details.

Input Files

In this task you will work with anonymized CT scans. Data corresponding to a scan is made up of 3 components: several images (horizontal cross sections, or slices, of the human body), textual metadata that describes how to interpret the image data (e.g. how to translate pixel coordinates into physical coordinates), and ground truth annotations of region contours that describe tumor regions and vessels within the images. Region contour annotations are present only in the training data; they are removed from the provisional and system testing data.

 

Data corresponding to a scan is organized into the following file structure:

/<scan_id>
    /Set_<set_id>

        Slice_<slice_id>_CT_Image.png
        CT_Image.txt
        Slice_<slice_id>_Region_<region_id>_Structure_<struct>.txt
        Structure_Tumor_Center_Pixel.txt

 

Where 

  • <scan_id> is the anonymized unique identifier of the scan, also called ‘Patient ID’.

  • <set_id> is an integer, typically 1; it has no significance in this contest.

  • <slice_id> is the identifier of the image slice; it is unique within a scan. Slice IDs are 1-based, consecutive integers.

  • <region_id> is an integer, typically 1. Higher numbers are used when more than one closed shape of the same type is present on a slice. E.g. if a vessel bifurcates and appears as two ellipses on an image, the two disjoint parts are described in two .txt files having 'Region_1' and 'Region_2' in their names.

  • <struct> is the identifier of a ground truth structure, one of {Tumor, CA_CHA, PV_SMV, SMA}.
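
As an illustration, the naming scheme above can be parsed with a couple of regular expressions. A minimal, unofficial Python sketch follows; the scan_dir path is a placeholder, not a location guaranteed by the spec.

# Hedged sketch: enumerate the files of one scan by the naming scheme above.
import re
from pathlib import Path

scan_dir = Path("train/Patient_1/Set_1")  # hypothetical example location

slice_re = re.compile(r"Slice_(\d+)_CT_Image\.png")
contour_re = re.compile(
    r"Slice_(\d+)_Region_(\d+)_Structure_(Tumor|CA_CHA|PV_SMV|SMA)\.txt")

slices = {}    # slice_id -> image path
contours = {}  # (slice_id, struct) -> list of contour file paths
for f in scan_dir.iterdir():
    m = slice_re.fullmatch(f.name)
    if m:
        slices[int(m.group(1))] = f
        continue
    m = contour_re.fullmatch(f.name)
    if m:
        key = (int(m.group(1)), m.group(3))
        contours.setdefault(key, []).append(f)

print(f"{len(slices)} slices, {len(contours)} (slice, structure) contour groups")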

 

The 3 data components of a scan are described in more detail below.

 

Images

Scan slices are grayscale PNG images with a 16-bit-per-pixel color depth. Most image viewers can display only 8-bit-per-pixel PNG images, so we recommend using the visualizer tool that comes with the challenge to look at the scans.

The number of images per scan varies between 39 and 809 (inclusive). All images in this contest will be of size 512x512 pixels.
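
If you want to work with the images programmatically, one option is OpenCV, which preserves the full 16-bit depth when asked. A minimal sketch follows; the percentile windowing for preview is our choice, not part of the spec.

# Hedged sketch: load a 16-bit slice without losing depth.
import cv2
import numpy as np

img = cv2.imread("Slice_1_CT_Image.png", cv2.IMREAD_UNCHANGED)
assert img.dtype == np.uint16 and img.shape == (512, 512)

# For a quick look in an ordinary 8-bit viewer, window intensities to [0, 255].
lo, hi = np.percentile(img, (1, 99))
view = np.clip((img.astype(np.float32) - lo) / max(hi - lo, 1) * 255,
               0, 255).astype(np.uint8)
cv2.imwrite("preview.png", view)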

Image metadata

Metadata are described in the CT_Image.txt file, which gives the number of slices the scan contains and the conversion ratios between pixel space and physical space.

Contours

The location and shape of known regions on images are referred to as ‘ground truth’ in this document. These data are described in text files. Note that contour descriptions are present only in the training data. The names of such files have the form

Slice_<slice_id>_Region_<region_id>_Structure_<struct>.txt

where <slice_id> references an image of the scan and <struct> references a named structure that a radiologist marked up on the image.

In this contest the most important structures are those named “Tumor”: the pancreatic tumor regions that your algorithms must learn to identify. In the current contest you can assume that each scan contains at least one tumor region.

The contour definition files describe the regions of interest as polygons, by enumerating the polygon's vertices, one per line. Vertex coordinates are given in pixel space: {X=0,Y=0} is the upper left corner of the image, the X axis points right, and the Y axis points down. The points may be listed in either clockwise or counter-clockwise order within the scan's plane.
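
For illustration, a small Python sketch that parses one contour file into a polygon and rasterizes it to a binary mask. The per-line coordinate separator (comma or whitespace) is an assumption on our part; the spec only states one vertex per line. The file name is an example.

# Hedged sketch: contour file -> polygon -> binary mask.
import re
import numpy as np
import cv2

def load_contour(path):
    pts = []
    with open(path) as f:
        for line in f:
            parts = re.split(r"[,\s]+", line.strip())
            if len(parts) >= 2:
                pts.append((float(parts[0]), float(parts[1])))
    return np.array(pts, dtype=np.float32)

def polygon_mask(polygon, shape=(512, 512)):
    mask = np.zeros(shape, dtype=np.uint8)
    cv2.fillPoly(mask, [np.round(polygon).astype(np.int32)], 1)
    return mask

poly = load_contour("Slice_17_Region_1_Structure_Tumor.txt")
mask = polygon_mask(poly)  # 1 inside the region, 0 elsewhere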

An additional piece of information about tumor structures is available for most (but not all) of the scans: a seed point. Seed points do NOT represent the center of mass of a tumor; you can only assume that they are located within a tumor region. During inference this point (if given) can give your algorithm a hint of the location of the tumor. The seed point is described in a file named Structure_Tumor_Center_Pixel.txt in a self-explanatory format: the x and y values are pixel coordinates and the z value is the slice index.

 

Downloads

The following files are available for download.

  • sample.zip (260 MB). Contains 3 scans of the training data. Use this if you want to get familiar with the data without downloading the full data set. The package also contains a sample solution file that scores non-zero on this sample set.

  • train.zip (12 GB). The full training data set.

  • test.zip (5.5 GB). The provisional testing data set. Your submissions must contain contour extractions from this data set.

 

It is recommended to verify the integrity of the large files before trying to extract them.

  • SHA1(train.zip) = 50F81967BAF45DEF52686D0F314AB4F94EB49A3F

  • SHA1(test.zip) = F0549E322B1A1F6E0074E8522AB8885ECD2A20EA
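
For example, the digests can be checked in Python without loading the archives into memory:

# Hedged sketch: stream each archive through hashlib and compare digests.
import hashlib

def sha1_of(path, chunk=1 << 20):
    h = hashlib.sha1()
    with open(path, "rb") as f:
        while block := f.read(chunk):
            h.update(block)
    return h.hexdigest().upper()

assert sha1_of("train.zip") == "50F81967BAF45DEF52686D0F314AB4F94EB49A3F"
assert sha1_of("test.zip") == "F0549E322B1A1F6E0074E8522AB8885ECD2A20EA"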

Output File

Your output must be a CSV file where each line specifies a polygon that your algorithm extracted as a tumor or vessel region. The required format is:

<scan_id>,<slice_id>,<struct>,x1,y1,x2,y2,x3,y3,...

Where <scan_id>, <slice_id> and <struct> are the unique identifiers of a scan, an image slice, and a structure type, respectively, as defined in the Input Files section above. (Angle brackets are for clarity only, they should not be present in the file.) As is the case with contour definition files, the x and y values should be given in pixel space. The ground truth contour definition files contain only integer x and y values, but in your output you may use real numbers.

 

Sample lines that describe

  • a rectangular tumor region extracted from slice #17 of scan Patient_1,

  • and two rectangular regions of the SMA vessel on the same slice:

Patient_1,17,Tumor,80,0,100,0,100,10,80,10
Patient_1,17,SMA,50,50,60,50,60,60,50,60
Patient_1,17,SMA,80,50,90,50,90,60,80,60

The polygons need not be closed, i.e. it is not required that the first and last points of the list are the same.

Your output must be a single file with .csv extension. 

Your output must only contain algorithmically generated contour descriptions. It is strictly forbidden to include hand labeled contours, or contours that - although initially machine generated - are modified in any way by a human.
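
Putting the format together, here is a minimal Python sketch; the results structure and the write_solution name are ours, purely illustrative. It reproduces the sample lines above.

# Hedged sketch: write extracted polygons in the required CSV format.
# `results` is a hypothetical structure: (scan_id, slice_id, struct, vertices).
def write_solution(results, path="solution.csv"):
    with open(path, "w") as f:
        for scan_id, slice_id, struct, vertices in results:
            coords = ",".join(f"{x},{y}" for x, y in vertices)
            f.write(f"{scan_id},{slice_id},{struct},{coords}\n")

write_solution([
    ("Patient_1", 17, "Tumor", [(80, 0), (100, 0), (100, 10), (80, 10)]),
    ("Patient_1", 17, "SMA", [(50, 50), (60, 50), (60, 60), (50, 60)]),
    ("Patient_1", 17, "SMA", [(80, 50), (90, 50), (90, 60), (80, 60)]),
])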

Submission format and code requirements

This match uses a combination of the "submit data" and "submit code" submission styles. In the online submission phase your output file (generated offline) is compared to ground truth; no code is executed on the evaluation server. In the final testing phase your training and testing process is verified by executing your system.

The required format of the submission package is specified in the submission template document. The current document gives only requirements that are additional to, or override, those listed in the template.

  • You must not submit more often than 3 times a day. The submission platform does not enforce this limitation; it is your responsibility to comply with it. Not observing this rule may lead to disqualification.

  • An exception to the above rule: if your submission scores 0 or -1, then you may make a new submission after a delay of 1 hour. 

  • The /solution folder of the submission package must contain the solution.csv file, which should be formatted as specified above in the Output file section and must list the extracted tumor and vessel regions from all images of all scans in the test set. 

Scoring

During scoring, your solution.csv file (as contained in your submission file during provisional testing, or generated by your docker container during final testing) will be matched against the expected ground truth data using the following method.

If your solution is invalid (e.g. if the tester tool can't successfully parse its content, or it violates the size limits), you will receive a score of -1.

If your submission is valid, your score will be calculated as follows:

First, for each scan of the test set, TP (true positive), FP (false positive) and FN (false negative) values are calculated for those slices where ground truth regions are present, regions extracted by your solution are present, or both. Here

  • TP is the area (measured in pixels) of the overlap of expected and extracted regions,

  • FP is the area that your solution extracted but which does not belong to expected regions,

  • FN is the area of the expected regions that is not covered by your extracted regions.

 

The above is calculated separately for the 4 structure types (Tumor, CA_CHA, PV_SMV, SMA). There is one exception to this rule: vessels are contoured in the ground truth only on slices that also contain tumor regions or are close to slices containing tumor regions. This means that you may correctly find vessels elsewhere, but there are no corresponding ground truth annotations. These won't count as FP: for vessels, FP is counted only on slices that actually contain ground truth contour annotations of the same vessel type.

These areas are then summed up for each scan, giving the total TP, FP and FN numbers for the scan, separately for each of the 4 structure types.

Then 4 F1-scores are calculated using these 4 sets of TP, FP and FN numbers:

If TP = 0 then F1 = 0,

otherwise

precision = TP / (TP + FP)

recall = TP / (TP + FN)

F1 = 2 * precision * recall / (precision + recall)

 

The score for the scan is the weighted average of the 4 F1 values, where the weight of the Tumor type is 7 and the weight of each vessel is 1.

Finally, the average of these scan scores is calculated and, for display purposes, scaled up to the [0...100] range.

For the exact algorithm of the scoring see the visualizer source code.
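
For intuition only, here is an unofficial numpy sketch of the per-scan calculation described above. The gt and pred inputs are hypothetical dictionaries of boolean 512x512 masks keyed by (slice_id, struct); the visualizer source remains the authoritative implementation.

# Hedged sketch of the weighted per-scan F1 score.
import numpy as np

STRUCTS = ["Tumor", "CA_CHA", "PV_SMV", "SMA"]
WEIGHTS = {"Tumor": 7, "CA_CHA": 1, "PV_SMV": 1, "SMA": 1}

def scan_score(gt, pred, slice_ids):
    empty = np.zeros((512, 512), dtype=bool)
    f1 = {}
    for s in STRUCTS:
        tp = fp = fn = 0
        for sl in slice_ids:
            g = gt.get((sl, s), empty)
            p = pred.get((sl, s), empty)
            tp += int(np.sum(g & p))
            fn += int(np.sum(g & ~p))
            # Vessel FP counts only on slices carrying ground truth
            # annotations of the same vessel type.
            if s == "Tumor" or g.any():
                fp += int(np.sum(~g & p))
        if tp == 0:
            f1[s] = 0.0
        else:
            prec = tp / (tp + fp)
            rec = tp / (tp + fn)
            f1[s] = 2 * prec * rec / (prec + rec)
    return sum(WEIGHTS[s] * f1[s] for s in STRUCTS) / sum(WEIGHTS.values())

# The displayed score is the mean of scan_score over all scans, times 100.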

Final testing

This section details the final testing workflow. The requirements for the /code folder of your submission are specified in the submission template document; the current document gives only requirements or pieces of information that are additional to, or override, those given in the template. You may ignore this section until you start preparing your system for final testing.

  • The signature of the train script is as given in the template:
    train.sh <data_folder>
    The supplied <data_folder> parameter points to a folder containing the training data in the same structure as is available to you during the coding phase, with the zip files already extracted. The supplied <data_folder> will contain all scan images and metadata files.

  • The allowed time limit for the train.sh script is 8 GPU-days (2 days on a p3.8xlarge with 4 GPUs). Scripts exceeding this time limit will be terminated.

  • A sample call to your training script (single line) follows. Note that folder names are for example only, you should not assume that the exact same folders will be used in testing.
    ./train.sh /data/pancreatic/train/
    In this sample case the training data looks like this:
      data/
        pancreatic/
          train/
            Patient_1/
              Set_1/
                Slice_1_CT_Image.png
                CT_Image.txt
                Slice_1_Region_1_Structure_Tumor.txt
                Structure_Tumor_Center_Pixel.txt
                ... etc., other .png and .txt files
            Patient_2/
            ... etc., other scans
 

  • The signature of the test script:
    test.sh <data_folder> <output_file>
    The testing data folder contains scan data similar to what is available to you during the coding phase.

  • The allowed time limit for the test.sh script is 12 GPU-hours (3 hours on a p3.8xlarge with 4 GPUs) when executed on the full provisional test set (the same one you used for submissions during the contest). Scripts exceeding this time limit will be terminated.

  • A sample call to your testing script (single line) follows. Again, folder and file names are for example only, you should not assume that the exact same names will be used in testing.
    ./test.sh /data/pancreatic/test/ /wdata/my_output.csv
    In this sample case the testing data looks like this:
      data/
        pancreatic/
          test/
            Patient_1000/
              Set_1/
                ... png and txt files
            ... etc., other scans

  • To speed up the final testing process the contest admins may decide not to build and run the dockerized version of each contestant's submission. It is guaranteed however that at least the top 10 ranked submissions (based on the provisional leader board at the end of the submission phase) will be final tested.

  • Hardware specification. Your docker image will be built, and the test.sh and train.sh scripts run, on a p3.8xlarge Linux AWS instance. Please see the AWS documentation for the details of this instance type.

 

Additional Resources

 

General Notes

  • This match is NOT rated.

  • Teaming is allowed. Topcoder members are permitted to form teams for this competition. After forming a team, Topcoder members of the same team are permitted to collaborate with other members of their team. To form a team, a Topcoder member may recruit other Topcoder members and register the team by completing the Topcoder Teaming Form. Each team must declare a Captain. All participants in a team must be registered Topcoder members in good standing. All participants in a team must individually register for this Competition and accept its Terms and Conditions prior to joining the team.
    Team Captains must apportion prize distribution percentages for each teammate on the Teaming Form. The sum of all prize portions must equal 100%. The minimum permitted size of a team is 1 member, the maximum permitted team size is 5 members. Only team Captains may submit a solution to the Competition. Topcoder members participating in a team will not receive a rating for this Competition. Notwithstanding Topcoder rules and conditions to the contrary, solutions submitted by any Topcoder member who is a member of a team on this challenge but is not the Captain of the team are not permitted, are ineligible for award, may be deleted, and may be grounds for dismissal of the entire team from the challenge.
    The deadline for forming teams is 11:59pm ET on the 21st day following the date that Registration & Submission opens as shown on the Challenge Details page. Topcoder will prepare a Teaming Agreement for each team that has completed the Topcoder Teaming Form, and distribute it to each member of the team. Teaming Agreements must be electronically signed by each team member to be considered valid. All Teaming Agreements are void unless electronically signed by all team members by 11:59pm ET of the 28th day following the date that Registration & Submission opens as shown on the Challenge Details page. Any Teaming Agreement received after this period is void. Teaming Agreements may not be changed in any way after signature.
    The registered teams will be listed in the contest forum thread titled “Registered Teams”.

  • Organizations such as companies may compete as one competitor if they are registered as a team and follow all Topcoder rules.

  • Relinquish - Topcoder is allowing registered competitors or teams to "relinquish". Relinquishing means the member will compete, and we will score their solution, but they will not be eligible for a prize. Once a person or team relinquishes, we post their name to a forum thread labeled "Relinquished Competitors". Relinquishers must submit their implementation code and methods to maintain leaderboard status.

  • In this match you may use open source languages and libraries, and publicly available data sets, with the restrictions listed in the next sections below. If your solution requires licenses, you must have these licenses and be able to legally install them in a testing VM (see “Requirements to Win a Prize” section). Submissions will be deleted/destroyed after they are confirmed. Topcoder will not purchase licenses to run your code. Prior to submission, please make absolutely sure your submission can be run by Topcoder free of cost, and with all necessary licenses pre-installed in your solution. Topcoder is not required to contact submitters for additional instructions if the code does not run. If we are unable to run your solution due to license problems, including any requirement to download a license, your submission might be rejected. Be sure to contact us right away if you have concerns about this requirement.

  • You may use open source languages and libraries provided they are equally free for your use, use by another competitor, or use by the client.

  • If your solution includes licensed software (e.g. commercial software, open source software, etc.), you must include the full license agreements with your submission. Include your licenses in a folder labeled “Licenses”. Within the same folder, include a text file labeled “README” that explains the purpose of each licensed software package as it is used in your solution.

  • External data sets and pre-trained networks (pre-built segmentation models, additional CT imagery, etc.) are allowed for use in the competition provided the following are satisfied:

    • The external data and pre-trained networks are unencumbered by legal restrictions that conflict with their use in the competition.

    • The data source or data used to train the pre-trained network is defined in the submission description.

    • The external data source must be declared in the competition forum not later than 14 days before the end of the online submission phase to be eligible in a final solution. References and instructions on how to obtain the data are valid declarations (for instance in the case of license restrictions). If you want to use a certain external data source, post a question in the forum thread titled “Requested Data Sources”. Contest stakeholders will verify the request, and if the use of the data source is approved, it will be listed in the forum thread titled “Approved Data Sources”.

  • Use the match forum to ask general questions or report problems, but please do not post comments and questions that reveal information about the problem itself or possible solution techniques.

Requirements to Win a Prize

In order to receive a final prize, you must do all the following:

Achieve a score in the top five according to final system test results. See the "Final testing" section above.

Comply with all applicable Topcoder terms and conditions.

Once the final scores are posted and winners are announced, the prize winner candidates have 7 days to submit a report outlining their final algorithm, explaining the logic behind their approach and the steps it takes. You will receive a template to help you create your final report.

If you place in a prize winning rank but fail to do any of the above, then you will not receive a prize, and it will be awarded to the contestant with the next best performance who did all of the above.
