
Challenge Overview

Objective

The purpose of this project is to develop an Artificial Intelligence (AI)/Machine Learning (ML) tool that will help astronomers detect very faint comets that approach the Sun, referred to as "sungrazing comets". Scientists at NASA want to improve their ability to detect very dim "Category C" comets.

Additional Information

The Solar and Heliospheric Observatory (SOHO) satellite was launched in 1995 as a joint project between NASA and the European Space Agency (ESA) to study our Sun more closely. It is considered one of the most successful NASA missions of all time, originally operating 12 different instruments to study the Sun's interior, extended atmosphere, and the solar wind. Since launch, SOHO has discovered more than 4,000 new comets - well over half of all known comets - primarily with its Large Angle and Spectrometric Coronagraph (LASCO) telescope. You can read more project background on our mini site!

NASA is particularly interested in so-called "Category C" comets. Category C comets are the faintest category of comet that SOHO discovers, often seen embedded within the background noise of the instrument, and are thus very difficult to detect with human eyes. The majority of undiscovered comets within SOHO's data archive will be these extremely faint objects. Approximately 95% of SOHO comets fall into one of four comet 'groups', with members of these groups sharing similar orbital properties. The remaining 5% are so-called "non-group" comets - sporadic, random comets not related to any other comet or comet 'group'. Non-group comets are completely random in both location and timing, unlike comet groups, for which we can predict WHERE the comets will appear in the data, but not when. We expect that most undiscovered non-group comets are also Category C (very faint) comets.

Official comet discovery credit will be given to competitors whose algorithms identify previously unobserved comets!

The metadata for the image set for this competition has been collected through NASA’s open science initiative.

In this challenge your task is to create Python code that detects comets and tracks their movements in sequences of images from SOHO's "LASCO C2" telescope. The presence and location of the comets your algorithm returns will be compared to ground truth data, and the quality of your solution will be judged by how closely it matches the expected results; see Scoring for details.

Input Files

In this task you will work with data files in the standard astronomical Flexible Image Transport System (FITS) format. A FITS file can be thought of as an image together with an associated metadata "header" that contains a number of parameters describing that image, including date, time, dimensions, exposure time, and other miscellaneous properties.
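
For illustration, here is a minimal sketch of opening one of these files with the astropy library (an assumption; any FITS-capable reader works, and the file name and header keywords shown are examples only):

  from astropy.io import fits

  # Open one image and inspect its metadata header and pixel data.
  with fits.open("cmt0052/22167977.fts") as hdul:
      header = hdul[0].header      # date, time, exposure time, etc.
      image = hdul[0].data         # 2-D pixel array
      print(header.get("DATE-OBS"), header.get("EXPTIME"), image.shape)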

Data is organized into image sequences. In the training data each such sequence corresponds to the observed appearances of a single comet; the name of the folder is cmtXXXX, where XXXX is a 4-digit identifier of the comet. Note that in rare cases a single image may contain two or more comets, in which case the same .fts file appears in more than one folder. (E.g. 22032653.fts appears in these two sequences: cmt0218 and cmt0599.) In the training data the image files are accompanied by plain text files that describe the position of the observed comets, as well as some of their other features.

In the test data, image sequences may contain zero, one, or more comets. Unlike in the training data, you can't assume that a comet appears on the first image of a sequence and disappears on the last. Most test sequences have a random 8-digit identifier and typically span a couple of hours of observation. The single exception is a sequence named SET1 that contains 3 days' worth of data. Naturally, the test data does not contain files that describe the positions of comets.

See TRAINING_DATA_README.txt for further details on how the data is organized.
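
As a quick orientation, the training sequences can be enumerated like this (a sketch assuming the cmtXXXX layout described above; the root path is hypothetical):

  from pathlib import Path

  train_root = Path("/data/comets/train")          # hypothetical location
  for seq_dir in sorted(train_root.glob("cmt*")):  # one folder per comet
      frames = sorted(seq_dir.glob("*.fts"))       # the sequence's FITS images
      print(seq_dir.name, len(frames), "frames")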

Downloads

Input files are available for download from the nasa-comets AWS bucket. A separate guide is available that details the process of obtaining the data.

The following files are available for download.

  • train.zip (19.8 GB, 48 GB expanded). The full training data set.
  • TRAINING_DATA_README.txt. Contains a description of the format and content of the training data.
  • train-gt.txt. Contains the training ground truth in a single file, in the same format as is required for output files in this contest. It can be useful for offline testing.
  • test.zip (11.7 GB, 28 GB expanded). The provisional testing data set. Your submissions must contain comet detections from this data set.
  • sample-submission.zip. A sample submission package that illustrates the required submission format. It also contains code that you can use as a starting point.
  • train-sample.zip (140 MB). A small subset of the training data. Use this if you want to get familiar with the data without having to download the full set.

Important note: no additional data beyond those listed above may be used in this contest. As the contest data has been collected in part through NASA's open science initiative, you may find some of the contest metadata online. Using such data is strictly forbidden. Note that during final testing the contest organizers will make sure that the winners' complete training and testing workflow is reproducible using only the contest's own published data sources, and we will also run tests on new data obtained during the course of the contest.

Output File

The output of your algorithm is a single CSV file that lists ALL detected comets from ALL image sequences of the test set. One line in the CSV file corresponds to observations of a single comet. The required format is:

{sequence-id},[{image-id},{x},{y},]...,{confidence} where

  • {sequence-id} is the identifier of the image sequence, i.e. the folder name containing the .fts files,
  • {image-id} is the name of the FITS file; the name must include the .fts extension,
  • {x} and {y} are real values in the [0, 1024] range, representing the coordinates of the center of a detected comet on the given image,
  • {confidence} is a real value in the [0, 1] range, signifying how sure your algorithm is of this detection. See Scoring for details on how this value is used.

A line corresponds to a unique comet you detect in a sequence. It is possible to have two or more lines with the same {sequence-id} if you detect more than one comet in the same sequence. A line should contain one {image-id},{x},{y} triplet for each image in the sequence on which you detected this comet. If your algorithm does not detect any comets in a sequence, then don't add any lines with the corresponding {sequence-id} to the CSV file.

This sample line describes a perfect and confident detection of cmt0052 from the training data: cmt0052,22167977.fts,1009.73,897.70,22167978.fts,984.26,884.06,22167979.fts,962.63,867.21,1.0
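
A small helper like the following could produce such lines (a sketch; the function name and track representation are illustrative, not part of any required interface):

  def format_detection(sequence_id, track, confidence):
      """Format one detected comet as a CSV line.

      track is a list of (image_id, x, y) triplets, one per image
      on which this comet was detected.
      """
      parts = [sequence_id]
      for image_id, x, y in track:
          parts += [image_id, f"{x:.2f}", f"{y:.2f}"]
      parts.append(str(confidence))
      return ",".join(parts)

  # Reproduces the sample line above:
  line = format_detection(
      "cmt0052",
      [("22167977.fts", 1009.73, 897.70),
       ("22167978.fts", 984.26, 884.06),
       ("22167979.fts", 962.63, 867.21)],
      1.0,
  )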

Constraints

  • The output file must not contain more than 5000 lines.
  • The same image-id must not appear more than once in the same line.
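
Both constraints are easy to check before submitting, for example (a sketch):

  def validate_output(csv_path):
      """Check the two hard constraints on the output file."""
      with open(csv_path) as f:
          lines = [ln.strip() for ln in f if ln.strip()]
      assert len(lines) <= 5000, "output file exceeds 5000 lines"
      for ln in lines:
          fields = ln.split(",")
          # image-ids sit at every 3rd position, after sequence-id
          # and before the trailing confidence value
          image_ids = fields[1:-1:3]
          assert len(image_ids) == len(set(image_ids)), "duplicate image-id: " + ln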

Packaging your Submission

Your submission should be a single ZIP file with the following format:

/solution
  /solution.csv
/code
  // your code, see details in the Final testing section
  • The folder structure within the zipped package must be exactly as specified above.
  • solution.csv must contain comet detections from all image sequences of the test set. The format of the file is described above in the Output File section. See the sample submission package for an example.

Your output must contain only algorithmically generated detections. It is strictly forbidden to include hand-labeled data, or data that - although initially machine generated - has been modified in any way by a human.
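
A package with this structure can be assembled with Python's zipfile module, for example (a sketch; the local file and folder names are hypothetical):

  import zipfile
  from pathlib import Path

  with zipfile.ZipFile("submission.zip", "w", zipfile.ZIP_DEFLATED) as zf:
      zf.write("solution.csv", "solution/solution.csv")  # your detections
      for path in Path("my_code").rglob("*"):            # everything under your code folder
          if path.is_file():
              zf.write(path, "code/" + path.relative_to("my_code").as_posix())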

Submission format and code requirements

This match uses a combination of the "submit data" and "submit code" submission styles. In the online submission phase your output file (generated offline) is compared to ground truth; no code is executed on the evaluation server. In the final testing phase your training and testing process is verified by executing your system.

The required format of the submission package is specified in a submission template document. This document gives only requirements that are additional to, or override, those listed in the template.

  • Your submission must be coded in Python.
  • You must not submit more often than 3 times a day. The submission platform does not enforce this limit; it is your responsibility to comply with it. Not observing this rule may lead to disqualification.
  • An exception to the above rule: if your submission scores 0 or -1, then you may make a new submission after a delay of 1 hour.
  • See the General notes section about the allowed open source licenses and 3rd party libraries.
  • The stakeholders of this contest are especially interested in how your algorithm handles noise reduction and other aspects of image preprocessing. It is therefore required that these parts of your system are sufficiently isolated (i.e. into their own class, module, set of functions, etc.) and well documented. Please make a reasonable effort to ensure that readers of your code can easily understand how and where image processing happens; one possible layout is sketched below.
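
For example, the preprocessing entry points could be collected in a single module like this (a sketch; the function names and the median-filter choice are illustrative assumptions, not requirements):

  """preprocessing.py - all image preprocessing for the detector lives here."""
  import numpy as np
  from scipy.ndimage import median_filter  # illustrative choice of denoiser

  def remove_background(image: np.ndarray, background: np.ndarray) -> np.ndarray:
      """Subtract a background estimate (e.g. a running median of frames)."""
      return image - background

  def denoise(image: np.ndarray, size: int = 3) -> np.ndarray:
      """Suppress shot noise with a small median filter."""
      return median_filter(image, size=size)

  def preprocess(image: np.ndarray, background: np.ndarray) -> np.ndarray:
      """The full preprocessing pipeline applied to every frame before detection."""
      return denoise(remove_background(image, background))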

Scoring

During scoring, your output files (as contained in your submission file during provisional testing, or generated by your docker container during final testing) will be matched against the expected ground truth data using the following method. If your solution is invalid (e.g. if the tester tool can't successfully parse its content or if it violates the constraints listed in the Output File section), you will receive a score of -1. If your submission is valid, your score will be calculated using the average precision metric as follows:

  1. The lines of your output file are sorted by confidence, from highest to lowest. Then, for each line:
     • The line is counted as a True Positive (TP) if there is a comet in the ground truth for which all of these are satisfied:
       • For at least 5 images in the image sequence, the true and reported positions of the comet are not further away than 10 pixels (calculated by Euclidean distance).
       • Let N be the number of images in the ground truth for this comet. For at least (N + 5) / 2 images, the true and reported positions are not further away than 25 pixels.
     • Otherwise the line is counted as a False Positive (FP).
  2. Taking the ranked list of TP and FP values, we calculate AP as the area under the precision-recall curve; see here for a description of this process.

Finally, for display purposes, the score is calculated as AP scaled up to the [0, 100] range.
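
In code, the matching rule and the AP calculation look roughly like this (a sketch based on the description above; the scorer tool mentioned in the notes below is the authoritative reference):

  import math

  def is_true_positive(pred, gt):
      """pred and gt map image-id -> (x, y); gt holds one comet's ground truth."""
      dists = [math.dist(pred[img], gt[img]) for img in pred if img in gt]
      n = len(gt)  # N, the number of ground truth images for this comet
      return (sum(d <= 10 for d in dists) >= 5 and
              sum(d <= 25 for d in dists) >= (n + 5) / 2)

  def average_precision(flags, total_gt_comets):
      """flags: TP/FP booleans for your lines, sorted by descending confidence."""
      ap, tp = 0.0, 0
      for rank, is_tp in enumerate(flags, start=1):
          if is_tp:
              tp += 1
              ap += tp / rank  # precision sampled at each recall step
      return ap / total_gt_comets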

Notes:

  • There are comets in the test data for which there are fewer than 5 observations. These comets are ignored in score calculation.
  • See the source code of the scorer tool for the exact details of the AP calculation.

Final testing

This section details the final testing workflow; the requirements for the /code folder of your submission are also specified in the submission template document. This document gives only requirements or pieces of information that are additional to, or override, those given in the template. You may ignore this section until you start preparing your system for final testing.

  • The signature of the train script is as given in the template:
train.sh {data_folder}

The supplied {data_folder} parameter points to a folder containing the training data in the same structure as is available to you during the coding phase, with the zip files already extracted. The supplied {data_folder} is the parent folder of the subfolders representing image sequences.

  • The allowed time limit for the train.sh script is 8 GPU-days (2 days on a p3.8xlarge with 4 GPUs). Scripts exceeding this time limit will be terminated.
  • A sample call to your training script follows. Note that folder names are examples only; you should not assume that the exact same folders will be used in testing.
./train.sh /data/comets/train/

In this sample case the training data looks like this:

  /data
    /comets
      /train
        /cmt0001
          22539952.fts
          22539953.fts
          ... etc., other fts files
        /cmt0002
        ... etc., other cmt folders
  • The signature of the test script:
test.sh {data_folder} {output_folder}

The testing data folder contains image sequences similar to those available to you during the coding phase.

  • The allowed time limit for the test.sh script is 12 GPU-hours (3 hours on a p3.8xlarge with 4 GPUs) when executed on the full provisional test set (the same one you used for submissions during the contest). Scripts exceeding this time limit will be terminated.
  • A sample call to your testing script follows. Again, folder and file names are examples only; you should not assume that the exact same names will be used in testing.
./test.sh /data/comets/test/ /wdata/my_output/

In this sample case the testing data looks like this:

  /data
    /comets
      /test
        /12345678
          22546251.fts
          ... etc., other fts files
        /SET999
        ... etc., other image sequence folders
  • To speed up the final testing process the contest admins may decide not to build and run the dockerized version of each contestant's submission. It is guaranteed, however, that at least the top 10 ranked submissions (based on the provisional leaderboard at the end of the submission phase) will be final-tested.
  • Hardware specification. Your docker image will be built, test.sh and train.sh scripts will be run on a p3.8xlarge Linux AWS instance. Please see here for the details of this instance type.


General Notes

  • This match is rated.
  • Relinquish - Topcoder is allowing registered competitors to "relinquish". Relinquishing means the member will compete, and we will score their solution, but they will not be eligible for a prize. Once a person relinquishes, we post their name to a forum thread labeled "Relinquished Competitors". Relinquishers must submit their implementation code and methods to maintain leaderboard status.
  • In this match you may use open source languages and libraries. If your solution requires licenses, you must have these licenses and be able to legally install them in a testing VM (see “Requirements to Win a Prize” section). Submissions will be deleted/destroyed after they are confirmed. The contest stakeholders will not purchase licenses to run your code. Prior to submission, please make absolutely sure your submission can be run by Topcoder free of cost, and with all necessary licenses pre-installed in your solution. Topcoder is not required to contact submitters for additional instructions if the code does not run. If we are unable to run your solution due to license problems, including any requirement to download a license, your submission might be rejected. Be sure to contact us right away if you have concerns about this requirement.
  • You may use open source languages and libraries provided they are equally free for your use, use by another competitor, or use by the client. If your solution includes licensed elements (software, data, programming language, etc) make sure that all such elements are covered by licenses that explicitly allow commercial use.
  • As stated in the Input files section no external data sets are allowed to be used in this contest.
  • Pre-trained networks (e.g. pre-built segmentation models) are allowed for use in the competition provided the following are satisfied:
    • The pre-trained networks are unencumbered with legal restrictions that conflict with its use in the competition.
    • The data source or data used to train the pre-trained network is defined in the submission description.
  • Use the match forum to ask general questions or report problems, but please do not post comments and questions that reveal information about the problem itself or possible solution techniques.

Requirements to Win a Prize

In order to receive a final prize, you must do all the following:

Achieve a score in the top 7 according to the final system test results. See the Final testing section above. The provided sample code scores 5.68; you must score higher than that.

Comply with all applicable Topcoder terms and conditions.

Once the final scores are posted and winners are announced, the prize winner candidates have 7 days to submit a report outlining their final algorithm, explaining the logic behind their approach and the steps it takes. You will receive a template to help you create your final report.

If you place in a prize winning rank but fail to do any of the above, then you will not receive a prize, and it will be awarded to the contestant with the next best performance who did all of the above.