
Mud Log OCR with Google Vision

Key Information

The challenge is finished.

Challenge Overview

Introduction

According to Wikipedia, "Mud logging is the creation of a detailed record (well log) of a borehole by examining the cuttings of rock brought to the surface by the circulating drilling medium (most commonly drilling mud)." Quartz Energy has provided Topcoder with a set of mud logs and we’re developing an application to extract structured meaning from these records. The documents are very interesting - they are even oil-well shaped! You can read more details about them here. If oil is revealed in a well hole sample, a "Show" may be recorded in logs. This is one of the most important pieces of information in the mud logs. Our first attempt to gather information from these files is going to be to find the relevant mud logging terms within the text of these mud logs.

 

The mud logging terms relevant for this contest (phrases) are split into the following categories. The full list can be found here.

  1. Shows

  2. Stains

  3. Traces

  4. Negatives

In this contest, all of the above phrases are equally important.

 

Objective

Your task is to identify all occurrences of the relevant mud logging terms, hereafter referred to as "phrases", in a set of input mud log images. 

 

Here are some related challenges:

  1. Marathon Challenge: https://community.topcoder.com/longcontest/?module=ViewProblemStatement&rd=17004&pm=14713

  2. Several previous development challenges and additional modifications resulted in a baseline solution which can be found here

 

The major differences between this challenge and the past challenges are:

1. We've updated our Ground Truth data to track phrases that break across lines.  Our previous version of the code couldn't handle this use case. If the phrase No Cut was displayed like so:

No

Cut 

Then we had to record the phrase as cut.  Now our ground truth data tracks multi-line phrases.

2.  The number of phrases has been increased.

 

Let’s take a look at some examples to better understand Difference #1.

A small portion of a black-and-white mud log image is shown below. In this image, the relevant mud logging phrases are in the rightmost section. For illustration purposes only, the phrase "stn" has been correctly identified and its bounding box has been highlighted in blue.

 

In previous challenges, the text on each line of a phrase that appeared to span multiple lines was considered independently. In the example mud log below, the final phrase in the text block is "no show", but only the phrase "show" was identified; for illustration purposes, its bounding box has been highlighted in green. In this challenge, your model should ideally return “no show” with two bounding boxes. The order of these two boxes doesn’t matter.

 

We've done some testing and the Google Vision API performs well. We therefore REQUIRE you to use Google Vision, but we're still hoping to improve the retrieval capabilities of the tool. Google has an AutoML capability where you can load up your own images and train a model. We offer you a Google API key, so cost isn't an issue. The details can be found in the forum.

 

We also provide the previous codebase. You are REQUIRED to utilize and build upon this codebase. The previous challenges were Java only.  The phrases your algorithm returns will be compared to the ground truth data, and the quality of your solution will be judged according to how well your solution matches the ground truth. See the "Scoring" section for details.

 

If you submit the provided baseline submission, you will see a score of approximately 36.33 in the provisional test. You must improve upon this baseline and achieve a better score than the baseline in both provisional and system tests, in order to be eligible for a prize.

Input Files

The only input data which your algorithm will receive are the raw mud log images in TIFF format. These image files typically have a height much larger than their width. The maximum image width is 10,000 pixels and the maximum image height is 200,000 pixels. While these images contain a large amount of information, we are only interested in the identification of relevant mud logging phrases.

 

Training Data

The training data set has 205 images which can be downloaded here and the ground truth can be downloaded here. The ground truth file has the same columns as the output file described below. You can use this data set to train and test your algorithm locally.

 

Testing Data (Provisional & Final)

The testing data set, containing only images, has 205 images and can be downloaded here. This image set has been randomly partitioned into a provisional set with 61 images and a system set with 144 images. The partitioning will not be made known to the contestants during the contest. The provisional set is used only for the leaderboard during the contest. During the competition, you will submit your algorithm's results when using the entire testing data set as input. Some of the images in this data set, the provisional images, determine your provisional score, which determines your ranking on the leaderboard during the contest. This score is not used for final scoring and prize distribution. Your final submission's score on the system data set will be used for final scoring. See the "Final Scoring" section for details.

 


Output File

This contest uses the result submission style. For the duration of the contest, you will run your solution locally using the provided provisional data set images as input and produce a CSV file which contains your results.

 

Your output file must be a CSV file with a header row, followed by one bounding box of a phrase per row. The header looks like "IMAGE_NAME,OCR_PHRASE,PHRASE_OCCURRENCE_ID,X1,Y1,X2,Y2". The CSV file should have the following columns, in this order:

  1. IMAGE_NAME - full image name, including file extension

  2. OCR_PHRASE - identified phrase in lowercase (one of the relevant mud logging terms)

  3. PHRASE_OCCURRENCE_ID - the primary key of each phrase occurrence; different bounding boxes of the same phrase occurrence are grouped by this ID.

  4. X1 - pixel x coordinate of upper left corner of phrase bounding box

  5. Y1 - pixel y coordinate of upper left corner of phrase bounding box

  6. X2 - pixel x coordinate of lower right corner of phrase bounding box

  7. Y2 - pixel y coordinate of lower right corner of phrase bounding box

 

Therefore, each row of your result CSV file must have the format:

<IMAGE_NAME>,<OCR_PHRASE>,<PHRASE_OCCURRENCE_ID>,<X1>,<Y1>,<X2>,<Y2>

Image name and phrase should not be enclosed in quotation marks, and each phrase should be composed of only lower case alphabet letters and spaces. For example, one row of your result CSV file may be:

1234567890_sample.tif,oil stain,1,123,456,200,500

 

Note that the same phrase in an image can have multiple bounding boxes. You can add multiple rows to address these cases. <PHRASE_OCCURRENCE_ID> serves as the primary key that groups multiple bounding boxes of the same phrase in the same image together. However, please make sure that:

  1. The number of bounding boxes of the same phrase occurrence is no more than the number of words in this phrase.

  2. The bounding boxes of the same phrase occurrence should not have any overlaps.
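The row format and the two constraints above can be checked locally before submitting. The following is a minimal Python sketch, not the official tester. The helper names (check_rows, to_csv, boxes_overlap), the "no show" coordinates, and the choice to group rows by image name, phrase and occurrence ID are assumptions made for illustration.

```python
import csv
import io
from collections import defaultdict

HEADER = ["IMAGE_NAME", "OCR_PHRASE", "PHRASE_OCCURRENCE_ID", "X1", "Y1", "X2", "Y2"]


def boxes_overlap(a, b):
    """True if two (x1, y1, x2, y2) boxes share a region of positive area."""
    return min(a[2], b[2]) > max(a[0], b[0]) and min(a[3], b[3]) > max(a[1], b[1])


def check_rows(rows):
    """Return a list of problems found in (image, phrase, occ_id, x1, y1, x2, y2) rows."""
    # Assumption: an occurrence is identified by image name, phrase and ID together.
    groups = defaultdict(list)
    for image, phrase, occ_id, x1, y1, x2, y2 in rows:
        groups[(image, phrase, occ_id)].append((x1, y1, x2, y2))
    problems = []
    for (image, phrase, occ_id), boxes in groups.items():
        # Constraint 1: no more boxes than words in the phrase.
        if len(boxes) > len(phrase.split()):
            problems.append(f"{image} #{occ_id}: more boxes than words in '{phrase}'")
        # Constraint 2: boxes of one occurrence must not overlap.
        for i in range(len(boxes)):
            for j in range(i + 1, len(boxes)):
                if boxes_overlap(boxes[i], boxes[j]):
                    problems.append(f"{image} #{occ_id}: overlapping boxes")
    return problems


def to_csv(rows):
    """Serialize rows to CSV text with the required header."""
    buf = io.StringIO()
    writer = csv.writer(buf, lineterminator="\n")
    writer.writerow(HEADER)
    writer.writerows(rows)
    return buf.getvalue()


rows = [
    ("1234567890_sample.tif", "oil stain", 1, 123, 456, 200, 500),
    # A phrase broken across two lines: two rows share one occurrence ID.
    ("1234567890_sample.tif", "no show", 2, 300, 600, 340, 630),
    ("1234567890_sample.tif", "no show", 2, 120, 640, 190, 670),
]
assert check_rows(rows) == []
print(to_csv(rows))
```

The multi-line phrase "no show" is written as two rows with the same PHRASE_OCCURRENCE_ID, one row per bounding box.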

Submission format

This match uses a combination of the "submit data" and "submit code" submission styles. Your submission must be a single ZIP file with the following content:

 

/solution

    solution.csv

/code

    dockerfile

    <your code>

 

where

  • /solution/solution.csv is the output your algorithm generates on the provisional test set. The format of this file is described above in the Output file section.

  • /code contains a dockerized version of your system that will be used to reproduce your results in a well defined, standardized way. This folder must contain a dockerfile that will be used to build a docker container that will host your system during final testing. How you organize the rest of the contents of the /code folder is up to you, as long as it satisfies the requirements listed below in the Final testing section.
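One possible way to assemble the ZIP file is sketched below in Python. The function name build_submission and its parameters are hypothetical, and the sketch assumes that the archive entries are stored as solution/solution.csv and code/... without a leading slash; check the forum if the tester expects something different.

```python
import zipfile
from pathlib import Path


def build_submission(solution_csv, code_dir, out_zip):
    """Pack solution.csv and the code folder into a single submission ZIP."""
    code_dir = Path(code_dir)
    # The statement requires a file named 'dockerfile' in the /code folder.
    if not (code_dir / "dockerfile").is_file():
        raise FileNotFoundError("code folder must contain a file named 'dockerfile'")
    with zipfile.ZipFile(out_zip, "w", zipfile.ZIP_DEFLATED) as zf:
        zf.write(solution_csv, "solution/solution.csv")
        for path in sorted(code_dir.rglob("*")):
            if path.is_file():
                zf.write(path, "code/" + path.relative_to(code_dir).as_posix())
    return out_zip
```

A call such as build_submission("solution.csv", "my_code", "submission.zip") would then produce the archive; the three paths here are placeholders.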

 

Notes:

  • During provisional testing only your solution.csv file will be used for scoring; however, the tester tool will verify that your submission file conforms to the required format. This means that at least the /code/dockerfile must be present from day 1, even if it doesn't describe any meaningful system to be built. However, we recommend that you keep working on the dockerized version of your code as the challenge progresses, especially if you are at or close to a prize winning rank on the provisional leaderboard.

  • You must not submit more often than once every 4 hours. The submission platform does not enforce this limitation; it is your responsibility to comply with it. Not observing this rule may lead to disqualification.

  • During final testing your last submission file will be used to build your docker container.

  • Make sure that the contents of the /solution and /code folders are in sync, i.e. your solution.csv file contains the exact output of the current version of your code. 

  • To speed up the final testing process the contest admins may decide not to build and run the dockerized version of each contestant's submission. It is guaranteed however that if there are N main prizes then at least the top 2*N ranked submissions (based on the provisional leader board at the end of the submission phase) will be finally tested.

Scoring

During scoring your solution.csv file (as contained in your submission file during provisional testing, or generated by your docker container during final testing) will be matched against the expected ground truth data using the following algorithm.

 

If your solution is invalid (e.g. if the tester tool can't successfully parse its content, or if it contains an unknown filename), you will receive a score of 0. 

 

Provisional submissions should include predictions for all images in the testing data set. Provisional submissions will be scored against the provisional image set during the contest. Your final provisional system will be scored against the system image set at the end of the contest.

 

Phrase predictions will be scored against the ground truth in the following way.

 

We will first group multiple bounding boxes of the same phrase in the same image together. So for every phrase, we will have a set of non-overlapping bounding boxes.

 

The overlap factor between 2 sets of non-overlapping bounding boxes, A and B, is the area of the intersection of A and B divided by the area of the union of A and B. We use O(A, B) to denote this measure. Specifically, we have 

O(A, B) = area(A ∩ B) / area(A ∪ B)

It is obvious that this factor is always between 0 and 1.
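The overlap factor can be sketched in Python as follows. Because the boxes inside each set do not overlap one another, the area of A ∩ B is the sum of the pairwise intersections, and the area of A ∪ B is area(A) + area(B) minus that sum. The function names are illustrative, and treating a box's width as X2 - X1 (rather than X2 - X1 + 1) is an assumption; the official tester tool is the authority on such details.

```python
def area(box):
    """Area of a single (x1, y1, x2, y2) box."""
    x1, y1, x2, y2 = box
    return max(0, x2 - x1) * max(0, y2 - y1)


def intersection_area(a, b):
    """Area shared by two (x1, y1, x2, y2) boxes."""
    w = min(a[2], b[2]) - max(a[0], b[0])
    h = min(a[3], b[3]) - max(a[1], b[1])
    return max(0, w) * max(0, h)


def overlap_factor(set_a, set_b):
    """O(A, B) for two lists of boxes; boxes within each list must not overlap."""
    inter = sum(intersection_area(a, b) for a in set_a for b in set_b)
    union = sum(map(area, set_a)) + sum(map(area, set_b)) - inter
    return inter / union if union > 0 else 0.0


# Two 10x10 boxes shifted by 5 pixels: intersection 50, union 150.
print(overlap_factor([(0, 0, 10, 10)], [(5, 0, 15, 10)]))
```

In this example the factor is 50 / 150, roughly 0.33.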

 

For each image, all of the predicted phrases for this image are iterated over in the order they appear in the csv file. The counters numFp, numMatch, numMatchPhrase, and numMissed are initialized to zero.

For each phrase:

  • The maximum overlap factor, Omax, between this prediction and each individual phrase in the ground truth is calculated

  • If Omax is greater than 0.3, then this constitutes a match, and the matching ground truth phrase is ignored for future predictions

  • If not matched, numFp is increased by 1

  • If matched, numMatch is increased by 1

  • If matched and phrase texts are also identical, numMatchPhrase is increased by 1

For every ground truth phrase which was not matched, numMissed is increased by 1.
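The matching loop for one image might look like the Python sketch below. It is an interpretation of the steps above, not the official tester: it assumes that Omax is taken over the ground truth phrases that have not been matched yet, and the names count_matches and single_box_iou are hypothetical. The overlap function is passed in as a parameter; the demo uses a simplified one that only looks at the first box of each phrase.

```python
def count_matches(predictions, ground_truth, overlap, threshold=0.3):
    """Count matches for one image.

    predictions / ground_truth: lists of (phrase_text, boxes), in CSV order.
    overlap: function(boxes_a, boxes_b) -> overlap factor in [0, 1].
    """
    counters = {"numFp": 0, "numMatch": 0, "numMatchPhrase": 0, "numMissed": 0}
    unmatched = list(range(len(ground_truth)))
    for text, boxes in predictions:
        # Find the best overlap among ground truth phrases not matched so far.
        best_idx, o_max = None, 0.0
        for idx in unmatched:
            o = overlap(boxes, ground_truth[idx][1])
            if o > o_max:
                best_idx, o_max = idx, o
        if best_idx is not None and o_max > threshold:
            unmatched.remove(best_idx)  # ignored for future predictions
            counters["numMatch"] += 1
            if text == ground_truth[best_idx][0]:
                counters["numMatchPhrase"] += 1
        else:
            counters["numFp"] += 1
    counters["numMissed"] = len(unmatched)
    return counters


def single_box_iou(boxes_a, boxes_b):
    """Simplified demo overlap: uses only the first box of each phrase."""
    (ax1, ay1, ax2, ay2), (bx1, by1, bx2, by2) = boxes_a[0], boxes_b[0]
    inter = max(0, min(ax2, bx2) - max(ax1, bx1)) * max(0, min(ay2, by2) - max(ay1, by1))
    union = (ax2 - ax1) * (ay2 - ay1) + (bx2 - bx1) * (by2 - by1) - inter
    return inter / union if union else 0.0


truth = [("no show", [(0, 0, 10, 10)]), ("stn", [(100, 100, 120, 110)])]
preds = [("show", [(0, 0, 10, 9)]), ("stn", [(500, 500, 520, 510)])]
print(count_matches(preds, truth, single_box_iou))
```

In the demo the first prediction overlaps the "no show" ground truth but its text is "show", so it adds to numMatch only. The second prediction overlaps nothing and counts as a false positive, which leaves the "stn" ground truth phrase missed.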

 

The score for this image is calculated as a weighted sum of the matching counters, floored at zero:

imageScore = max(-8 * numFp + numMatch + 2 * numMatchPhrase - numMissed, 0)

 

The total score across all images, sumImageScore, is the sum of each individual image score. The maximum possible total score is 3 times the total number of phrases in the ground truth, numPhrasesGt.

 

Final normalized score:

score = 1,000,000 * sumImageScore / (3 * numPhrasesGt)
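As a worked example of the two formulas, the Python sketch below computes the image scores and the normalized score for two hypothetical images. The counter values are made up for illustration, and the function names are not taken from the tester tool.

```python
def image_score(num_fp, num_match, num_match_phrase, num_missed):
    """Per-image score: weighted sum of the counters, floored at zero."""
    return max(-8 * num_fp + num_match + 2 * num_match_phrase - num_missed, 0)


def final_score(image_scores, num_phrases_gt):
    """Normalized score; num_phrases_gt must be greater than zero."""
    return 1_000_000 * sum(image_scores) / (3 * num_phrases_gt)


# Image 1: 5 ground truth phrases, 4 matched (3 with identical text), 1 missed.
s1 = image_score(num_fp=0, num_match=4, num_match_phrase=3, num_missed=1)
# Image 2: 2 ground truth phrases, both matched exactly, plus 1 false positive.
s2 = image_score(num_fp=1, num_match=2, num_match_phrase=2, num_missed=0)
print(s1, s2, final_score([s1, s2], num_phrases_gt=7))
```

Here the first image scores 4 + 6 - 1 = 9. The second image evaluates to -8 + 2 + 4 = -2 and is floored to 0, so a single false positive wipes out two perfect matches. The normalized score is 1,000,000 * 9 / 21, about 428,571.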

 

We will provide the tester tool in the forum. If you identify any issues, please reply to that thread.

Final testing

This section describes the final testing workflow and the requirements for the /code folder of your submission. You may ignore this section until you decide to start preparing your system for final testing.

 

To be able to successfully submit your system for final testing, some familiarity with Docker is required. If you have not used this technology before then you may first check this page and other learning material linked from there. To install docker follow these instructions.

Contents of the /code folder

The /code folder of your submission must contain:

  • All your code (training and inference) that is needed to reproduce your results.

  • A Dockerfile (named dockerfile, without extension) that will be used to build your system.

  • All data files that are needed during training and inference, with the exception of

    • the contest’s own training and testing data. You may assume that the contents of the /training and /testing folders (as described in the Input files section) will be available on the machine where your docker container runs, zip files already unpacked,

    • large data files that can be downloaded automatically either during building or running your docker script.

  • Your trained model file(s). Alternatively your build process may download your model files from the network. Either way, you must make it possible to run inference without having to execute training first.

 

The tester tool will unpack your submission, and the 

docker build -t <id> .

command will be used to build your docker image (the final ‘.’ is significant), where <id> is your TopCoder handle.

 

The build process must run out of the box, i.e. it should download and install all necessary 3rd party dependencies, either download from Internet or copy from the unpacked submission all necessary external data files, your model files, etc.

 

Your container will be started by the 

docker run -v <local_data_path>:/data:ro -v <local_writable_area_path>:/wdata -it <id>

command (single line), where the -v parameter mounts the contest’s data to the container’s /data folder. This means that all the raw contest data will be available for your container within the /data folder. Note that your container will have read only access to the /data folder. You can store large temporary files in the /wdata folder.

Training and test scripts

Your container must contain a train and test (a.k.a. inference) script having the following specification:

 
  • train.sh <data-folder> should create any data files that your algorithm needs for running test.sh later. The supplied <data-folder> parameters point to a folder having training image and annotation data in the same structure as is available for you during the coding phase. The allowed time limit for the train.sh script is 2 days. You may assume that the data folder path will be under /data.

  • As its first step, train.sh must delete your home-made models shipped with your submission.

  • Some algorithms may not need any training at all. It is a valid option to leave train.sh empty, but the file must exist nevertheless.

  • It should be possible to train using only the provided data and publicly available external data. This means that this script should do all the preprocessing and training steps that are necessary to reproduce your complete training workflow.

  • A sample call to your training script (single line):
    ./train.sh /data/training/
    In this case you can assume that the training data looks like this:
      data/
        training/
          images/
          annotations/

  • test.sh <data-folder> <output_path> should run your inference code using new, unlabeled data and should generate an output CSV file, as specified by the problem statement. The allowed time limit for the test.sh script is 12 hours. The testing data folder contains similar data in the same structure as is available for you during the coding phase. The final testing data will be similar in size and in content to the provisional testing data. You may assume that the data folder path will be under /data.

  • Inference should be possible to do without running training first, i.e. using only your prebuilt model files.

  • It should be possible to execute your inference script multiple times on the same input data or on different input data. You must make sure that these executions don't interfere with each other and that each execution leaves your system in a state in which further executions are possible.

  • A sample call to your testing script (single line):
    ./test.sh /data/test/ solution.csv
    In this case you can assume that the testing data looks like this:
      data/
        test/
          images/
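To illustrate the test.sh contract, here is a minimal Python sketch of an inference entry point that test.sh could call with its two arguments. Everything in it is hypothetical: the function names, the detect_phrases placeholder (which stands in for the real OCR and phrase-matching step and returns nothing), and the assumption that image files end in .tif or .tiff. The provided baseline codebase may be organized quite differently.

```python
import csv
from pathlib import Path

HEADER = ["IMAGE_NAME", "OCR_PHRASE", "PHRASE_OCCURRENCE_ID", "X1", "Y1", "X2", "Y2"]


def detect_phrases(image_path):
    """Hypothetical placeholder for the real OCR and phrase-matching step.

    Should yield (phrase, occurrence_id, x1, y1, x2, y2) tuples.
    """
    return []


def run_inference(data_folder, output_path, detect=detect_phrases):
    """Process <data_folder>/images/ and write the result CSV to output_path."""
    image_dir = Path(data_folder) / "images"
    images = sorted(p for p in image_dir.iterdir()
                    if p.suffix.lower() in (".tif", ".tiff"))
    with open(output_path, "w", newline="") as f:
        writer = csv.writer(f)
        writer.writerow(HEADER)
        for n, image in enumerate(images, start=1):
            for phrase, occ_id, x1, y1, x2, y2 in detect(image):
                writer.writerow([image.name, phrase, occ_id, x1, y1, x2, y2])
            # Progress information, as required of inference scripts.
            print(f"processed {n}/{len(images)} images", flush=True)
    return len(images)
```

The function keeps no state between calls and only writes to the given output path, so repeated executions on the same or different input data do not interfere with each other.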

 

Code requirements

  • Your training and inference scripts must output progress information. This may be as detailed as you wish but at the minimum it should contain the number of epochs processed so far.

  • Your testing code must process the test and validation data the same way, that is it must not contain any conditional logic based on whether it works on images that you have already downloaded or on unseen images.

Verification workflow

  1. First test.sh is run on the provisional test set to verify that the results of your latest online submission can be reproduced. This test run uses your home built models.

  2. Then test.sh is run on the final validation dataset, again using your home built models. Your final score is the one that your system achieves in this step.

  3. Next train.sh is run on the full training dataset to verify that your training process is reproducible. After the training process finishes, further execution of the test script must use the models generated in this step.

  4. Finally test.sh is run on the final validation dataset (or on a subset of that), using the models generated in the previous step, to verify that the results achieved in step #2 above can be reproduced.

 

A note on reproducibility: we are aware that it is not always possible to reproduce the exact same results. E.g. if you do online training then the difference in the training environments may result in different number of iterations, meaning different models. Also you may have no control over random number generation in certain 3rd party libraries. In any case, the results must be statistically similar, and in case of differences you must have a convincing explanation why the same result can not be reproduced.

Hardware specification

Your docker image will be built and run on a Linux AWS instance, having this configuration: 

  • m4.2xlarge

Please see here for the details of this instance type.

General Notes

  • This match is rated.

  • Use the match forum to ask general questions or report problems, but please do not post comments and questions that reveal information about the problem itself or possible solution techniques.

  • Teaming is not allowed. You must develop your solution on your own. Any communication between members beyond what is allowed by the forum rules is strictly forbidden.

  • In this match you may use any programming language and libraries, including commercial solutions, provided Topcoder is able to run it free of any charge. You may also use open source languages and libraries, with the restrictions listed in the next section below. If your solution requires licenses, you must have these licenses and be able to legally install them in a testing VM. Submissions will be deleted/destroyed after they are confirmed. Topcoder will not purchase licenses to run your code. Prior to submission, please make absolutely sure your submission can be run by Topcoder free of cost, and with all necessary licenses pre-installed in your solution. Topcoder is not required to contact submitters for additional instructions if the code does not run. If we are unable to run your solution due to license problems, including any requirement to download a license, your submission might be rejected. Be sure to contact us right away if you have concerns about this requirement.

  • You may use open source languages and libraries provided they are equally free for your use, use by another competitor, or use by the client.

  • If your solution includes licensed software (e.g. commercial software, open source software, etc), you must include the full license agreements with your submission. 

All software must be available for commercial use. Include your licenses in a folder labeled “Licenses”. Within the same folder, include a text file labeled “README” that explains the purpose of each licensed software package as it is used in your solution.

  • External data sets and pre-trained models are allowed for use in the competition provided the following are satisfied:

  • The external data and pre-trained models are unencumbered with legal restrictions that conflict with its use in the competition.

  • The data source or data used to train the pre-trained models is defined in the submission description.

  • Same as the software licenses, data must be unrestricted for commercial use.

Final prizes

In order to receive a final prize, you must do all the following:

  • Achieve a score in the top five according to final system test results. See the "Final testing" section above.

  • Once the final scores are posted and winners are announced, the prize winner candidates have 7 days to submit a report outlining their final algorithm and explaining the logic behind their approach and the steps it takes. You will receive a template to help you create your final report.

  • If you place in a prize winning rank but fail to do any of the above, then you will not receive a prize, and it will be awarded to the contestant with the next best performance who did all of the above.

Reliability Rating and Bonus

For challenges that have a reliability bonus, the bonus depends on the reliability rating at the moment of registration for that project. A participant with no previous projects is considered to have no reliability rating, and therefore gets no bonus. Reliability bonus does not apply to Digital Run winnings. Since reliability rating is based on the past 15 projects, it can only have 15 discrete values.