
    Quartz Energy Mud Log OCR Optimization

    PRIZES

    1st

    $2,000

    2nd

    $1,000

    3rd

    $500


    Challenge Overview

    According to Wikipedia, “Mud logging is the creation of a detailed record (well log) of a borehole by examining the cuttings of rock brought to the surface by the circulating drilling medium (most commonly drilling mud).”  Quartz Energy has provided Topcoder with a set of mud logs and we’re developing an application to extract structured meaning from these records.  The documents are very interesting -- they are even oil-well shaped!  You can read more details about them here.  For this challenge, 101 mud log image files are being offered as training data.  You can download this data set here.   

    If oil is revealed in a well hole sample, a “Show” may be recorded in the logs.  This is one of the most important pieces of information in the mud logs.  In a previous challenge, Topcoder member chok68 produced the winning submission, which we’re going to use as our baseline OCR solution.  The code for the previous challenge can be found here.

    Here is what the existing application already does:

    1. Creates a MySQL database designated by the DB_DATABASENAME parameter in the .env file.

    2. Iterates through all the mud log images in a directory designated on the command line.

    3. Extracts the raw text from each mud log image file.  

    4. Stores the raw text in a database along with the mud log image file name.

    5. Gives each image file a score based on the number of occurrences of the show phrases identified in the raw text, and stores the relevant phrases and scores in the database.

    6. Creates a summary report of the image filenames, scores, raw text, and extracted phrases, sorted by score descending.

    7. Creates a graph/plot which displays the highest-scoring image files.

    8. Full instructions on how to set up and execute the solution can be found in the README file in the root directory of the submission.  You can download the submission here.
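    The scoring and summary steps (5 and 6) can be sketched in Python.  The phrase list and point weights below are illustrative assumptions, not the actual rubric values; the real list and weights come from the challenge's scoring rubric.

    ```python
    # Hypothetical sketch of steps 5-6: count show phrases in each image's
    # OCR'd raw text, score the image, and sort the summary by score
    # descending.  SHOW_PHRASES and its weights are ASSUMPTIONS for
    # illustration only -- the real values are defined by the scoring rubric.
    SHOW_PHRASES = {"oil show": 3, "oil stain": 2, "oil trace": 1}

    def score_text(raw_text):
        """Return ({phrase: count}, total score) for one image's raw text."""
        text = raw_text.lower()                     # phrases are case-insensitive
        found = {p: text.count(p) for p in SHOW_PHRASES if p in text}
        score = sum(SHOW_PHRASES[p] * n for p, n in found.items())
        return found, score

    def summary(records):
        """records: iterable of (image_filename, raw_text) pairs."""
        rows = []
        for name, text in records:
            found, score = score_text(text)
            rows.append((name, score, found))
        rows.sort(key=lambda r: r[1], reverse=True)  # step 6: score descending
        return rows
    ```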

    Notes:

    Of all the tasks outlined above, task #3 is by far the most difficult.  Many of the images are of poor quality, and the text appears in a variety of fonts and layouts.
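    Because image quality is the main obstacle, simple preprocessing before OCR (grayscale conversion, binarization, deskewing) often helps.  As a minimal illustration, here is a pure-Python global threshold on a grayscale image represented as a 2-D list; a real solution would more likely apply the same idea with OpenCV or Pillow on the actual image data.

    ```python
    def binarize(pixels, threshold=128):
        """Map an 8-bit grayscale image (2-D list of 0-255 values) to pure
        black (0) and white (255) before feeding it to OCR.  A fixed
        threshold is the simplest scheme; adaptive methods such as Otsu's
        usually cope better with unevenly lit or degraded scans."""
        return [[0 if p < threshold else 255 for p in row] for row in pixels]
    ```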

    New Requirements

    In this challenge, we’re building on the previous solution and adding a few new requirements:

    1. Please add the following columns to the database schema:

      1. IMAGE_OCR_PHRASE.OCR_PHRASE_TYPE CHARACTER(10) NOT NULL

      2. IMAGE_OCR_PHRASE.OCR_PHRASE_COUNT INT NOT NULL DEFAULT 0

      3. IMAGE_OCR.PHRASE_COUNT INT NOT NULL DEFAULT 0

    2. There are 4 valid values for the IMAGE_OCR_PHRASE.OCR_PHRASE_TYPE field:  Show, Stain, Trace, Negative

    3. The scoring rubric has been updated to include some new terms and one additional type of phrase to identify: a negative phrase.  Here is the revised scoring rubric.  The current solution does NOT implement negative-case scoring, so you will be able to improve on the baseline score simply by implementing the negative case.  A Negative phrase is just a Show, Stain, or Trace phrase with the word "no" in front of it.  All phrases are case-insensitive.

    4. Your solution should populate the three new database fields requested in #1 above.  Each phrase of any type (Show, Stain, Trace, or Negative) counts as 1 in IMAGE_OCR.PHRASE_COUNT.  If you identify 3 phrases in an image:

         Oil Stain
         Oil Stain
         Oil Stain

       the fields would be populated as follows:

         IMAGE_OCR.PHRASE_COUNT = 3
         IMAGE_OCR.SCORE = 6
         IMAGE_OCR_PHRASE.OCR_PHRASE_TYPE = “Stain”
         IMAGE_OCR_PHRASE.OCR_PHRASE_COUNT = 3
         IMAGE_OCR_PHRASE.OCR_PHRASE = “Oil Stain”
         IMAGE_OCR_PHRASE.SCORE = 6

    5. Topcoder has manually inspected about 200 image files to determine the ground truth data that will be the basis for scoring the submissions.  We’re providing a subset of this data as training data.  Solutions will be evaluated against both the training data and an additional set of testing data that isn’t being provided in advance.  Here is the ground truth data for the images provided in the training data set.  Negative phrases are included in the accuracy count: although they don’t affect the scores in the IMAGE_OCR.SCORE column, they should appear in your IMAGE_OCR.PHRASE_COUNT totals.

    6. Scoring will be based on accuracy against the phrase counts.  Your application must find each of the phrases.  We’ll score each submission on the sum of the distances between its phrase counts and the ground truth phrase counts across all the images in the testing data set.  The submission with the lowest score is the most accurate.

    7. Submissions will be compared against each other on the accuracy score described above and ranked in accuracy order: the most accurate (the lowest cumulative distance score) receives a 10 and the next receives a 9 in the performance element of the scorecard.  The theoretical perfect accuracy score is 0, which would receive a final score of 10.  Please review the scorecard to see the weighting of the performance characteristics.  A tie in the accuracy scoring is possible, and we’ll allow a tie in that element of the competition.  Although the accuracy elements of the competition are heavily weighted, meeting the functional requirements and good coding style and practice are also important and could be decisive.

    8. Produce new images in an output folder, using the same file names, that highlight the phrases you found, as discussed here: https://stackoverflow.com/questions/20831612/getting-the-bounding-box-of-the-recognized-words-using-python-tesseract.  Ideally, these highlights are color coded Green, Light Green, Yellow, and Red to match the phrase type of the terms.  Please see the scoring rubric for details on the phrase types.
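    The Negative-phrase rule above can be sketched as follows.  The base phrases here ("oil show", "oil stain", "oil trace") are illustrative stand-ins; the real phrase list comes from the scoring rubric.  A phrase preceded by the word "no" is typed Negative instead of its base type, and matching is case-insensitive.

    ```python
    import re
    from collections import Counter

    # ASSUMED base phrases for illustration; the real list is in the rubric.
    BASE_PHRASES = {"oil show": "Show", "oil stain": "Stain", "oil trace": "Trace"}

    def classify_phrases(raw_text):
        """Return a Counter keyed by (phrase, type).  A base phrase preceded
        by the word "no" is counted as the Negative type instead; every
        occurrence of any type contributes 1 to IMAGE_OCR.PHRASE_COUNT."""
        counts = Counter()
        text = raw_text.lower()                     # case-insensitive matching
        for phrase, ptype in BASE_PHRASES.items():
            neg = len(re.findall(r"\bno\s+" + re.escape(phrase) + r"\b", text))
            total = len(re.findall(r"\b" + re.escape(phrase) + r"\b", text))
            if neg:
                counts[("no " + phrase, "Negative")] += neg
            if total > neg:
                counts[(phrase, ptype)] += total - neg
        return counts
    ```

    With this sketch, the three "Oil Stain" occurrences from the example in requirement #4 would yield a single ("oil stain", "Stain") key with count 3, and the sum of all counts would be the image's PHRASE_COUNT.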

     

     

    Technology Overview

    Python 3.6.x

    MySQL 5.7.+

     

    Final Submission Guidelines

    1. Please submit all code required by the application in your submission.zip

    2. Document the build process for your code, including all dependencies (pip installs, etc.).  Please update the existing README.md file as needed to allow for straightforward deployment of your solution.

    3. You may use any Python Open Source libraries or technologies provided they are available for commercial use. 

    4. Your solution will be deployed to an AWS Ubuntu 16.04 instance.

    Reliability Rating and Bonus

    For challenges that have a reliability bonus, the bonus depends on the reliability rating at the moment of registration for that project. A participant with no previous projects is considered to have no reliability rating, and therefore gets no bonus. Reliability bonus does not apply to Digital Run winnings. Since reliability rating is based on the past 15 projects, it can only have 15 discrete values.

    ELIGIBLE EVENTS:

    2017 TopCoder(R) Open

    REVIEW STYLE:

    Final Review:

    Community Review Board


    Approval:

    User Sign-Off


    CHALLENGE LINKS:

    Review Scorecard