Challenge Overview

Welcome to the Panel Schedule Importer – Image De-Noise & In-Paint challenge. This contest is part of the Panel Schedule Importer series, in which our client attempts to extract data from PDFs and images using OCR and machine learning. The goal of this challenge is to remove one or more types of annotation artifacts that obstruct the OCR process. We refer to these artifacts as “noise”.

 

This contest has bonus payments associated with it. Please read the specification below.

 

Project Overview

Electrical engineers send our client, a Fortune Global 500 company specializing in energy management, technical documents (in PDF format) that include one or more electrical panel configuration(s). Configuration data within these documents describes the type and amp rating of circuit breakers, a description of the equipment served by a breaker, and general specifications for the panel itself. These sections of the document are called schedules and are depicted in table format.

 

Importing data from the schedule tables is difficult for a few reasons. For one, various engineering firms use different table formats to articulate this information. Additionally, there is significant variation in the quality and/or completeness of table images.

 

This project seeks to automate the task of identifying the schedules in the PDFs and extracting meaningful data from them.

 

Contest Details

This is the third challenge in the Panel Schedule Importer series where we are attempting to reduce the amount of human intervention required to extract panel specifications from technical documents to prepare customer quotes.

 

Panel schedules within a PDF are sometimes delivered in a ‘clean’ format that is ready for immediate OCR processing. In some cases, however, schedules are annotated with superimposed notes (usually digital graphics created with PDF editing software – but may also be handwritten notes) that obscure important text and interfere with OCR.

 

For this challenge, we will provide you with 36 sample images in BMP format. Each image contains one or more schedule tables. Your task is to provide a solution for improving image quality and preparing images for OCR data capture. Images must be processed to remove (or reduce) background noise in order to improve the accuracy of OCR.

 

There are several common types of noise that need to be removed, including:

  1. bubble-shaped graphic overlays

  2. geometric shape graphic overlays (which typically have a number at the center)

  3. background noise or poor contrast resulting from a poor image scan

  4. handwritten text and/or shapes
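
For the third noise type (poor contrast from a weak scan), one common starting point is global thresholding. The sketch below is illustrative only and is not the client's method: it implements Otsu's threshold in plain Python on a grayscale image held as rows of 0-255 integers. The names `otsu_threshold` and `binarize` are hypothetical, and a real submission would more likely rely on an image library with a permitted license.

```python
from typing import List, Optional, Sequence


def otsu_threshold(pixels: Sequence[int]) -> int:
    """Return the 0-255 threshold that maximizes between-class variance."""
    hist = [0] * 256
    for p in pixels:
        hist[p] += 1
    total = len(pixels)
    sum_all = sum(i * h for i, h in enumerate(hist))

    best_t, best_var = 0, -1.0
    w_bg = 0    # pixel count at or below the candidate threshold
    sum_bg = 0  # intensity sum of those pixels
    for t in range(256):
        w_bg += hist[t]
        sum_bg += t * hist[t]
        if w_bg == 0:
            continue  # no dark pixels yet
        w_fg = total - w_bg
        if w_fg == 0:
            break  # every pixel is already in the dark class
        mean_bg = sum_bg / w_bg
        mean_fg = (sum_all - sum_bg) / w_fg
        var_between = w_bg * w_fg * (mean_bg - mean_fg) ** 2
        if var_between > best_var:
            best_var, best_t = var_between, t
    return best_t


def binarize(rows: List[List[int]], threshold: Optional[int] = None) -> List[List[int]]:
    """Map each pixel to 0 (ink) or 255 (paper) using the given or Otsu threshold."""
    flat = [p for row in rows for p in row]
    t = otsu_threshold(flat) if threshold is None else threshold
    return [[0 if p <= t else 255 for p in row] for row in rows]
```

A single global threshold assumes the scan is evenly lit; unevenly lit scans usually need a locally adaptive threshold instead.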

 

If we treat the removal of each noise type as a filter, up to four filters can be applied to the images provided, one per noise type. For this contest, you are required to implement at least one of the first two filters listed above (bubble-shaped or geometric-shape overlays). Implementing just one is fine; implementing both earns a bonus payment of $600. These filters will be reviewed under Major Requirements in the scorecard. The third and fourth filters (background noise and handwritten text) are optional; if implemented, they will be reviewed under Minor Requirements in the scorecard, and each earns a bonus payment of $150. Bonuses are paid only if your submission receives first or second place, and are paid in addition to the prize for that placement.

 

This implies that not all 36 images that we share will be relevant for your submission. Only those images that correspond to the filter category implemented by your submission will be used during review.

 

Additionally, these overlays and other forms of noise sometimes obscure text content or make it unreadable by OCR. Similarly, the process of reducing “noise” may itself cause text content to be lost. In these situations, an algorithmic or machine learning approach should be used to automatically detect missing content and attempt to “fill in” any missing data. This is optional. If implemented, it earns an additional payment of $400, provided the submission receives first or second place. This is in addition to the prize for that placement.

 

You will find examples of the types of noise in a document shared in the contest forum.

 

The goal of this challenge is to provide an image pre-processing application to aid in downstream OCR processing. The following article details a similar application, written in Python, for use with Tesseract OCR – and may serve as a resource for competitors: http://www.pyimagesearch.com/2017/07/10/using-tesseract-ocr-python/

 

For this contest, a successful submission should perform the following functions:

  1. Accept a rasterized image (.BMP format) as input

  2. Test/evaluate image quality to determine whether the image contains annotations, overlays or other background noise that interfere with the legibility of text

    1. If image does not contain background noise, proceed with next stage of processing

    2. If image contains background noise, remove or reduce all noise that would interfere with/ reduce the accuracy of OCR & proceed with next stage of processing

  3. Test/evaluate the resulting “de-noised” image for missing or obfuscated text

    1. If image is ‘complete’ (does not contain missing content or obfuscated text), flag the file as ‘ready’ for OCR processing

    2. If critical text content is missing from the image, use surrounding context to populate missing data with appropriate content using a “best match” approach

  4. Output the fully processed image in BMP format, preserving the native image dimensions and resolution

  5. Output a CSV log file that contains details on the image processing. At a minimum, the columns should include Original Filename and Output Filename, as well as indicators of whether the image contained noise and whether text was missing and required populating. Optionally, your log file could include data quantifying the “amount of noise” detected and/or a confidence score for re-populating text. An ideal solution might also contain additional details like image dimensions and file size.
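
The steps above can be tied together by a small driver that walks a folder of BMP files, applies a filter, and writes the CSV log. The following is a minimal sketch under stated assumptions: only the Original Filename and Output Filename column names come from this specification, while the other two column names, the `_clean` output suffix, and the `passthrough_filter` placeholder are hypothetical stand-ins for your own implementation.

```python
import csv
from pathlib import Path
from typing import Callable, Tuple

# Only the first two column names are given by the spec; the rest are assumed.
LOG_COLUMNS = ["Original Filename", "Output Filename",
               "Noise Detected", "Text Repopulated"]

# A filter takes raw BMP bytes and returns
# (processed bytes, noise_was_found, text_was_filled_in).
Filter = Callable[[bytes], Tuple[bytes, bool, bool]]


def passthrough_filter(data: bytes) -> Tuple[bytes, bool, bool]:
    """Placeholder filter: returns the image unchanged and reports no noise."""
    return data, False, False


def process_directory(src: Path, dst: Path, log_path: Path,
                      image_filter: Filter = passthrough_filter) -> int:
    """Process every .bmp file in src, write results to dst, and log each one."""
    dst.mkdir(parents=True, exist_ok=True)
    count = 0
    with log_path.open("w", newline="", encoding="utf-8") as fh:
        writer = csv.DictWriter(fh, fieldnames=LOG_COLUMNS)
        writer.writeheader()
        # Match the extension case-insensitively (.bmp and .BMP both count).
        images = sorted(p for p in src.iterdir()
                        if p.is_file() and p.suffix.lower() == ".bmp")
        for bmp in images:
            processed, noisy, filled = image_filter(bmp.read_bytes())
            out = dst / f"{bmp.stem}_clean.bmp"
            out.write_bytes(processed)
            writer.writerow({
                "Original Filename": bmp.name,
                "Output Filename": out.name,
                "Noise Detected": "yes" if noisy else "no",
                "Text Repopulated": "yes" if filled else "no",
            })
            count += 1
    return count
```

Calling `process_directory(Path("samples"), Path("out"), Path("out/log.csv"))` would handle the whole image set in one run; the same function also works for a folder holding a single image if you prefer one call per BMP.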

 

Points to Note

  • Since your solution will likely involve complex dependencies and environment setup, we would like your solution to include a Dockerfile. This is not required, but it should eliminate environmental discrepancies that may prevent judges from reviewing your submission.

  • You are only allowed to make use of MIT licensed, BSD licensed, Mozilla licensed or Apache licensed libraries in your solutions.

  • Optional: You may wrap your solution in a script that will cycle through the entire set of 36 BMP images or you may allow for it to be called one time for each BMP.

  • Note that you can submit your solution in Python, Java or C#

  • The major requirements of this contest are:

    • Implement the bubble-shaped and/or geometric-shape removal filters. In implementing the filters, you are expected to reduce or remove the noise that would reduce the accuracy of OCR processing of the image in future stages.

    • Output the processed image in BMP format, preserving the native image resolution and dimensions

    • Output a CSV log file that contains details on the image processing. The details of the columns in the log file have been described above.

    • If implemented, the optional fill-in feature (filling in missing content after reducing noise) will also be reviewed as a major requirement. Because the feature is optional, reviewers will not deduct points if it is not implemented.
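
The requirement to preserve the native image dimensions and resolution can be checked by comparing the headers of the input and output files. The sketch below is one possible check, not a prescribed one: it assumes the common BMP layout with a 14-byte file header followed by a BITMAPINFOHEADER of at least 40 bytes, and the names `BmpInfo`, `read_bmp_info` and `same_geometry` are hypothetical.

```python
import struct
from typing import NamedTuple


class BmpInfo(NamedTuple):
    width: int
    height: int
    bits_per_pixel: int
    x_pixels_per_meter: int
    y_pixels_per_meter: int


def read_bmp_info(data: bytes) -> BmpInfo:
    """Parse size and resolution fields from the start of a BMP file."""
    if len(data) < 54 or data[:2] != b"BM":
        raise ValueError("not a BMP file with a full 54-byte header")
    # The DIB header starts at byte 14; its first field is its own size.
    (dib_size,) = struct.unpack_from("<I", data, 14)
    if dib_size < 40:
        raise ValueError("unsupported BMP header variant")
    # width, height (signed), planes, bits per pixel start at byte 18.
    width, height, _planes, bpp = struct.unpack_from("<iiHH", data, 18)
    # Horizontal and vertical resolution (pixels per meter) start at byte 38.
    x_ppm, y_ppm = struct.unpack_from("<ii", data, 38)
    # A negative height only means the rows are stored top-down.
    return BmpInfo(width, abs(height), bpp, x_ppm, y_ppm)


def same_geometry(original: bytes, processed: bytes) -> bool:
    """True if both images share pixel dimensions and stored resolution."""
    a, b = read_bmp_info(original), read_bmp_info(processed)
    return (a.width, a.height, a.x_pixels_per_meter, a.y_pixels_per_meter) == \
           (b.width, b.height, b.x_pixels_per_meter, b.y_pixels_per_meter)
```

Bit depth is deliberately left out of the comparison, since a de-noising step may legitimately convert a color scan to grayscale or black and white; add it to the tuple if your reviewers expect it to match as well.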



Final Submission Guidelines

  • Include a detailed deployment guide (a README file is also fine as long as it contains deployment instructions) along with your source code and upload it to Topcoder.

  • Don't forget to include an unlisted link to your video that shows your solution in action.

  • Please state which of the four filters your submission implements.

ELIGIBLE EVENTS:

2018 Topcoder(R) Open

REVIEW STYLE:

Final Review:

Community Review Board

Approval:

User Sign-Off

ID: 30058878