1st place - $3000
2nd place - $2000
3rd place - $1500
4th place - $1000
Objective and Description
We are looking for a scalable solution for extracting information from Warranty Deeds. Specifically, a Vesting phrase describing the grantee, and a Legal Description phrase describing the property (see Example below). The provided data will include OCR (i.e., optical character recognition) blobs of text with spatial metadata, information about which tokens are tagged as entities, and corresponding text blob for extraction (without spatial metadata). Contestants may choose to leverage grammatical cues within the OCR content as needed.
Note that you will NOT have access to the original images.
There are 9957 data observations in total. Each observation will have associated OCR blobs and spatial metadata, tagged entities data, and Vesting and Legal Description OCR data for the target extraction.
For the evaluation purpose, we randomly split the data observations into 2 sets: (1) Training Set and (2) Test Set. The training Set has 7965 data observations. You are asked to use these for your model’s training. The remaining data observations form the Test Set. The test set is further randomly split evenly into the Provisional Test Set and Final Test Set, ending up with 996 data observations each.
You can download all the OCR blobs and the labels in the Training Set here. If you want to test your submission locally, you may want to further split the Training Set.
Some notes about the dataset: For each of the provided observations, you should find the following information, connected by unique ID:
Blob: It is the aggregated blob of text for all the pages in the document, pulled from tesseract ocr algorithm
Aggregated OCR: has the coordinate of bounding box info for each token on the page, page_number, confidence, language, etc.
Formatted entities: contains only the entities identified from that observation by using AWS Comprehend (Named entity recognition) with some formatting rules applied and the confidence score for that entity tagged under a category.
You may also see ‘raw_text_unformatted’ which is the raw text extracted from ocr, while ‘value’ is after applying some internal formatting rules
The OCR blobs will come from images such as this:
The expected Vesting phrase for above document: “Gerald M. Finnegan and Marie E. Finnegan, as Trustees of the Finnegan Family Trust, u/t/d January 8, 2010, a revocable living trust”
The expected Legal Description phrase for above document: “The following described land, situate, lying and being in Pasco County, State of Florida: Lot 27 of Fox Wood Phase One, according to the plat thereof, as recorded in Plat Book 34, Pages 54-70, of the Public Records of Pasco County, Florida.”
Your solution must be a function which accepts as input: the OCR blob, spatial metadata, and tagged entities data. The function must output the target extractions, Vesting and Legal Description.
The format of your output should follow the training_label.csv with proper column names.
Running Time Constraints
The inference of all test cases in Test Set should be able to finish in 5 hours on AWS EC2 p2.xlarge instance. The training must be able to finish within 48 hours on AWS EC2 p2.xlarge instance.
During the challenge, the leaderboard score will be computed based on the Provisional Test Set. After the challenge ends, we will switch to the Final Test Set and then update the leaderboard to pick the winners.
If you don’t have any extraction results from some OCR blobs, your scores on these OCR blobs will be 0.
For a single data observation, suppose the groundtruth phrase is X, and your prediction is Y, the score will be
Score(X, Y) = 100 * LCS(X, Y) * 2 / (|X| + |Y|)
Here, LCS(X, Y) refers to the longest common substring (at the token level) between X and Y. We will split the strings by whitespaces. |X| and |Y| are the number of tokens in these two phrases, respectively. For example, when X is “This is an example” and Y is “This is another example”, the LCS(X, Y) is 3 (i.e., “This is example”), and |X| = |Y| = 4. Therefore, Score(X, Y) = 100 * 3 * 2 / 8 = 75.
Due to some efficiency consideration, LCS(X, Y) is defined to be 0 if your prediction Y is more than 10 times longer than the expected result X.
We will compute the average score across all data observations and vesting and legal phrases. More details can be found in the scorer here. The scorer requires a list of IDs for different phases (Training, Provisional, and Final). It takes command line arguments for those input filenames. Please refer to the code for more details.
Requirements to Win a Prize
In order to receive a prize, you must do the following:
Achieve a score in the top 4, according to system test results calculated using the test dataset.
Within 7 days from the announcement of the contest winners:
Submit a complete 2-page (minimum) report that: (1) outlines your final algorithm and (2) explains the logic behind and steps of your approach.
Properly annotate and comment your code
If you place in the top 4, but fail to do all of the above, then you will not receive a prize, which will be awarded to the contestant with the next best performance who completed the submission requirements above.
This match uses a combination of the "submit data" and "submit code" submission styles. Your submission must be a single ZIP file with the following content:
/solution/solution.csv is the output your algorithm generates on the Test Set, including both Provisional and Final. The format of this file should follow the training_label.csv with proper column names.
/code contains a dockerized version of your system that will be used to reproduce your results in a well-defined, standardized way. This folder must contain a dockerfile that will be used to build a docker container that will host your system during final testing. How you organize the rest of the contents of the /code folder is up to you, as long as it satisfies the requirements listed below in the Final testing section.
Here is an example submission zip file. It should achieve a score about 2.614 in the Provisional test.
Please carefully document your solution, so we can easily repeat your solution on an AWS VM. Your docker image will be built and run on a Linux AWS instance, having this configuration:
Please see here for the details of this instance type.
This match is unrated.
Use the match forum to ask general questions or report problems, but please do not post comments and questions that reveal information about the problem itself or possible solution techniques.
Teaming is not allowed. You must develop your solution on your own. Any communication between members beyond what is allowed by the forum rules is strictly forbidden.
In this match you may use any programming language and libraries, including commercial solutions, provided Topcoder is able to run it free of any charge. You may also use open source languages and libraries, with the restrictions listed in the next section below. If your solution requires licenses, you must have these licenses and be able to legally install them in a testing VM. Submissions will be deleted/destroyed after they are confirmed. Topcoder will not purchase licenses to run your code. Prior to submission, please make absolutely sure your submission can be run by Topcoder free of cost, and with all necessary licenses pre-installed in your solution. Topcoder is not required to contact submitters for additional instructions if the code does not run. If we are unable to run your solution due to license problems, including any requirement to download a license, your submission might be rejected. Be sure to contact us right away if you have concerns about this requirement.
You may use open source languages and libraries provided they are equally free for your use, use by another competitor, or use by the client.
If your solution includes licensed software (e.g. commercial software, open source software, etc), you must include the full license agreements with your submission.
All software must be available for commercial use. Include your licenses in a folder labeled “Licenses”. Within the same folder, include a text file labeled “README” that explains the purpose of each licensed software package as it is used in your solution.
External data sets and pre-trained models are allowed for use in the competition provided the following are satisfied:
The external data and pre-trained models are unencumbered with legal restrictions that conflict with its use in the competition.
The data source or data used to train the pre-trained models is defined in the submission description.
Same as the software licenses, data must be unrestricted for commercial use.