One of the hurdles for any data science competition or internal development effort using machine learning technologies is to establish a baseline that you can use to validate your solution. In the video below, Topcoder’s Data Science Administrator, Tim Kirchner, describes the importance of ground truth information for predictive analytics and pattern recognition challenges at Topcoder.
Challenges with optical character recognition
I was recently faced with the challenge of generating ground truth for a set of images in order to improve and validate the optical character recognition (OCR) capabilities of an application being developed by the Topcoder Community. First, I examined the image tagging capabilities available in the marketplace; there's no point in building something custom if you can simply buy or rent that capability for a fraction of the time and cost. What I found, however, was surprising. For this data set, I needed to locate a designated set of text phrases and establish a bounding box around each phrase present on each image.
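To make that requirement concrete: each ground-truth entry pairs a text phrase with pixel coordinates on a specific image. A minimal sketch of such a record in Python (the field names here are my own illustration, not the actual schema the project used):

```python
from dataclasses import dataclass

@dataclass
class PhraseAnnotation:
    """One ground-truth entry: a phrase and its bounding box on an image.

    Field names are illustrative; the article does not describe the
    project's actual schema.
    """
    image_id: str
    phrase: str
    # Bounding box in pixel coordinates: top-left corner plus size.
    x: int
    y: int
    width: int
    height: int

# Example: the phrase "Invoice Date" located on one scanned page.
ann = PhraseAnnotation(image_id="scan_0001.png", phrase="Invoice Date",
                       x=120, y=48, width=180, height=22)
```

A full data set is then just a list of these records, one per phrase occurrence per image, which is exactly what a tagging platform or tool needs to capture.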
Comparing platforms: Amazon Mechanical Turk and CrowdFlower
The first vendor I reviewed was Amazon Mechanical Turk (AMT). AMT is cost-effective because you pay only for the transactions you execute, and its large crowd can complete work quickly. But the platform's templates are somewhat limited, and none of them seemed able to capture both the text phrases and the bounding boxes my task required.
The CrowdFlower platform also has some interesting image tagging capabilities. I was able to set up a template to get my images tagged along with the appropriate text, but after 1,000 transactions, the trial license for the platform expired and CrowdFlower requested a significant platform fee on top of the transaction cost required to pay workers. In my situation, the cost was prohibitive.
Cost-effective results through the Topcoder Community
So what next? Our client didn't have the personnel to mark the images themselves, nor any existing marked images in the proper format. Fortunately, crowdsourcing is a very flexible tool, and the Topcoder Community came to our rescue. We ran two challenges: the first to create our image marking tool, a desktop Python Qt application that interfaces with MySQL and Amazon S3; the second to mark the images.
Over the course of roughly two weeks, eight Topcoder members tagged 27,850 phrases on 475 images. And now, in addition to the valuable data set, my client has a custom image marking tool to use for future internal tagging efforts. Member payments for building the original Python marking client came to about $3,000; marking and reviewing the images cost about $6,000 more. The total cost to develop the tool was far less than the platform fees on CrowdFlower. The client was so pleased with the marking tool that they spent another $3,000 in member fees to make it more robust and add new functionality. It was a remarkable effort.
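With ground truth in hand, OCR output can be scored against it. One common approach (a standard metric, not necessarily the validation the project used) is intersection-over-union (IoU) between a predicted box and the ground-truth box, counting a detection as correct when the IoU clears a threshold:

```python
def iou(box_a, box_b):
    """Intersection-over-union of two boxes given as (x, y, width, height)."""
    ax, ay, aw, ah = box_a
    bx, by, bw, bh = box_b
    # Intersection rectangle (zero area if the boxes don't overlap).
    ix = max(ax, bx)
    iy = max(ay, by)
    ix2 = min(ax + aw, bx + bw)
    iy2 = min(ay + ah, by + bh)
    inter = max(0, ix2 - ix) * max(0, iy2 - iy)
    union = aw * ah + bw * bh - inter
    return inter / union if union else 0.0

# A predicted box that overlaps the ground-truth box fairly well:
truth = (120, 48, 180, 22)
pred = (125, 50, 175, 22)
match = iou(truth, pred) >= 0.5  # a typical acceptance threshold
```

Running a check like this across all 27,850 tagged phrases gives the kind of baseline accuracy figure the opening paragraph describes.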
Ready to leverage our expert data scientists?