Welcome to the LEIKA Regulation Recommendation Engine challenge! This is the first-ever Topcoder data science challenge of a new type, called Beacon, whose detailed rules are explained below. In short, we are asking for community feedback on the client problem, on the feasibility of solving it through a Topcoder competition, and on what the parameters of that competition should be to appeal to the community and achieve the outcomes expected by the client.
The Client Problem
The client has a collection of PDF documents with textual content, structured in a predictable way. There are about 60 documents of about 10 chapters each, with each chapter having about 20 paragraphs. Their content describes company protocols (regulations) for employees in different roles; for example, to assess and manage risks in everyday business operations the client employs Risk Officers, Chief Risk Officers, and Heads of Risk. The protocols document the responsibilities of each role, how each role should approach different situations, who reports to whom, who is responsible for whom, etc. A few sample documents are provided in the challenge forum in DOCX format. Within the client’s workflow, these documents are created in Word, hence DOCX, and later released for internal consumption in PDF format. To facilitate solution development we can provide samples as DOCX sources, but the final solution developed by the end of the entire project is expected to handle released PDFs without access to their DOCX sources.
The client is interested in developing an artificial intelligence (AI) system that can split these documents into fragments, learn and further deduce the associations between the fragments and the employee roles they relate to, and finally look up, by user role, the regulatory instructions relevant to that user. In other words, a search engine which, based on the user's roles, finds the instructions in the company's legal regulations that apply to the user. In the data science part of this project we focus specifically on the AI and machine learning part of this problem.
The user should be able to give the AI feedback on the search results, and the AI should take that feedback into account in future searches. This works in both directions: when a user marks some search results as irrelevant, similar results should no longer show up in future searches for that user (or possibly for all users with the same role); conversely, the user should be able to manually specify relevant regulatory fragments that the engine did not suggest, so that the AI remembers them and automatically includes similar fragments in future searches.
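To make the feedback loop described above more concrete, here is a minimal sketch in Python. It is not the client's method or a proposed solution: the class `RoleSearch`, the helper names, the token-overlap (Jaccard) similarity, and the 0.5 threshold are all assumptions made for illustration. A real system would presumably use a learned model and a stronger text similarity.

```python
from collections import defaultdict
from typing import Dict, List, Set, Tuple


def tokens(text: str) -> Set[str]:
    """Lowercased word set; a crude stand-in for a real text representation."""
    return set(text.lower().split())


def jaccard(a: Set[str], b: Set[str]) -> float:
    """Token-overlap similarity in [0, 1]."""
    return len(a & b) / len(a | b) if (a or b) else 0.0


class RoleSearch:
    """Hypothetical role-based lookup that adjusts scores using user feedback."""

    def __init__(self, fragments: Dict[str, str],
                 base_scores: Dict[Tuple[str, str], float]) -> None:
        self.fragments = fragments      # fragment_id -> fragment text
        self.base_scores = base_scores  # (role, fragment_id) -> model score
        self.feedback = defaultdict(list)  # role -> [(fragment_id, +1.0 / -1.0)]

    def add_feedback(self, role: str, fragment_id: str, relevant: bool) -> None:
        """Record that a user in `role` marked a fragment relevant or irrelevant."""
        self.feedback[role].append((fragment_id, 1.0 if relevant else -1.0))

    def score(self, role: str, fragment_id: str) -> float:
        """Base score, raised or lowered by similarity to each rated fragment."""
        total = self.base_scores.get((role, fragment_id), 0.0)
        target = tokens(self.fragments[fragment_id])
        for rated_id, sign in self.feedback[role]:
            total += sign * jaccard(target, tokens(self.fragments[rated_id]))
        return total

    def search(self, role: str, threshold: float = 0.5) -> List[str]:
        """Fragment ids scoring at or above the threshold, best first."""
        scored = [(self.score(role, fid), fid) for fid in self.fragments]
        return [fid for s, fid in sorted(scored, reverse=True) if s >= threshold]
```

In this sketch feedback is stored per role, which corresponds to the "all users with the same role" option mentioned above; keying the feedback by user instead would give the per-user variant.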
A few extra points:
- To be concise, we talk about “fragment to employee role” mapping throughout the documentation, but we imply that the solution will also be able to map and keep track of additional related information. For example, we may want it to associate document fragments with the following data:
  - A continuous fragment of text from an input document.
  - References to the document, chapter, page, and subsection the fragment comes from, along with the titles of that document, chapter, and subsection.
  - The business branch relevant to the fragment (e.g. entire company, global management, investment division, etc.).
  - The employee role relevant to the fragment (e.g. head of business, head of business area, risk manager, etc.).
  - Job functions relevant to the fragment (e.g. risk management, audit, finance, etc.).
- Suggested ground truth data: paragraphs mapped to the roles by a human. Currently about 100 data points.
- Preferred tech stack: Python 3.6 on the data platform Dataiku on MS Azure Double. At the same time, the client does not oppose alternative technologies.
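As an illustration of the per-fragment data listed in the first point above, the record could look like the following Python sketch. The type `FragmentRecord`, its field names, and the helper `fragments_for_role` are hypothetical; the client has not specified a schema. `typing.NamedTuple` is used because it is available in Python 3.6.

```python
from typing import List, NamedTuple


class FragmentRecord(NamedTuple):
    """Hypothetical record for one document fragment and its associations."""
    text: str                     # continuous fragment of text
    document_title: str           # where the fragment comes from
    chapter_title: str
    subsection_title: str
    page: int
    business_branches: List[str]  # e.g. "investment division"
    employee_roles: List[str]     # e.g. "risk manager"
    job_functions: List[str]      # e.g. "risk management", "audit"


def fragments_for_role(records: List[FragmentRecord], role: str) -> List[FragmentRecord]:
    """Return the records whose employee roles include the given role."""
    return [r for r in records if role in r.employee_roles]
```

Keeping branches, roles, and job functions as lists allows one fragment to apply to several roles at once, which seems likely for regulations that describe reporting lines between roles.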
Beacon Challenge Rules
Once registered for the challenge, look into the challenge forum. A number of discussion threads are open there with questions about the feasibility of solving the client's problem with the data they have while meeting their expectations. If it is deemed feasible, the intention is to prepare and run the main competition as a “first-to-finish” data science competition: the first solution to achieve the set performance threshold will win. As part of the present Beacon challenge we want to discuss the best way to benchmark solution performance, and what the winning threshold should be.
To participate in this Beacon competition, simply share your thoughts in the forum threads and take part in the discussion there. You are encouraged to upvote or downvote the ideas of other participants (while keeping in mind the Topcoder Code of Conduct). As the discussion progresses, the copilot will draft, iteratively elaborate, and share with you the challenge details for the main competition, so that your feedback can be used to improve them further. This iterative work will continue until the discussion and preparation converge on the final rules of the main challenge, or the project is deemed infeasible.
To reward your participation in this Beacon competition, the total prize pool of $900 will be distributed by the copilot among the most active and useful contributors to the discussion, based both on the up- and down-votes in the forum and on the copilot's subjective judgement (notice: the header of the challenge page shows the prizes as $750 for first and $250 for second place, due to platform limitations, but the actual prize distribution will be at the copilot's discretion, as said here). Please keep in mind that by participating in this Beacon competition you not only get the chance to earn a prize right away, but also contribute substantially to the future main challenge, which benefits the entire data science segment of the Topcoder community, and you personally if you decide to take part in the main challenge.