
Word-level Language Identification Python Challenge

Key Information

The challenge is finished.

Challenge Overview

Challenge Objectives

The objective of this challenge is to build a Python-based tool that takes an input text, detects the language of each word, and returns the text with each word annotated with its identified language.

The target languages are English, Hindi and Spanish. The Hindi will NOT be in Devanagari script; it will be in Romanized script (i.e. it can be read by anyone who can read English), and the same goes for Spanish. The Spanish text, however, can contain accented Spanish letters.

The script primarily needs to do the following:

  • Take a list of text as input: The tool should accept a list of texts, where each piece of text is processed independently.

  • Perform language detection (only English, Hindi and Spanish): For each piece of text in the list, perform language detection on each word.

  • Return the list of annotated text: Once detection is complete, the tool should return the list of texts with each word language-annotated.

Project Background

The goal of the project is to create an API that takes text as input and performs language detection on it at the word level. The target languages are Hindi, English and Spanish.

To elaborate, in this project we intend to build an API that takes a piece of text as input (a user utterance for a chatbot that the client is building internally) and identifies which language the user is communicating in.

Note - in this challenge we are NOT building the API itself. We are focusing on building the core algorithm that can perform language detection with high objective performance. Once this core algorithm is built, the API will be built around it in the next challenge.

Challenge Details

The following are the requirements of the challenge, in a bit more detail:

  • Take a list of text as input: The tool should accept a list of texts, where each piece of text is processed independently. Here, each piece of text can be anywhere from 1 word to several words. We don't have an upper limit on the number of words in a piece of text, but the text will generally not be very long. The text will generally be inputs to a chatbot that the client is building internally.

  • Perform language detection (only English, Hindi and Spanish): For each piece of text in the list, perform language detection on each word. The detection should be performed via the following steps:

    • Clean user utterance text, i.e. remove any unnecessary special characters.
    • Identify the language of each constituent word in the text. The languages can be English, Hindi or Spanish.

    Important Note - each text can contain one of the following combinations of languages: English-only, Spanish-only, Hindi-only, English + Hindi, or English + Spanish. Hindi + Spanish is NOT expected.

  • Return the list of annotated text: Once detection is complete, the tool should return the list of texts with each word language-annotated. For reference, this is how the output for each piece of text should look:

    * Input text: Hola!! Good morning!!
    
    * Clean text (Intermediate step): Hola Good morning
    
    * Output/ Language Annotated text: Hola_spa Good_eng morning_eng

Here, after each word, we append the language code in the format _language-code. The language codes are as follows:

English (eng)
Hindi (hin)
Spanish (spa)
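
The cleaning and annotation steps above can be sketched as follows. This is only a minimal illustration, not a prescribed solution: the names clean_text, detect_word, annotate and TOY_LEXICON are hypothetical, the lexicon lookup is a stand-in for a real detector, and the choices to strip every non-alphanumeric character and to default unknown words to "eng" are assumptions.

```python
import re
import unicodedata

# Hypothetical toy lexicon, for illustration only. A real detector would
# replace this lookup with a trained model or a much larger word list.
TOY_LEXICON = {"hola": "spa", "good": "eng", "morning": "eng", "namaste": "hin"}


def clean_text(text: str) -> str:
    """Replace special characters with spaces, then collapse whitespace."""
    # Normalize to NFC so accented letters are single code points.
    normalized = unicodedata.normalize("NFC", text)
    # For str patterns, the regex word class is Unicode-aware, so accented
    # Spanish letters survive. Underscores are removed explicitly because
    # the output format uses "_" as the separator before the language code.
    no_specials = re.sub(r"[^\w\s]|_", " ", normalized)
    return " ".join(no_specials.split())


def detect_word(word: str) -> str:
    """Placeholder detector: lexicon lookup with an assumed default of 'eng'."""
    return TOY_LEXICON.get(word.lower(), "eng")


def annotate(text: str) -> str:
    """Return the cleaned text with _language-code appended to each word."""
    words = clean_text(text).split()
    return " ".join(f"{word}_{detect_word(word)}" for word in words)


print(annotate("Hola!! Good morning!!"))  # Hola_spa Good_eng morning_eng
```

Keeping the cleaning step separate from the per-word detector means the detector can be swapped out without touching the output format.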

Error Handling and Configurability

The code should have appropriate checks and error handling so that execution continues even if an error occurs during the detection process. This is because the future API that will use this code should not stop working because of an error in the detection process.
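
One possible way to meet this requirement is to isolate failures per piece of text, as in the sketch below. The function name annotate_all, the annotate_one callback, and the fallback of returning an empty string for a failed item are all assumptions made for illustration; the challenge does not specify what a failed item should return.

```python
import logging
from typing import Callable, List

logger = logging.getLogger(__name__)


def annotate_all(texts: List[str], annotate_one: Callable[[str], str]) -> List[str]:
    """Annotate each text independently so one failure never stops the batch."""
    results = []
    for index, text in enumerate(texts):
        try:
            if not isinstance(text, str):
                raise TypeError(f"expected str, got {type(text).__name__}")
            results.append(annotate_one(text))
        except Exception:
            # Assumed fallback: log the problem and emit an empty string so
            # the output list stays aligned with the input list.
            logger.exception("Language detection failed for item %d", index)
            results.append("")
    return results


def fragile_annotator(text: str) -> str:
    """Hypothetical annotator that fails on blank input, to show the fallback."""
    if not text.strip():
        raise ValueError("empty text")
    return text.upper()
```

With this wrapper the output list always has the same length as the input list, which a calling API can rely on when mapping results back to requests.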

Note that if any configurable values are used, they should be configurable through a config.json file. In general, any value that can be configured at the level of the entire future API (and not at the level of an individual input or output) should be put inside the config.json file.
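
A minimal sketch of loading such a file is shown below. The keys "languages" and "fallback_language" and the helper name load_config are hypothetical, since the challenge does not prescribe any particular configuration values. Falling back to built-in defaults when the file is missing or malformed is also an assumption, chosen to stay consistent with the error-handling requirement.

```python
import copy
import json
from pathlib import Path

# Hypothetical defaults; the challenge does not prescribe any config keys.
DEFAULT_CONFIG = {
    "languages": ["eng", "hin", "spa"],
    "fallback_language": "eng",
}


def load_config(path: str = "config.json") -> dict:
    """Return the defaults, overridden by any values found in the JSON file."""
    config = copy.deepcopy(DEFAULT_CONFIG)
    try:
        loaded = json.loads(Path(path).read_text(encoding="utf-8"))
        if isinstance(loaded, dict):
            config.update(loaded)
    except (OSError, ValueError):
        # Missing, unreadable, or malformed file: keep the defaults so that
        # execution continues. json.JSONDecodeError is a ValueError subclass.
        pass
    return config
```

A config.json containing only {"fallback_language": "spa"} would override that one key and leave the other defaults in place.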

Review Criteria

The review will combine an objective review and a subjective review:

  • Objective review - The criteria for measuring the performance of the language detection will be shared in the forum shortly after the launch of the challenge.

  • Subjective review - The documentation will be reviewed to understand the strategy used to generate the output. The code will also be reviewed to gauge overall code quality, such as code structure and an appropriate amount of comments. Submitters are expected to write readable code with appropriate comments, particularly in non-trivial sections of the code base. The documentation should also be clear and easy to follow.

  • The weightage will be as follows:
    • Objective Review - 60%
    • Subjective Review - 40%

Technology Stack

  • Deployment Environment - Deployable on any platform that can execute Python via a terminal or command line

  • Language - Python 3 (preferably 3.9+)

What To Submit

  • Source code - The source code of the script
  • Documentation - Detailed instructions to deploy the solution from scratch. It should also detail the approach used in the solution.
  • Verification Instructions - Steps to verify the solution should be communicated. This can be via text, or preferably via a demo video.

ELIGIBLE EVENTS:

2022 Topcoder(R) Open

REVIEW STYLE:

Final Review:

Community Review Board

Approval:

User Sign-Off

ID: 30234208