The goal of the ideation is to find public data sources about women throughout history and suggest how that data could be used to gain new insights for AI products and solutions with a concentration on women. The primary focus is to get a well thought-out overview of what is possible, what relevant data can be found and where, and how the data should be collected. The actual collection of data will be the subject of a future dataset challenge.
IVOW uses machine learning to identify and segment consumer audiences using public data around holidays, festivals, food, music, and arts. IVOW also sources new data via crowdsourced competitions. Their chatbot, Sina, and smart tool, CultureGraph (currently in Alpha), help enterprises to better segment, identify, and analyze consumer audiences; customize consumer messaging for diverse audiences; and unlock first-party data, new revenue streams, and areas of growth.
Our client aims to bring cultural and historic awareness into the world of artificial intelligence (AI) systems, data science, and machine learning (ML). Imagine future chatbots, smart speakers, and other conversational interfaces that are able to truly understand your culture and traditions, and can tell you interesting stories about your family and heritage. Think about ML models helping to better evaluate customer needs, and improve experiences based on the culture of your community and people.
Tailored datasets, suitable for development and training of culturally aware systems and models, could open many futuristic and exciting possibilities in customer service, healthcare, hospitality, tourism, banking, and other sectors. The focus on women is particularly important given the well-documented biases in current AI and ML datasets. Our client’s mission includes creation of such datasets, and today we are going to ideate around the development of a Women in History dataset.
Challenge Requirements

The vision for the Women in History dataset is to include profiles for historic (famous public figures, artists, scientists, etc.) and fictional (folklore, mythology, literature, etc.) female figures from all eras and cultures around the world. A profile will include the name(s) under which the persona is known; a textual abstract of about 100 words telling who she is and what she is known for; the country of birth or, for mythical figures, the region she is associated with; and links to the information source. Additional data can be incorporated later, but for now we are concentrating on these elements.
A successful response to the challenge would propose a machine-readable format focusing on women characters, which can be used, for example, to train a conversational model so that it can maintain a conversation and tell stories about these personas. In the challenge forum, you will find some samples of manually created profiles that meet our needs.
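As an illustration only, the profile elements listed above could be held in a small record and written out as one JSON object per line. This is a minimal sketch, not the format used by the forum samples: the class name `Profile`, its field names, and the helper `to_json_line` are assumptions made for this example, and the sample abstract is deliberately shorter than the ~100 words requested.

```python
import json
from dataclasses import dataclass, asdict, field
from typing import List


@dataclass
class Profile:
    """Hypothetical profile record; field names are assumptions, not a spec."""
    names: List[str]   # name(s) under which the persona is known
    abstract: str      # textual abstract, target length about 100 words
    region: str        # country of birth, or associated region for myths
    sources: List[str] = field(default_factory=list)  # links to the sources

    def word_count(self) -> int:
        # Whitespace-based count, good enough for a length check.
        return len(self.abstract.split())


def to_json_line(profile: Profile) -> str:
    # ensure_ascii=False keeps Arabic, Persian, or accented text readable.
    return json.dumps(asdict(profile), ensure_ascii=False)


# Illustrative record (abstract shortened for the example).
example = Profile(
    names=["Marie Curie", "Maria Skłodowska-Curie"],
    abstract=(
        "Physicist and chemist known for pioneering research on "
        "radioactivity and the first woman to win a Nobel Prize."
    ),
    region="Poland",
    sources=["https://en.wikipedia.org/wiki/Marie_Curie"],
)

line = to_json_line(example)
restored = Profile(**json.loads(line))
```

One JSON object per line keeps each profile independently parseable, which tends to be convenient when records come from several sources and languages.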
The questions we would like you to explore within this ideation include (but are not strictly limited to):
- What are the possible sources from which such data can be collected in an automated way? What licenses are applied to the data collected from each source? What are the legal and technical challenges, and limitations of data retrieval/scraping from each source? How do we ensure we are getting high quality, accurate data?
- For this phase, we are specifically soliciting stories or sources in English, Spanish, French, Arabic, and Persian. Please note your final report must be in English even if your examples are in another language.
- Many of these stories exist in text format in original languages, or they might be oral histories, paper archives, or even unstructured Wikipedia entries. How can the digital transformation of these stories into AI-suitable datasets represent an important step in helping future machines become aware of global cultures? Are there tools to help us with cultural contextualization?
- How can we structure collected data in the required format (e.g. how do we extract text abstracts ~100 words long from larger articles)? Can we collect some data from audio and video recordings, and then convert it to text format?
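One simple baseline for the abstract question above is to keep the leading sentences of an article until a word budget is reached. The sketch below assumes that the opening of an article summarizes its subject (often true of encyclopedia entries, not guaranteed elsewhere); the function name `lead_abstract` and the regular expression are choices made for this example, and the sentence splitter only recognizes `.`, `!`, and `?`, so other punctuation such as the Arabic question mark would need extra handling.

```python
import re


def lead_abstract(text: str, max_words: int = 100) -> str:
    """Return the leading whole sentences of `text`, up to `max_words` words.

    Naive extractive baseline: no summarization model is involved.
    """
    # Split after sentence-ending punctuation followed by whitespace.
    sentences = re.split(r"(?<=[.!?])\s+", text.strip())
    chosen = []
    count = 0
    for sentence in sentences:
        n = len(sentence.split())
        if not chosen and n > max_words:
            # A single over-long first sentence is cut at the word limit.
            return " ".join(sentence.split()[:max_words])
        if chosen and count + n > max_words:
            break
        chosen.append(sentence)
        count += n
    return " ".join(chosen)


sample = "A b c. D e f. G h i."
short = lead_abstract(sample, max_words=6)
```

With the sample text and a budget of six words, `short` holds the first two sentences. Abstracts produced this way would still need review for accuracy and for the licensing terms of the source text.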
In this fairly open-ended challenge, we ask you to consider these questions, as well as any other questions which may be relevant for you, and submit as a deliverable a concise written report about the outcomes of your research, your findings, and ideas. We encourage you to include some sample data from the sources, along with proof-of-concept code/scripts for data retrieval and processing -- submissions that include these will definitely get extra points during the review. Please remember that the primary focus is to produce a well thought-out overview of what is possible, what relevant data can be found and where, and how the data should be collected. The actual collection of data will be the subject of future potential challenges.
Each story should have:
- a unique ID
- an original source (where content was found)
- a cultural source (ethnicity)
- a time period tag indicating whether the story is modern (contemporary), historical (a story from the past, but not as applicable today), or mythical (a legend or piece of folklore)
- a relations tag (the story or source might be related to technology, science, medicine, art, etc.)
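The five story attributes above could be captured in a record like the following. This is a sketch under stated assumptions: the names `Story`, `TimePeriod`, and all field names are hypothetical, the ID is a random UUID because the brief does not prescribe an ID scheme, and the example values are placeholders rather than real data.

```python
import uuid
from dataclasses import dataclass, field
from enum import Enum
from typing import List


class TimePeriod(str, Enum):
    """The three time period tags named in the brief."""
    MODERN = "modern"
    HISTORICAL = "historical"
    MYTHICAL = "mythical"


@dataclass
class Story:
    """Hypothetical story record; field names are assumptions."""
    original_source: str   # where the content was found
    cultural_source: str   # ethnicity or culture of origin
    time_period: TimePeriod
    relations: List[str]   # e.g. ["science", "medicine"]
    # Assumed ID scheme: a random UUID in hex form.
    story_id: str = field(default_factory=lambda: uuid.uuid4().hex)

    def __post_init__(self) -> None:
        # Accept plain strings and reject unknown tags with ValueError.
        self.time_period = TimePeriod(self.time_period)


# Placeholder values for illustration only.
story = Story(
    original_source="https://example.org/story",
    cultural_source="example culture",
    time_period="mythical",
    relations=["art"],
)
```

Validating the tag at construction time means a misspelled value such as "mythic" fails early instead of silently entering the dataset.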
Scoring

Submissions to this challenge will be judged subjectively by Topcoder and client teams, based on their relevance to the goal explained above and the quality of the research performed. Some attention will be given to the quality of the presentation; we do not expect you to put any special effort into the visual design of your report, other than a clean and professional look.
As usual, should you have any doubts, questions, or suggestions, please do not hesitate to ask in the challenge forum!
Final Submission Guidelines

Report must be in English. Submit a ZIP file including the PDF version of your report, along with all source files you used to generate that PDF, and optionally any other relevant material you want us to see.
The report should consist of the following sections:
- Overview: Describe your approach in “layman's terms.”
- Methodology: Describe what you did to come up with this approach, e.g., literature search, experimental testing, etc.
- Materials: Did your approach use a specific technology? Any libraries? List all tools and resources you used.
- Discussion: Explain what you attempted, considered, or reviewed that worked, and especially what didn’t work or what you rejected. For any method that didn’t work, or was rejected, briefly include your reasoning (e.g., needs more data than we have). If you are pointing to someone else’s work (e.g., citing a well known implementation or literature), describe in detail how that work relates to this project, and what modifications would be needed.
- Data: What other data should one consider? Is the data in the public domain? Is it derived? Is it necessary in order to achieve the aims? Also, what about the data described/provided -- is it sufficient?
- Assumptions and Risks: What are the main risks of your approach, and what assumptions are you or the model making? Are there pitfalls in the dataset or approach?
- Results: Did you implement your approach? How did it perform? If you’re not providing an implementation, use this section to explain the EXPECTED results.
- Other: Discuss any other issues or attributes that don’t fit neatly above that you’d also like to include.