Welcome to the Hitachi Time-Series Data Labeling Architecture and Algorithm Ideation Challenge. The client, Hitachi, Ltd., is a well-known Japanese multinational conglomerate and one of the leading companies in electronics manufacturing and ICT.
In today's world, Machine Learning (ML)/AI has become more important than ever. But one of the biggest bottlenecks in ML development is the need for large training datasets, which in most cases must be labeled by hand.
To overcome this issue, several new technologies have emerged for programmatically building and managing training datasets, such as Snorkel and Fonduer, originally developed by HazyResearch at Stanford. However, these are mostly focused on text data.
In this challenge, the client is looking for an innovative idea (new architecture and algorithms) with sample code to programmatically label time-series data.
We will be looking for two items in this challenge:
- A report explaining your new idea (architecture and algorithm)
- A sample program implementing your idea, which extracts features from the training data and labels the time-series CSV data that we provide
Please see the details of each item in the Goals and Submissions section below.
BONUS: We will hold a checkpoint and award five prizes of $100 each. Please see the “Checkpoint” section below for details. The client reserves the sole right to offer additional prizes if your idea shows great potential, even if you do not finish in the top 5.
Background on Time-Series Data Labeling
AI analysis of time-series data obtained from various sensors, and leveraging that analysis for business value such as productivity improvement and cost reduction, has never been more important. For example, by analyzing time-series data obtained from a vibration sensor or a temperature sensor attached to manufacturing equipment, equipment failure (or signs of failure) can be detected, reducing product loss due to spoiled work and lost business opportunities.
However, when analyzing such time-series data with AI, obtaining a large set of training data can become a major obstacle. Specifically, time-series data consists only of time and value pairs; the sections in which an abnormality occurred, or in which a sign of abnormality was captured, are not usually identified. Humans must therefore find the abnormal sections, or the sections showing signs of abnormality, in the time-series data and manually apply labels such as “Abnormality occurrence” or “Abnormal signs”, which takes an enormous amount of time.
Below is an image diagram of the labeling process.
Please note that the above data is just an example for illustration purposes. The actual CSV file has multiple sensor columns per timestamp, and you will need to cut, extract, and label based on these multiple columns.
The goal of this challenge is to come up with a good architecture and algorithm to programmatically label huge time-series datasets using a smaller set of training data.
Your idea (and program) also needs to be versatile across various types of time-series data. “Various types” means that your program must be able to label data from sensors attached to the human body (like heart sounds) and from sensors attached to manufacturing equipment (like vibration warnings) equally well.
-- Sample Data
We will provide two types of time-series sensor data, along with training data in which humans have labeled portions of them. The files will be provided in the forums.
- Time-series sensor data in CSV format, with the following columns:
- Sensor Data 1...N
- Naturally, the number of sensors (columns) varies per data type.
- Training data in CSV format, with the following columns:
- Start Time
- End Time
- Label
- Labels are specified as plain characters. In sample data No. 1 they are 0–8. Please note that the number of labels can be smaller or larger (e.g., just 3 for manufacturing equipment, emulating “inactive(1) / abnormal(2) / idling(3)”).
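Because the training data labels intervals rather than individual rows, a natural first preprocessing step is to expand each (Start Time, End Time, Label) interval into per-row labels aligned with the sensor timestamps. A minimal sketch (function name and data layout are illustrative assumptions, not part of the provided files):

```python
def expand_interval_labels(times, intervals):
    """times: sorted timestamps from the sensor CSV;
    intervals: (start, end, label) rows from the training CSV.
    Returns one label per timestamp; None where no interval applies."""
    labels = [None] * len(times)
    for start, end, label in intervals:
        for i, t in enumerate(times):
            if start <= t <= end:  # inclusive bounds are an assumption
                labels[i] = label
    return labels

# Toy example: two labeled intervals over six timestamps
print(expand_interval_labels([0, 1, 2, 3, 4, 5], [(1, 2, "0"), (4, 5, "3")]))
# [None, '0', '0', None, '3', '3']
```

Rows left as `None` are exactly the rows your algorithm must fill in (or deliberately leave unlabeled).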
-- Idea and Algorithm that you need to come up
The training data labels only portions of the actual time-series data. The objective is to label the time-series rows that are NOT yet labeled, based on the given training data.
The program should output a new CSV file (predicted.csv) with the times (rows) filled with labels.
Below is the overall diagram of what we are looking for:
- Labels in the training data are scattered. For example, if the target time-series data spans times 1 to 100, the training data might label only 10 to 40 and 60 to 70. This means you will need to come up with labels for the 1–10, 40–60, and 70–100 regions.
- You should also note that, in the normal case, there may be rows to which no label applies, since those rows do NOT exhibit any of the feature patterns labeled in the training data.
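One hedged sketch of this behavior (all names and the nearest-centroid choice are illustrative assumptions, not the required algorithm): learn a per-label feature summary from the labeled windows, assign each unlabeled window to the closest label, and leave it unlabeled when it is too far from every label's pattern.

```python
import math

def centroid(vectors):
    n = len(vectors)
    return [sum(v[i] for v in vectors) / n for i in range(len(vectors[0]))]

def dist(a, b):
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))

def label_windows(train, unlabeled, threshold):
    """train: {label: [feature vectors]} built from labeled sections;
    unlabeled: feature vectors of unlabeled windows.
    Windows far from every centroid stay unlabeled (None)."""
    cents = {lab: centroid(vs) for lab, vs in train.items()}
    out = []
    for v in unlabeled:
        lab, d = min(((lab, dist(v, c)) for lab, c in cents.items()),
                     key=lambda x: x[1])
        out.append(lab if d <= threshold else None)  # None = no matching pattern
    return out

train = {"abnormal": [[10.0], [11.0]], "idling": [[1.0], [0.5]]}
print(label_windows(train, [[10.5], [0.8], [100.0]], threshold=2.0))
# ['abnormal', 'idling', None]
```

The threshold is what lets the program say "this section matches none of the trained patterns", as required by the bullet above.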
-- The Report
We are looking for a report that clearly and logically explains why your idea is a great one, and how it is implemented in the actual running program.
The following are the key technical items; for each, we would like to know how your idea labels time-series data and why it is a great solution.
- How does your solution/idea cut time-series data into units? Is the unit length fixed or variable? Why did you choose your approach, and why is it better?
- Why can your solution/idea accurately extract valid features from multiple data columns that are correlated within a label's feature? Please refer to this article as an example if you are not familiar with feature extraction for time-series data.
- Why can your idea (architecture and algorithm) label data accurately?
- Versatility across multiple types of time-series data: without changes specific to each type, can it be applied to other types of time-series data, such as heart sounds, gait data, or data from manufacturing equipment?
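To make the "cutting into units" and "features from multiple columns" questions concrete, here is a minimal stdlib sketch (window size, step, and the mean/stdev features are assumptions for illustration): it slides a window over the rows and computes the same statistics for every sensor column, so the code is indifferent to how many columns a data type has.

```python
import statistics

def window_features(rows, size, step):
    """rows: list of [sensor1, ..., sensorN] samples.
    Returns per-window mean and population stdev of each column,
    so the same code works for any number of sensor columns."""
    feats = []
    for start in range(0, len(rows) - size + 1, step):
        win = rows[start:start + size]
        cols = list(zip(*win))  # transpose: one tuple per sensor column
        vec = []
        for col in cols:
            vec.append(statistics.fmean(col))
            vec.append(statistics.pstdev(col))
        feats.append(vec)
    return feats

rows = [[1, 10], [2, 12], [3, 14], [4, 16]]  # 4 samples, 2 sensors
print(window_features(rows, size=2, step=2))
# [[1.5, 0.5, 11.0, 1.0], [3.5, 0.5, 15.0, 1.0]]
```

A real submission would justify richer features (spectral, correlation across columns, etc.) and the fixed-vs-variable window choice; this only fixes the shape of the question.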
Additional rules for the report are written in the Submission section below.
Please note that the above are not the only items we are looking for in your report; they are simply the items we currently consider most important for understanding the differences and benefits of your idea. You are welcome to add other items if you see fit.
-- The Program
Please submit a program that implements your idea.
The program needs to read the time-series sensor data and the training data, and output the complete predicted CSV file. In the final evaluation, we will run your program against different data types with different numbers of sensors in order to evaluate accuracy, feasibility, and versatility.
The prediction CSV should have the following column format:
- Data name
- You just need to put the target time-series CSV file name in all rows
- Start Time
- End Time
- Label
- Probability (optional)
- Probability is provided only as a hint for the client, if your algorithm can output it; it will not be used to score your solution.
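A hedged sketch of the output step with the stdlib csv module (the column list above does not spell out the label column, but since the output must be "filled with labels", this illustrative writer assumes a Label column; all function names are assumptions):

```python
import csv
import io

def write_predictions(fh, data_name, segments, with_prob=True):
    """segments: list of (start, end, label, probability) tuples.
    Probability is optional per the spec, so the column can be omitted."""
    header = ["Data name", "Start Time", "End Time", "Label"]
    if with_prob:
        header.append("Probability")
    w = csv.writer(fh)
    w.writerow(header)
    for start, end, label, prob in segments:
        row = [data_name, start, end, label]
        if with_prob:
            row.append(prob)
        w.writerow(row)

# In practice fh would be open("predicted.csv", "w", newline="")
buf = io.StringIO()
write_predictions(buf, "target.csv", [(0, 10, "2", 0.91), (10, 25, "0", 0.77)])
print(buf.getvalue())
```

Writing the same data name on every row, as the spec requires, lets the client concatenate prediction files from multiple targets.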
Please provide a shell script to run your algorithm as follows:
$ predict.sh [target csv] [training csv] [predicted csv]
- [input] target csv : file name of the target sensor data
- [input] training csv : file name of the training data
- [output] predicted csv : file name of the csv your algorithm outputs
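One way to honor this contract (file and function names here, such as predict.py, are assumptions, not requirements) is to make predict.sh a thin one-line wrapper, e.g. `python3 predict.py "$@"`, around an entry point that receives the three file names:

```python
import sys

def main(argv):
    """Argument contract of: predict.sh [target csv] [training csv] [predicted csv]."""
    if len(argv) != 3:
        raise SystemExit("usage: predict.py TARGET_CSV TRAINING_CSV PREDICTED_CSV")
    target_csv, training_csv, predicted_csv = argv
    # A real implementation would:
    # 1. read sensor rows from target_csv
    # 2. read labeled (Start Time, End Time, Label) intervals from training_csv
    # 3. cut windows, extract features, fit the model, predict labels
    # 4. write the predictions to predicted_csv
    return target_csv, training_csv, predicted_csv  # placeholder for the sketch

if __name__ == "__main__":
    main(sys.argv[1:])
```

Keeping all paths as parameters (no hard-coded file names or column counts) is what lets the same script run unchanged on the different data types used in the final evaluation.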
Additional rules for the program are written in the Submission section below.
Note: Since we are planning follow-up contests to enhance the algorithm with real data, your program does not have to be a perfect solution in this challenge, especially regarding performance and scalability to every type of time-series data, as the Judging Criteria below indicates.
The following link is an example report describing one possible solution to this challenge. The paper presents a weak supervision framework for programmatically labeling time-series training data. However, its description is not specific enough for actual implementation; we are looking for more detailed documentation that can be applied to an actual implementation, along with actual running code, in this challenge.
The following are two algorithms that may be worth looking at, based on the client's initial research. Note that you are not limited to these ideas; they are provided only as information and hints to get you started.
- Weak supervision
The client predicts that weak supervision could be one of the best applicable architectures, as Snorkel and Fonduer use it.
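To show the weak supervision flavor in miniature (this is an illustrative toy, not Snorkel itself; the labeling functions and thresholds are invented for the example): several cheap, noisy labeling functions each look at a window and either vote for a label or abstain, and the votes are combined.

```python
from collections import Counter

ABSTAIN = None

def lf_high_mean(win):
    """Noisy rule: a large average suggests an abnormal section."""
    return "abnormal" if sum(win) / len(win) > 5 else ABSTAIN

def lf_flat(win):
    """Noisy rule: a nearly constant signal suggests inactivity."""
    return "inactive" if max(win) - min(win) < 0.1 else ABSTAIN

def majority_vote(window, lfs):
    """Combine labeling-function votes; all-abstain leaves the window unlabeled."""
    votes = [lf(window) for lf in lfs]
    votes = [v for v in votes if v is not ABSTAIN]
    if not votes:
        return ABSTAIN
    label, _ = Counter(votes).most_common(1)[0]
    return label

print(majority_vote([9, 9.5, 10], [lf_high_mean, lf_flat]))  # abnormal
print(majority_vote([1, 1, 1], [lf_high_mean, lf_flat]))     # inactive
print(majority_vote([1, 2, 3], [lf_high_mean, lf_flat]))     # None
```

Snorkel replaces the naive majority vote with a learned generative model over the labeling functions' accuracies and correlations, which is where most of the real value lies.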
- Auto-Encoder
The client also previously considered an Auto-Encoder, which can be used to detect anomalies and outliers in datasets after learning normal data, although the results were not accurate enough in limited testing. Some additional thought is also needed to support multiple labels. Please refer to this article for more details.
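The core Auto-Encoder idea, stripped of the neural network, is: learn a compact summary of normal data and flag windows that the summary reconstructs poorly. A stdlib stand-in (a single mean prototype instead of a learned encoder; the 0.5 threshold is an invented example value):

```python
import math

def fit_normal(windows):
    """Learn a summary of normal behavior: here, the element-wise mean window."""
    n = len(windows)
    return [sum(w[i] for w in windows) / n for i in range(len(windows[0]))]

def reconstruction_error(window, proto):
    """Distance between a window and its 'reconstruction' from the summary."""
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(window, proto)))

normal = [[1.0, 1.1], [0.9, 1.0], [1.1, 0.9]]
proto = fit_normal(normal)
print(reconstruction_error([1.0, 1.0], proto) < 0.5)  # True: looks normal
print(reconstruction_error([5.0, 5.0], proto) > 0.5)  # True: flagged as anomalous
```

This also makes the limitation the client observed concrete: reconstruction error only separates "normal" from "not normal", so distinguishing multiple labels (e.g. abnormal vs. idling) needs something beyond a single anomaly score.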
We will have a checkpoint submission in this challenge. We encourage everyone to submit to the checkpoint for the possibility of getting feedback on your idea from the client.
In the checkpoint submission, please include at least the “Overview” part, and let us know what programming language you will use.
You do not have to submit to the checkpoint to earn final prizes, but to qualify for the checkpoint prizes you must submit to the final round. The timeline and prizes for the checkpoint are as follows:
- Checkpoint Deadline: 4 Feb 2020, 19:00 EST
- Checkpoint Prize: $100 each for 5 submissions