Challenge Overview
Project Overview
We are launching a series of fun challenges intended for learning new technologies and the Topcoder platform. This week's challenge is about data science.
Important:
This is a fun and learning challenge. No prizes will be awarded for completing the challenge.
Overview
A large amount of data has been, and continues to be, collected from the vehicles that ABC Corp uses to deliver cargo to customers. The data includes On-board Computer (OBC) alarms that record exceptional driving events such as excessive speed, speed changes, and tractor stability during operation. Existing OBC data is correlated with other information such as time of day, driver status, route details, cargo, and weather conditions to provide a broad spectrum of data related to ABC Corp deliveries.
Since accidents are extremely rare and since the ideal objective would be to prevent all accidents, OBC alarms (occurring on fewer than 5% of all trips) are considered an important factor in managing safety. Preparation for the current match assumes that the ability to use correlated data to anticipate and thus reduce OBC events will further increase the safety of trips.
Problem Statement
The current challenge will be successful when the community provides algorithmic solutions that, when run, can identify the top 500 trips in a data set that are most likely to involve alarms from on-board computers. These algorithms will ultimately provide input to the logistical planning system used by ABC Corp.
Overview of Data
source | pilot
dist | pilot2
cycles | pilot_exp
complexity | pilot_visits_prev
cargo | pilot_hours_prev
stops | pilot_duty_hrs_prev
start_month | pilot_dist_prev
start_day_of_month | route_risk_1
start_day_of_week | route_risk_2
start_time | weather
days | visibility
pilot | Risk_involved
The target variable Risk_involved is the aggregation of all OBC events. In the training data set it has the levels "n" and "r", which mean not risky and risky respectively. The training data set contains around 80K records. The test data set has around 42K records. You need to find the top 500 trips that are most likely to be risky, i.e. your submission file should have 500 records. Your output will be in the following format.
Here is one sample output record that is expected (a single record, shown in four parts for width):
source | dist | cycles | complexity | cargo | stops | start_month | start_day_of_month
L04 | 267 | 1 | 14 | 5 | 2 | 10 | 20

start_day_of_week | start_time | days | pilot | pilot2 | pilot_exp | pilot_visits_prev
7 | 1632 | 0.33 | 17355 | 0 | 3 | 1

pilot_hours_prev | pilot_duty_hrs_prev | pilot_dist_prev | route_risk_1 | route_risk_2
17.6 | 13.1 | 942.9 | 97 | 209

weather | visibility | Risk_involved | Prob
2 | 8.466666667 | r | 0.52
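As a rough illustration of producing a file in this format, the sketch below picks the 500 highest-probability trips from an already scored test set and writes them out. It assumes a hypothetical input CSV that holds the test columns plus a Prob column from your own model; the function names, the file paths, and the choice to label every selected trip "r" are assumptions, not part of the challenge specification.

```python
import csv
import heapq


def top_k_risky(rows, k=500, prob_key="Prob"):
    """Return the k rows with the highest predicted probability of risk."""
    return heapq.nlargest(k, rows, key=lambda row: float(row[prob_key]))


def write_submission(scored_path, out_path, k=500):
    """Write the top-k trips from a scored CSV (hypothetical file) to out_path."""
    with open(scored_path, newline="") as src:
        reader = csv.DictReader(src)
        fieldnames = list(reader.fieldnames)
        rows = list(reader)

    best = top_k_risky(rows, k)

    # Assumption: every submitted trip is labelled risky ("r").
    if "Risk_involved" not in fieldnames:
        fieldnames.append("Risk_involved")
    for row in best:
        row["Risk_involved"] = "r"

    with open(out_path, "w", newline="") as dst:
        writer = csv.DictWriter(dst, fieldnames=fieldnames)
        writer.writeheader()
        writer.writerows(best)
```

A call such as `write_submission("test_scored.csv", "submission.csv")` would then yield a 500-record file, provided the scored file has at least 500 rows.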
Your submission will be evaluated as: score = 100 * precision over the 500 submitted trips.
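To make the scoring arithmetic concrete, here is a minimal sketch of how such a score could be computed once the true labels of the submitted trips are known. The function name and the handling of short submissions (dividing by the number of trips actually submitted) are assumptions; the official scorer may differ.

```python
def precision_score_at_k(actual_labels, k=500, positive="r"):
    """Return 100 * precision over the first k submitted trips.

    actual_labels holds the true Risk_involved value ("n" or "r")
    of each submitted trip, in submission order.
    """
    submitted = actual_labels[:k]
    if not submitted:
        return 0.0
    hits = sum(1 for label in submitted if label == positive)
    return 100.0 * hits / len(submitted)
```

For example, if 3 of 4 submitted trips are truly risky, the score is 75.0; a full submission with 260 truly risky trips out of 500 would score 52.0.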
Final Submission Guidelines
Your output should be test.csv with the values for the Risk_involved column along with their probabilities, which will be used to generate the AUC from ROC curves.
Upload the output CSV file to submit for this challenge.
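For checking a model locally, the AUC can be estimated from labels and probabilities on a held-out part of the training data. The sketch below uses the standard rank-based equivalence: the AUC equals the probability that a randomly chosen risky trip receives a higher score than a randomly chosen non-risky trip, with ties counted as half. The function name is hypothetical, and this is not necessarily how the organizers compute it.

```python
def roc_auc(labels, probs, positive="r"):
    """Area under the ROC curve via the rank-sum (Mann-Whitney) formula."""
    pairs = sorted(zip(probs, labels))
    n = len(pairs)
    rank_sum_pos = 0.0
    n_pos = 0
    n_neg = 0
    i = 0
    while i < n:
        # Find the run of tied probabilities starting at index i.
        j = i
        while j < n and pairs[j][0] == pairs[i][0]:
            j += 1
        # Tied items share the average of ranks i+1 .. j.
        avg_rank = (i + 1 + j) / 2.0
        for _, label in pairs[i:j]:
            if label == positive:
                rank_sum_pos += avg_rank
                n_pos += 1
            else:
                n_neg += 1
        i = j
    if n_pos == 0 or n_neg == 0:
        raise ValueError("AUC needs at least one risky and one non-risky trip")
    return (rank_sum_pos - n_pos * (n_pos + 1) / 2.0) / (n_pos * n_neg)
```

With labels `["n", "n", "r", "r"]` and probabilities `[0.1, 0.4, 0.35, 0.8]`, three of the four risky/non-risky pairs are ordered correctly, so the function returns 0.75.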