Fun Series September - Learning Data Science - Salary Prediction

Key Information

The challenge is finished.

Challenge Overview

Project Overview

We have launched a new Fun Challenge series to help our members learn new skills/technologies while getting used to the Topcoder platform. This new contest is about Data Science.

Please note: This is a Fun and Learning Challenge. No prizes will be awarded for completing the challenge.

Abstract: Predict whether income exceeds $50K/yr based on census data, also known as the "Adult" dataset. The dataset contains a salary column whose values are either >$50K or ≤$50K. Some entries in this column are Not Applicable (NA), and the challenge is to predict those NA values.

Source: UCI Machine Learning Repository 

Donor: Ronny Kohavi and Barry Becker, Data Mining and Visualization, Silicon Graphics.

Data Set Information: The file sal_data.csv is the only data file and contains both training and testing records. In 9,048 records the value of the Salary column is NA; these can be used for testing. All remaining records, where the value of the Salary column is already provided, can be used for training.
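The split described above can be sketched in a few lines of Python (the provided baseline is in R; Python is used here only for illustration). This is a minimal sketch, assuming the target column is named `salary` as in the example record and that missing values appear as the literal text `NA` or as an empty cell. The helper name `split_records` and the sample rows are hypothetical stand-ins, not part of the challenge materials.

```python
import csv
import io

def split_records(rows, target="salary", missing_tokens=("NA", "")):
    """Split rows into (train, test) by whether the target value is missing."""
    train, test = [], []
    for row in rows:
        value = (row.get(target) or "").strip()
        if value in missing_tokens:
            test.append(row)   # salary unknown: to be predicted
        else:
            train.append(row)  # salary known: usable for training
    return train, test

# Tiny in-memory stand-in for sal_data.csv (illustrative rows, three columns only).
sample = io.StringIO(
    "age,workclass,salary\n"
    "50, Self-emp-not-inc, <=50K\n"
    "38, Private,NA\n"
    "52, Self-emp-inc, >50K\n"
)
train_rows, test_rows = split_records(csv.DictReader(sample))
print(len(train_rows), "training rows,", len(test_rows), "rows to predict")
```

With the real file, the `io.StringIO` object would be replaced by `open("sal_data.csv", newline="")`.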

The prediction task is to determine whether a person makes more than $50K a year or not.

Attribute Information:

Salary: >$50K, ≤$50K (target variable). 
Age: Continuous. 
Workclass: Private, Self-emp-not-inc, Self-emp-inc, Federal-gov, Local-gov, State-gov, Without-pay, Never-worked. 
Fnlwgt: Continuous. 
Education: Bachelors, Some-college, 11th, HS-grad, Prof-school, Assoc-acdm, Assoc-voc, 9th, 7th-8th, 12th, Masters, 1st-4th, 10th, Doctorate, 5th-6th, Preschool. 
Education-num: Continuous. 
Marital-status: Married-civ-spouse, Divorced, Never-married, Separated, Widowed, Married-spouse-absent, Married-AF-spouse. 
Occupation: Tech-support, Craft-repair, Other-service, Sales, Exec-managerial, Prof-specialty, Handlers-cleaners, Machine-op-inspct, Adm-clerical, Farming-fishing, Transport-moving, Priv-house-serv, Protective-serv, Armed-Forces. 
Relationship: Wife, Own-child, Husband, Not-in-family, Other-relative, Unmarried. 
Race: White, Asian-Pac-Islander, Amer-Indian-Eskimo, Other, Black. 
Sex: Female, Male. 
Capital-gain: Continuous. 
Capital-loss: Continuous. 
Hours-per-week: Continuous. 
Native-country: United-States, Cambodia, China, … (41 in total). 
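Most models need the attributes above as numbers. One common approach, sketched here in Python as an assumption rather than a requirement of the challenge, is to keep the continuous attributes as they are and one-hot encode the categorical ones. The column names follow the dotted spelling used in the example record (`education.num`, `marital.status`, and so on); the function names `build_vocab` and `encode` are hypothetical.

```python
CONTINUOUS = ["age", "fnlwgt", "education.num",
              "capital.gain", "capital.loss", "hours.per.week"]
CATEGORICAL = ["workclass", "education", "marital.status", "occupation",
               "relationship", "race", "sex", "native.country"]

def build_vocab(rows):
    """Collect the sorted set of observed levels for each categorical column."""
    return {col: sorted({row[col].strip() for row in rows}) for col in CATEGORICAL}

def encode(row, vocab):
    """Turn one record into a flat list of numbers (continuous values, then one-hot flags)."""
    features = [float(row[col]) for col in CONTINUOUS]
    for col in CATEGORICAL:
        value = row[col].strip()  # values in the file may carry a leading space
        features.extend(1.0 if value == level else 0.0 for level in vocab[col])
    return features

# The example record from the submission guidelines, as a dictionary.
record = {
    "age": "50", "workclass": " Self-emp-not-inc", "fnlwgt": "83311",
    "education": " Bachelors", "education.num": "13",
    "marital.status": " Married-civ-spouse", "occupation": " Exec-managerial",
    "relationship": " Husband", "race": " White", "sex": " Male",
    "capital.gain": "0", "capital.loss": "0", "hours.per.week": "13",
    "native.country": " United-States",
}
vocab = build_vocab([record])
features = encode(record, vocab)
print(len(features), "features")
```

In practice the vocabulary would be built from all training rows, so each categorical column expands into one flag per level seen in the data.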

Objective: Predict the salary as >$50K or ≤$50K for all records where the salary is listed as NA. Use the rows where the salary is available as your training data.

To give you a head start, we are also providing a baseline in R that uses Random Forest for the same task. You are free to use this code, improve it, or write your own.
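For readers who want to write their own code, the following Python sketch shows a far simpler starting point than the provided baseline. It is not Random Forest and not the R code mentioned above; it is a one-attribute rule that predicts the most common salary class seen in training for each value of a single column, falling back to the overall majority class for unseen values. The function names and the training rows are hypothetical and chosen only to illustrate the mechanics.

```python
from collections import Counter, defaultdict

def fit_one_rule(train_rows, feature, target="salary"):
    """Learn the majority salary class for each value of one attribute."""
    overall = Counter()
    per_value = defaultdict(Counter)
    for row in train_rows:
        label = row[target].strip()
        overall[label] += 1
        per_value[row[feature].strip()][label] += 1
    default = overall.most_common(1)[0][0]  # fallback for unseen values
    table = {value: counts.most_common(1)[0][0]
             for value, counts in per_value.items()}
    return table, default

def predict_one_rule(model, row, feature):
    """Look up the learned class for this row's attribute value."""
    table, default = model
    return table.get(row[feature].strip(), default)

# Illustrative training rows (not taken from sal_data.csv).
train = [
    {"education": " Bachelors", "salary": " <=50K"},
    {"education": " Masters", "salary": " >50K"},
    {"education": " HS-grad", "salary": " <=50K"},
    {"education": " Masters", "salary": " >50K"},
    {"education": " HS-grad", "salary": " <=50K"},
]
model = fit_one_rule(train, "education")
seen = predict_one_rule(model, {"education": " Masters"}, "education")
unseen = predict_one_rule(model, {"education": " Doctorate"}, "education")
print(seen, unseen)
```

A rule like this is mainly useful as a sanity check: any more elaborate model should beat it on held-out training rows.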



Final Submission Guidelines

Output expected: Submit the same sal_data.csv file with all of its columns, including salary. The salary column must contain your predicted value in every row where it was missing in the original sal_data.csv file.
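Producing that file can be sketched in Python as follows. This is a minimal sketch, assuming missing salaries appear as the literal text `NA`; the helpers `fill_predictions` and `write_csv` are hypothetical names, and the constant predictor stands in for whatever model you train. The original column order is preserved by reusing the reader's field names.

```python
import csv
import io

def fill_predictions(rows, predict, target="salary", missing="NA"):
    """Return copies of rows with missing target values replaced by predict(row)."""
    filled = []
    for row in rows:
        row = dict(row)
        if row[target].strip() == missing:
            row[target] = predict(row)
        filled.append(row)
    return filled

def write_csv(rows, fieldnames, handle):
    """Write rows with a header, keeping the given column order."""
    writer = csv.DictWriter(handle, fieldnames=fieldnames)
    writer.writeheader()
    writer.writerows(rows)

# In-memory stand-in for sal_data.csv (illustrative rows, three columns only).
source = io.StringIO(
    "age,workclass,salary\n"
    "50, Self-emp-not-inc, <=50K\n"
    "38, Private,NA\n"
)
reader = csv.DictReader(source)
rows = list(reader)

# Placeholder predictor: always answers "<=50K". Replace with a real model.
filled = fill_predictions(rows, lambda row: "<=50K")

out = io.StringIO()
write_csv(filled, reader.fieldnames, out)
print(out.getvalue())
```

When writing to disk, open the output file with `newline=""` so the `csv` module controls line endings.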

Example record output:

age  workclass         fnlwgt  education  education.num  marital.status      occupation
50   Self-emp-not-inc  83311   Bachelors  13             Married-civ-spouse  Exec-managerial

relationship  race   sex   capital.gain  capital.loss  hours.per.week  native.country
Husband       White  Male  0             0             13              United-States

salary
<=50K

ELIGIBLE EVENTS: 2016 TopCoder(R) Open

REVIEW STYLE:
Final Review: Community Review Board
Approval: User Sign-Off

ID: 30051288