Challenge Overview
Project Overview
We have launched a new Fun Challenge series to help our members learn new skills/technologies while getting used to the Topcoder platform. This new contest is about Data Science.
Please note: This is a Fun and Learning Challenge. No prizes will be awarded for completing the challenge.
Abstract: Predict whether income exceeds $50K/yr based on census data, also known as the "Adult" dataset. The dataset contains a column for salary which has values >$50K or ≤$50K. In this column you will find some Not Applicable (NA) values, and the challenge is to predict the NA values.
Source: UCI Machine Learning Repository
Donor: Ronny Kohavi and Barry Becker, Data Mining and Visualization, Silicon Graphics.
Data Set Information: The file sal_data.csv is the only data file and contains both the training and the testing records. In 9,048 of its records the value of the Salary column is NA; these records can be used for testing. All remaining records, where the value of the Salary column is already provided, can be used for training.
The prediction task is to determine whether a person makes more than $50K a year.
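The train/test split described above can be sketched in a few lines of Python. This is only an illustration: the in-memory sample, the lowercase `salary` column name, and the literal `NA` marker are assumptions taken from the example record in this document, not a confirmed description of the real file.

```python
import csv
import io

# Tiny in-memory stand-in for sal_data.csv; the column names and the literal
# "NA" marker are assumptions based on the example record in this document.
SAMPLE = """age,workclass,education,salary
50,Self-emp-not-inc,Bachelors,<=50K
38,Private,HS-grad,NA
52,Self-emp-inc,Masters,>50K
"""

def split_by_salary(rows, target="salary", missing="NA"):
    """Return (train, test): rows with a known target vs. rows marked missing."""
    train, test = [], []
    for row in rows:
        value = (row.get(target) or "").strip()
        # An empty cell is treated the same way as the "NA" marker.
        (test if value in ("", missing) else train).append(row)
    return train, test

rows = list(csv.DictReader(io.StringIO(SAMPLE)))
train, test = split_by_salary(rows)
print(len(train), "training rows;", len(test), "test rows")
```

With the real file, the `io.StringIO(SAMPLE)` argument would be replaced by an opened `sal_data.csv`.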
Attribute Information:
Salary: >$50K, ≤$50K (target variable).
Age: Continuous.
Workclass: Private, Self-emp-not-inc, Self-emp-inc, Federal-gov, Local-gov, State-gov, Without-pay, Never-worked.
Fnlwgt: Continuous.
Education: Bachelors, Some-college, 11th, HS-grad, Prof-school, Assoc-acdm, Assoc-voc, 9th, 7th-8th, 12th, Masters, 1st-4th, 10th, Doctorate, 5th-6th, Preschool.
Education-num: Continuous.
Marital-status: Married-civ-spouse, Divorced, Never-married, Separated, Widowed, Married-spouse-absent, Married-AF-spouse.
Occupation: Tech-support, Craft-repair, Other-service, Sales, Exec-managerial, Prof-specialty, Handlers-cleaners, Machine-op-inspct, Adm-clerical, Farming-fishing, Transport-moving, Priv-house-serv, Protective-serv, Armed-Forces.
Relationship: Wife, Own-child, Husband, Not-in-family, Other-relative, Unmarried.
Race: White, Asian-Pac-Islander, Amer-Indian-Eskimo, Other, Black.
Sex: Female, Male.
Capital-gain: Continuous.
Capital-loss: Continuous.
Hours-per-week: Continuous.
Native-country: United-States, Cambodia, China, … (41 in total).
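Because a CSV reader returns every field as text, the continuous attributes listed above usually need converting to numbers before modelling. A minimal sketch follows, assuming the column names match the example record header (`age`, `education.num`, and so on); the real header may differ.

```python
# Continuous attributes, named as in the example record header (an assumption
# about the actual CSV header); every other column is treated as categorical.
CONTINUOUS = ("age", "fnlwgt", "education.num",
              "capital.gain", "capital.loss", "hours.per.week")

def parse_record(row):
    """Return a copy of row with continuous attributes converted to int."""
    parsed = dict(row)
    for name in CONTINUOUS:
        if name in parsed:
            parsed[name] = int(parsed[name])
    return parsed

raw = {"age": "50", "workclass": "Self-emp-not-inc", "fnlwgt": "83311",
       "education": "Bachelors", "education.num": "13", "hours.per.week": "13"}
record = parse_record(raw)
print(record["age"] + 1, record["workclass"])
```

Categorical attributes are left as strings here; how to encode them depends on the model you choose.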
Objective: To predict the salary as being >$50K or ≤$50K for every record whose salary is listed as NA. Use the rows where the salary is available as your training data.
To give you a head start, we are also providing a baseline in R that uses Random Forest for the same task. You are free to use this code, improve it, or write your own.
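If you write your own code, it helps to have a trivial reference point and a way to estimate accuracy, since the true labels of the NA rows are not available. The sketch below is not the provided R Random Forest baseline; it is a much simpler majority-class predictor scored on a held-out slice of labelled rows. The rows and the helper names (`majority_label`, `accuracy`) are hypothetical and exist only for illustration.

```python
from collections import Counter

# Hypothetical labelled rows; in practice these would be the records of
# sal_data.csv whose salary value is present.
labelled = [
    {"education.num": "13", "salary": "<=50K"},
    {"education.num": "9", "salary": "<=50K"},
    {"education.num": "14", "salary": ">50K"},
    {"education.num": "10", "salary": "<=50K"},
    {"education.num": "16", "salary": ">50K"},
    {"education.num": "9", "salary": "<=50K"},
]

def majority_label(rows, target="salary"):
    """Return the most frequent target value among the given rows."""
    return Counter(row[target] for row in rows).most_common(1)[0][0]

def accuracy(rows, predicted, target="salary"):
    """Fraction of rows whose target equals the single predicted label."""
    return sum(row[target] == predicted for row in rows) / len(rows)

# Hold out the last third of the labelled rows to estimate accuracy, since
# the true labels of the NA rows are unknown.
cut = len(labelled) * 2 // 3
fit_rows, holdout_rows = labelled[:cut], labelled[cut:]

guess = majority_label(fit_rows)
print("majority label:", guess)
print("holdout accuracy:", accuracy(holdout_rows, guess))
```

Any real model should beat this number on the same held-out rows; if it does not, something is likely wrong with the features or the split.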
Final Submission Guidelines
Output expected: Submit the same sal_data.csv file with all of its columns, including salary. The salary column should contain your predicted values for all rows where it was missing in the original sal_data.csv file.
Example record output:
age | workclass | fnlwgt | education | education.num | marital.status | occupation | relationship | race | sex | capital.gain | capital.loss | hours.per.week | native.country | salary
50 | Self-emp-not-inc | 83311 | Bachelors | 13 | Married-civ-spouse | Exec-managerial | Husband | White | Male | 0 | 0 | 13 | United-States | <=50K
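One way to produce a submission in the expected shape is sketched below: every column is kept, and only salary cells marked as missing are replaced. The sample data, the `NA` marker, the `fill_missing_salary` helper, and the constant placeholder predictor are all assumptions for illustration; a real submission would plug in a trained model.

```python
import csv
import io

# Tiny stand-in for the original sal_data.csv (assumed layout).
SAMPLE = """age,workclass,salary
50,Self-emp-not-inc,<=50K
38,Private,NA
"""

def fill_missing_salary(rows, predict, target="salary", missing="NA"):
    """Copy rows, replacing a missing target with predict(row)."""
    filled = []
    for row in rows:
        row = dict(row)
        if row[target] == missing:
            row[target] = predict(row)
        filled.append(row)
    return filled

reader = csv.DictReader(io.StringIO(SAMPLE))
rows = list(reader)
fieldnames = reader.fieldnames  # preserve the original column order

# Placeholder predictor (always "<=50K"); swap in a real model's prediction.
filled = fill_missing_salary(rows, predict=lambda row: "<=50K")

out = io.StringIO()  # in a real run: open("sal_data.csv", "w", newline="")
writer = csv.DictWriter(out, fieldnames=fieldnames, lineterminator="\n")
writer.writeheader()
writer.writerows(filled)
print(out.getvalue())
```

Rows that already had a salary value pass through unchanged, so the output differs from the input only in the previously missing cells.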