Preparing the data

In this tutorial, we show how to build a model that predicts whether an individual's SALARY is HIGH (>= 50K$) or LOW (< 50K$) from attributes describing that individual (occupation, gender, age, …).

Below is a dataset used to classify people as having a HIGH (>= 50K$) or LOW (< 50K$) salary according to their occupation, gender, age, level of education, weekly working hours and the size of the company they work for.

[table id=3 /]

The target variable that we want our model to predict is SALARY. We call it the dependent variable.
The variables that SALARY may depend on are OCCUPATION, GENDER, AGE, EDUCATION-YEARS, WEEKLY-WORKING-HOURS and COMPANY-SIZE. We call them the independent variables.
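
As an illustration of this split, here is a minimal Python sketch that separates one individual's record into its independent variables and its dependent variable. The function name `split_record` and the dictionary layout are assumptions made for this example; they are not part of any particular library.

```python
# Column names follow the variable names used in the text above.
TARGET = "SALARY"
FEATURES = [
    "OCCUPATION", "GENDER", "AGE",
    "EDUCATION-YEARS", "WEEKLY-WORKING-HOURS", "COMPANY-SIZE",
]


def split_record(record):
    """Return (independent variables, dependent variable) for one individual.

    Hypothetical helper written for this illustration.
    """
    x = {name: record[name] for name in FEATURES}
    y = record[TARGET]
    return x, y


# One individual from the dataset of this tutorial.
example = {
    "OCCUPATION": "Management", "GENDER": "Male", "AGE": 50,
    "EDUCATION-YEARS": 13, "WEEKLY-WORKING-HOURS": 13,
    "COMPANY-SIZE": 125, "SALARY": "LOW",
}

x, y = split_record(example)
print(x)  # the six independent variables
print(y)  # the dependent variable
```

The model is trained to reproduce `y` from `x`, so SALARY must never appear among the inputs.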

This dataset contains variables of different kinds:
– Occupation, Gender and Salary are textual variables
– Age, Education-Years, Weekly-Working-Hours and Company-Size are numerical variables, ranging respectively from 28 to 52, 5 to 14, 13 to 45 and 16 to 125224.

The semantics of the data are as follows:
Concerning the OCCUPATION variable:
– Specialty: the occupation of the individual is a professional specialty (engineers, scientists, physicians, …)
– Management: the individual works at a decision-making level (manager, director, chief executive, administrator, …)
– Manual: the individual works in a physical job (technician, handler, warehouseman, …)
Concerning the EDUCATION-YEARS variable:
– below 9 years: the individual left high school before graduating
– 9 years corresponds to high school graduation
– 13 years corresponds to a Bachelor level
– 14 years corresponds to a Master level
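
The EDUCATION-YEARS semantics above can be written as a small lookup. This is only a sketch: the function name `education_level` is hypothetical, and values the text does not describe (10 to 12 years, or more than 14) deliberately return `None` instead of a guessed level.

```python
def education_level(years):
    """Map EDUCATION-YEARS to the level described in the text.

    Returns None for values the text does not describe,
    rather than guessing a level for them.
    """
    if years < 9:
        return "left high school before graduating"
    levels = {
        9: "high school graduation",
        13: "Bachelor level",
        14: "Master level",
    }
    return levels.get(years)


for years in (5, 9, 13, 14, 11):
    print(years, "->", education_level(years))
```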

The dataset follows in CSV format:

OCCUPATION, GENDER, AGE, EDUCATION_YEARS, WEEKLY_WORKING_HOURS, COMPANY_SIZE, SALARY
Specialty, Male,, 13, 40, 125224, HIGH
Management, Male, 50, 13, 13, 125, LOW
Specialty, Male, 52, 14, 45, 3569, HIGH
Manual, Male, 38, 9, 40, 1950, LOW
Manual, Male, 53, 7,, 32, LOW
Specialty, Female, 31, 14, 50, 16, LOW
Management, Male, 28, 13, 45, 6554, HIGH
Management, Female, 37, 14, 40, 548, LOW
Manual, Female, 49, 5, 16, 2947, LOW
Management, Male,, 13, 40, 3216, HIGH
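
Some cells in the CSV above are empty. A first preparation step is to load the file and locate them. The sketch below uses only Python's standard `csv` module and embeds the data as a string so that it runs on its own; the variable names `CSV_TEXT`, `rows` and `missing` are choices made for this example.

```python
import csv
import io

# The dataset exactly as listed above.
CSV_TEXT = """\
OCCUPATION, GENDER, AGE, EDUCATION_YEARS, WEEKLY_WORKING_HOURS, COMPANY_SIZE, SALARY
Specialty, Male,, 13, 40, 125224, HIGH
Management, Male, 50, 13, 13, 125, LOW
Specialty, Male, 52, 14, 45, 3569, HIGH
Manual, Male, 38, 9, 40, 1950, LOW
Manual, Male, 53, 7,, 32, LOW
Specialty, Female, 31, 14, 50, 16, LOW
Management, Male, 28, 13, 45, 6554, HIGH
Management, Female, 37, 14, 40, 548, LOW
Manual, Female, 49, 5, 16, 2947, LOW
Management, Male,, 13, 40, 3216, HIGH
"""

# skipinitialspace=True drops the blank that follows each comma.
reader = csv.DictReader(io.StringIO(CSV_TEXT), skipinitialspace=True)
rows = list(reader)

# For each column, collect the data rows (numbered from 1) with an empty cell.
missing = {}
for row_number, row in enumerate(rows, start=1):
    for column, value in row.items():
        if not (value or "").strip():
            missing.setdefault(column, []).append(row_number)

print(len(rows), "rows loaded")
for column, row_numbers in missing.items():
    print(column, "is missing in rows", row_numbers)
```

On this dataset, AGE is empty in two rows and WEEKLY_WORKING_HOURS in one; these cells have to be handled before a model can be trained on the data.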