# Preparing the data

In this tutorial, we show how to build a model that predicts whether the SALARY of an individual is HIGH (>= 50K\$) or LOW (< 50K\$) from attributes describing that individual (occupation, gender, age, …).

The dataset below is used to classify people as having a HIGH (>= 50K\$) or LOW (< 50K\$) salary according to their occupation, gender, age, years of education, weekly working hours and the size of the company they work for.

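| OCCUPATION | GENDER | AGE | EDUCATION YEARS | WEEKLY WORKING HOURS | COMPANY SIZE | SALARY |
|---|---|---|---|---|---|---|
| Specialty | Male | | 13 | 40 | 125224 | HIGH |
| Management | Male | 50 | 13 | 13 | 125 | LOW |
| Specialty | Male | 52 | 14 | 45 | 3569 | HIGH |
| Manual | Male | 38 | 9 | 40 | 1950 | LOW |
| Manual | Male | 53 | 7 | | 32 | LOW |
| Specialty | Female | 31 | 14 | 50 | 16 | LOW |
| Management | Male | 28 | 13 | 45 | 6554 | HIGH |
| Management | Female | 37 | 14 | 40 | 548 | LOW |
| Manual | Female | 49 | 5 | 16 | 2947 | LOW |
| Management | Male | | 13 | 40 | 3216 | HIGH |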

The target variable that we want our model to predict is SALARY. We call it the dependent variable.
The variables that SALARY may depend on are OCCUPATION, GENDER, AGE, EDUCATION-YEARS, WEEKLY-WORKING-HOURS and COMPANY-SIZE. We call them the independent variables.
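
The split between the dependent and independent variables can be sketched in a few lines of Python. This is only an illustration: it assumes each individual is stored as a dictionary keyed by the column names used in the CSV version of the dataset, and the helper name `split_record` is hypothetical.

```python
# Column names follow the header of the CSV version of the dataset.
TARGET = "SALARY"  # dependent variable
FEATURES = [       # independent variables
    "OCCUPATION",
    "GENDER",
    "AGE",
    "EDUCATION_YEARS",
    "WEEKLY_WORKING_HOURS",
    "COMPANY_SIZE",
]


def split_record(record):
    """Return (features, target) for one individual (hypothetical helper)."""
    features = {name: record[name] for name in FEATURES}
    return features, record[TARGET]


# Second row of the dataset.
record = {
    "OCCUPATION": "Management",
    "GENDER": "Male",
    "AGE": 50,
    "EDUCATION_YEARS": 13,
    "WEEKLY_WORKING_HOURS": 13,
    "COMPANY_SIZE": 125,
    "SALARY": "LOW",
}
features, target = split_record(record)
print(features)
print(target)
```

The model only ever sees `features` as input; `target` is what it is trained to reproduce.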

This dataset contains variables of different kinds:
– Occupation, Gender and Salary are textual variables
– Age, Education-Years, Weekly-Working-Hours and Company-Size are numerical variables, ranging respectively from 28 to 53, 5 to 14, 13 to 50 and 16 to 125224.
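
One simple way to check this distinction is to test whether every non-empty value of a column parses as an integer. The sketch below assumes the columns are held as lists of raw strings, with an empty string marking a value that is absent from the table; `column_kind` and `column_range` are hypothetical helper names.

```python
# Three columns of the dataset as raw strings; "" marks an absent value.
columns = {
    "OCCUPATION": ["Specialty", "Management", "Specialty", "Manual", "Manual",
                   "Specialty", "Management", "Management", "Manual", "Management"],
    "AGE": ["", "50", "52", "38", "53", "31", "28", "37", "49", ""],
    "EDUCATION_YEARS": ["13", "13", "14", "9", "7", "14", "13", "14", "5", "13"],
}


def column_kind(values):
    """Classify a column as 'numerical' or 'textual', ignoring absent values."""
    present = [v for v in values if v != ""]
    if present and all(v.isdigit() for v in present):
        return "numerical"
    return "textual"


def column_range(values):
    """Return (min, max) of a numerical column, ignoring absent values."""
    numbers = [int(v) for v in values if v != ""]
    return min(numbers), max(numbers)


for name, values in columns.items():
    kind = column_kind(values)
    if kind == "numerical":
        low, high = column_range(values)
        print(f"{name}: {kind}, from {low} to {high}")
    else:
        print(f"{name}: {kind}")
```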

The semantics of the data is as follows:
The OCCUPATION variable:
– Specialty: the individual has a professional specialty (engineer, scientist, physician, …)
– Management: the individual works at a decision-making level (manager, director, chief executive, administrator, …)
– Manual: the individual does a physical job (technician, handler, warehouseman, …)
The EDUCATION-YEARS variable:
– below 9 years: the individual left high school before graduating
– 9 years: high-school graduation
– 13 years: Bachelor's level
– 14 years: Master's level

Here is the dataset in CSV format:
```
OCCUPATION, GENDER, AGE, EDUCATION_YEARS, WEEKLY_WORKING_HOURS, COMPANY_SIZE, SALARY
Specialty, Male,, 13, 40, 125224, HIGH
Management, Male, 50, 13, 13, 125, LOW
Specialty, Male, 52, 14, 45, 3569, HIGH
Manual, Male, 38, 9, 40, 1950, LOW
Manual, Male, 53, 7,, 32, LOW
Specialty, Female, 31, 14, 50, 16, LOW
Management, Male, 28, 13, 45, 6554, HIGH
Management, Female, 37, 14, 40, 548, LOW
Manual, Female, 49, 5, 16, 2947, LOW
Management, Male,, 13, 40, 3216, HIGH
```
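
Data in this form could be read with Python's standard `csv` module, for instance. The sketch below is one possible approach, not a prescribed one. To stay short, it embeds only the header and the first, second and fifth rows as a string. It passes `skipinitialspace=True` to drop the blank that follows each comma, and it assumes that an empty numeric field should be represented as `None`. The helper name `load_rows` is hypothetical.

```python
import csv
import io

# Header plus the first, second and fifth data rows of the dataset.
CSV_TEXT = """\
OCCUPATION, GENDER, AGE, EDUCATION_YEARS, WEEKLY_WORKING_HOURS, COMPANY_SIZE, SALARY
Specialty, Male,, 13, 40, 125224, HIGH
Management, Male, 50, 13, 13, 125, LOW
Manual, Male, 53, 7,, 32, LOW
"""

NUMERIC_COLUMNS = ("AGE", "EDUCATION_YEARS", "WEEKLY_WORKING_HOURS", "COMPANY_SIZE")


def load_rows(text):
    """Parse the CSV text into dictionaries; empty numeric fields become None."""
    reader = csv.DictReader(io.StringIO(text), skipinitialspace=True)
    rows = []
    for row in reader:
        for name in NUMERIC_COLUMNS:
            value = row[name].strip()
            row[name] = int(value) if value else None
        rows.append(row)
    return rows


rows = load_rows(CSV_TEXT)

# Count the empty fields per numeric column.
missing = {name: sum(row[name] is None for row in rows) for name in NUMERIC_COLUMNS}
print(rows[0])
print(missing)
```

Keeping empty fields as `None` rather than `0` leaves them distinguishable from real values, so they can be handled explicitly later.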