A random forest is a combination of tree predictors such that each tree depends on the values of a random vector sampled independently and with the same distribution for all trees in the forest. It is a substantial modification of bagging that builds a large collection of de-correlated trees and then averages them.
On many problems the performance of random forests is very similar to that of boosting, and they are simpler to train and tune. As a consequence, random forests are popular and widely used.
The essential idea of bagging is to average many noisy but approximately unbiased models, and thereby reduce the variance. Trees are ideal candidates for bagging, since they can capture complex interaction structures in the data and, if grown sufficiently deep, have relatively low bias. Since trees are notoriously noisy, they benefit greatly from averaging.
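The variance-reduction argument can be illustrated with a small simulation. This is only a sketch under simplifying assumptions: each "model" is an unbiased prediction with independent Gaussian noise, and the names `noisy_model` and `averaged_model`, the true value 10.0, the noise level and the seed are all invented for the illustration.

```python
import random
import statistics

def noisy_model(rng, truth=10.0, noise_sd=2.0):
    """One unbiased but noisy prediction: the true value plus Gaussian noise."""
    return rng.gauss(truth, noise_sd)

def averaged_model(rng, n_models, truth=10.0, noise_sd=2.0):
    """Average of n_models independent noisy predictions."""
    predictions = [noisy_model(rng, truth, noise_sd) for _ in range(n_models)]
    return statistics.fmean(predictions)

rng = random.Random(0)  # fixed seed so the run is repeatable

# Repeat the experiment many times to estimate the variance of each predictor.
single = [noisy_model(rng) for _ in range(5000)]
bagged = [averaged_model(rng, n_models=25) for _ in range(5000)]

var_single = statistics.variance(single)
var_bagged = statistics.variance(bagged)
print(f"variance of a single model:           {var_single:.3f}")
print(f"variance of the average of 25 models: {var_bagged:.3f}")
```

Both predictors stay centred on the true value, but the average varies far less. In a real forest the trees are fitted to overlapping data and are therefore correlated, so the reduction is smaller than in this idealized independent case; this is the motivation for de-correlating the trees.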
Each tree is constructed using the following algorithm:
- Let T be the number of records and V the number of variables in the classifier.
- Let v be the number of input variables used to determine the decision at a given node; v must be much smaller than V.
- Draw a training set for the tree by sampling T times with replacement from the T records (a bootstrap sample), and use the records left out of the sample as a test set to estimate the error.
- For each node of the tree, randomly choose v variables on which to base the decision. Compute the best split of the training records at that node using only these v variables.
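The two random ingredients of the steps above can be sketched in a few lines of Python. This is a minimal illustration, not a full tree builder. It assumes that each tree's training set is a bootstrap sample (drawn with replacement, as is usual in bagging) with the left-out records kept for error estimation, and that splits are scored with Gini impurity, which the text does not specify. The function names and the toy data are hypothetical.

```python
import random
from collections import Counter

def bootstrap_split(n_records, rng):
    """Sample n_records row indices with replacement; rows never drawn form the test set."""
    train = [rng.randrange(n_records) for _ in range(n_records)]
    drawn = set(train)
    test = [i for i in range(n_records) if i not in drawn]
    return train, test

def gini(labels):
    """Gini impurity of a list of class labels (0.0 means the list is pure)."""
    n = len(labels)
    return 1.0 - sum((count / n) ** 2 for count in Counter(labels).values())

def best_split(X, y, rows, n_candidates, rng):
    """Choose n_candidates variables at random and return the best
    (weighted impurity, variable index, threshold) among them, or None."""
    candidates = rng.sample(range(len(X[0])), n_candidates)
    best = None
    for var in candidates:
        for threshold in sorted({X[i][var] for i in rows}):
            left = [y[i] for i in rows if X[i][var] <= threshold]
            right = [y[i] for i in rows if X[i][var] > threshold]
            if not left or not right:
                continue  # a split must send records to both sides
            score = (len(left) * gini(left) + len(right) * gini(right)) / len(rows)
            if best is None or score < best[0]:
                best = (score, var, threshold)
    return best

# Toy data invented for illustration: three numeric variables, binary label.
X = [[25, 9, 40], [38, 13, 50], [52, 16, 60], [23, 10, 20], [45, 14, 45], [31, 12, 38]]
y = [0, 1, 1, 0, 1, 0]

rng = random.Random(42)
train_rows, test_rows = bootstrap_split(len(X), rng)
print("training rows:", train_rows)
print("held-out rows:", test_rows)
print("best split on 2 random variables:", best_split(X, y, train_rows, 2, rng))
```

A complete tree would apply `best_split` recursively to the left and right subsets, drawing a fresh random set of v variables at every node.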
To make a prediction, a new case is pushed down a tree until it reaches a terminal node, and it is assigned the label of that node. This process is repeated for every tree in the forest, and the most frequent label is returned as the prediction.
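The voting procedure can be sketched as follows. The representation of a tree as nested dictionaries (keys `var`, `threshold`, `left`, `right` and `label`) is an assumption made for this example, as are the three hand-written trees and the class labels "A" and "B".

```python
from collections import Counter

def predict_tree(node, case):
    """Push a case down one tree until a terminal node is reached; return its label."""
    while "label" not in node:
        branch = "left" if case[node["var"]] <= node["threshold"] else "right"
        node = node[branch]
    return node["label"]

def predict_forest(trees, case):
    """Collect one vote per tree and return the most frequent label."""
    votes = Counter(predict_tree(tree, case) for tree in trees)
    return votes.most_common(1)[0][0]

# Three hypothetical one-split trees, each testing a different variable.
tree_a = {"var": 0, "threshold": 35, "left": {"label": "A"}, "right": {"label": "B"}}
tree_b = {"var": 1, "threshold": 12, "left": {"label": "A"}, "right": {"label": "B"}}
tree_c = {"var": 2, "threshold": 55, "left": {"label": "A"}, "right": {"label": "B"}}
forest = [tree_a, tree_b, tree_c]

case = [40, 14, 45]
print([predict_tree(tree, case) for tree in forest])  # individual votes
print(predict_forest(forest, case))                   # majority label
```

In this sketch a tie goes to the label that was encountered first, because `Counter.most_common` keeps equal counts in insertion order; using an odd number of trees avoids ties in two-class problems.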
The advantages of random forests are:
- It is one of the most accurate learning algorithms available. For a sufficiently large data set, it produces a very accurate classifier.
- It runs efficiently on large data sets.
- It can handle hundreds of input variables without excluding any.
- It provides estimates of which variables are important in the classification.
- It has an effective method for estimating missing data and maintaining accuracy when a large proportion of the data is missing.
- It provides information about existing correlation between the variables and relationships between the variables and the classification results.
The disadvantages of random forests are:
- Random forests have been observed to overfit on some data sets with noisy classification or regression tasks.
- Unlike a single decision tree, the classification made by a random forest is difficult for humans to interpret.
- For data that include categorical variables with different numbers of levels, random forests are biased in favor of the attributes with more levels. Therefore, the variable importance scores are not reliable for this type of data. Methods such as partial permutations have been used to address the problem.
- If the data contain groups of correlated attributes of similar relevance to the output, smaller groups are favored over larger groups.
USE CASE : CLASSIFYING BIG SALARIES
In this use case we want to build a model that estimates whether an individual is likely to have a big salary (>50K), based on their age, level of education (in years) and weekly working hours. The following dataset contains 5280 rows, each describing an individual in terms of age, level of education (in years), weekly working hours and whether or not their salary is considered big (>50K). Here are the first 40 rows of the dataset: