Decision tree learning refers to a method based on the use of a decision tree as a predictive model. It is used in particular in data mining and machine learning. It is a supervised learning technique: a set of data for which the value of the target variable is known (so-called labeled data) is used to build the tree, and the resulting model is then applied to the entire test dataset.

In these tree structures, the leaves represent the values of the target variable and the branches correspond to combinations of input variables that lead to these values. In decision analysis, a decision tree can be used to explicitly represent the decisions made and the processes that lead to them. In learning and in data mining, a decision tree describes the data but not the decisions themselves; rather, the tree is used as a starting point for the decision process.

At each inner node (internal, non-terminal node) of the tree, one of the input variables is selected according to a method that depends on the algorithm and will be discussed later. Each edge to a child node corresponds to a set of values of that input variable, such that the edges to the child nodes together cover all its possible values.

Each leaf (or terminal node of the tree) represents either a value of the target variable or a probability distribution over the possible values of the target variable. Each path from the root to a leaf represents a combination of values of the input variables.
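As a sketch, such a tree can be written as nested tests, where each root-to-leaf path fixes a combination of input values; the variables `age` and `hours_per_week` below are illustrative, not taken from a real dataset:

```python
# A hypothetical decision tree for a binary target, written as nested tests.
# Each inner node tests one input variable; each leaf holds a target value.
def predict(age, hours_per_week):
    if age <= 30:                  # inner node testing the input 'age'
        return "low"               # leaf: a value of the target variable
    else:
        if hours_per_week <= 40:   # inner node testing 'hours_per_week'
            return "low"
        return "high"              # leaf for the path age > 30, hours > 40

print(predict(25, 50))  # low
print(predict(45, 50))  # high
```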

The tree is generally constructed by splitting the data into subsets according to the value of a particular input variable. This process is repeated recursively on each resulting subset; the technique is therefore known as recursive partitioning.

Recursion stops at a node either when all elements of its subset have the same value of the target variable, or when splitting no longer improves the prediction. This process, called top-down induction of decision trees (TDIDT), is a greedy algorithm: at each node of the tree it looks for a locally good split, in the hope of obtaining the best possible tree overall. It is the most common strategy for learning decision trees from data.
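The TDIDT process described above can be sketched as follows. This is a minimal illustration using entropy-based information gain as the split criterion; the function names and toy data are illustrative, not a production implementation:

```python
# A minimal sketch of top-down induction of decision trees (TDIDT) with an
# entropy-based split criterion. Data and names are illustrative.
from collections import Counter
from math import log2

def entropy(labels):
    n = len(labels)
    return -sum(c / n * log2(c / n) for c in Counter(labels).values())

def best_split(rows, labels):
    # Greedy step: choose the (feature, value) test with the highest
    # information gain on this node's subset of the data.
    base = entropy(labels)
    best_gain, best_f, best_v = 0.0, None, None
    for f in range(len(rows[0])):
        for v in {r[f] for r in rows}:
            left = [y for r, y in zip(rows, labels) if r[f] == v]
            right = [y for r, y in zip(rows, labels) if r[f] != v]
            if not left or not right:
                continue
            gain = base - (len(left) * entropy(left)
                           + len(right) * entropy(right)) / len(labels)
            if gain > best_gain:
                best_gain, best_f, best_v = gain, f, v
    return best_f, best_v

def tdidt(rows, labels):
    # Stop when the subset is pure or when no split improves the prediction.
    if len(set(labels)) == 1:
        return labels[0]
    f, v = best_split(rows, labels)
    if f is None:
        return Counter(labels).most_common(1)[0][0]
    eq = [(r, y) for r, y in zip(rows, labels) if r[f] == v]
    ne = [(r, y) for r, y in zip(rows, labels) if r[f] != v]
    return (f, v,
            tdidt([r for r, _ in eq], [y for _, y in eq]),
            tdidt([r for r, _ in ne], [y for _, y in ne]))

def predict(tree, row):
    # Inner nodes are tuples; anything else is a leaf (a target value).
    while isinstance(tree, tuple):
        f, v, eq_branch, ne_branch = tree
        tree = eq_branch if row[f] == v else ne_branch
    return tree

# Toy data: (outlook, windy) -> play
rows = [("sunny", "yes"), ("sunny", "no"), ("rain", "yes"), ("rain", "no")]
labels = ["yes", "yes", "no", "yes"]
tree = tdidt(rows, labels)
print(predict(tree, ("rain", "yes")))  # no
```

Note that each split is chosen only from the node's local subset of the data, which is exactly what makes the procedure greedy.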

In data mining, decision trees can help to describe, categorize, or generalize a fixed dataset.

Compared to other data mining methods, decision trees have several advantages:

– Simplicity of understanding and interpretation: the behavior of the model in a given situation can easily be explained using boolean logic, unlike models such as neural networks, whose results are difficult to explain.

– Little data preparation is required: no normalization, no removal of empty values, no creation of dummy variables.

– The model can handle both numerical and categorical variables. Other techniques are often specialized for a certain type of variable (neural networks, for instance, can only be used on numerical variables).

– It is possible to validate a model using statistical tests, and thus to assess the reliability of the model.

– Good performance on large datasets: the method is relatively efficient in terms of computing resources.

On the other hand, decision trees have certain disadvantages:

– Learning an optimal decision tree is NP-complete under several aspects of optimality. Consequently, decision tree learning algorithms rely on heuristics such as greedy algorithms that optimize the split at each node, and such algorithms do not guarantee a global optimum.

– Decision tree learning can produce very complex trees that generalize poorly from the training set (the overfitting problem mentioned previously). Pruning procedures are used to mitigate this problem.

– Some concepts, such as XOR or parity, are difficult to express using decision trees; in these cases the trees become extremely large. Several workarounds exist, such as changing the representation of the problem (propositionalisation) or using learning algorithms with more expressive representations.

– When the data include categorical attributes with many levels, the information gain is biased in favor of those attributes. This problem can be circumvented by methods such as conditional inference.
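Both of the last two limitations can be illustrated with the information gain criterion itself. On XOR-labelled data, no split on a single variable reduces entropy, so a greedy criterion has nothing to prefer at the root; conversely, an identifier-like attribute with one distinct value per row achieves maximal gain even though it cannot generalize. A sketch with illustrative toy data:

```python
# Illustrative sketch of two pathologies of information gain.
from collections import Counter
from math import log2

def entropy(labels):
    n = len(labels)
    return -sum(c / n * log2(c / n) for c in Counter(labels).values())

def multiway_gain(values, labels):
    # Gain of a multiway split: one child node per distinct attribute value.
    n = len(labels)
    children = {}
    for v, y in zip(values, labels):
        children.setdefault(v, []).append(y)
    return entropy(labels) - sum(len(c) / n * entropy(c)
                                 for c in children.values())

# 1) XOR: splitting on either input alone yields zero information gain,
#    because each child still contains both classes in equal proportion.
rows = [(0, 0), (0, 1), (1, 0), (1, 1)]
xor_labels = [0, 1, 1, 0]
gains = [multiway_gain([r[f] for r in rows], xor_labels) for f in (0, 1)]
print(gains)  # [0.0, 0.0]

# 2) Many-level bias: an attribute unique per row yields pure (empty-entropy)
#    children, hence the maximal gain 1.0, despite being useless.
labels = ["yes", "no", "yes", "no", "yes", "no"]
two_level = ["a", "a", "a", "b", "b", "b"]
id_like = ["r1", "r2", "r3", "r4", "r5", "r6"]
print(multiway_gain(two_level, labels))  # ~0.082
print(multiway_gain(id_like, labels))    # 1.0
```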

__USE CASE: CLASSIFYING BIG SALARIES__

In this use case we want to build a model that estimates whether an individual is likely to have a big salary (>50K) based on their age, level of education (in years), and weekly working hours. The dataset contains 5280 rows, each describing an individual in terms of age, level of education (in years), weekly working hours, and whether or not their salary is considered big (>50K). The first 40 rows of the dataset follow: