Tuesday, June 27, 2017
The input data for a classification task
The input data for a classification task is a collection of records. Each record, also known as an instance or example, is characterized by a tuple (x, y), where x is the attribute set and y is a special attribute, designated as the class label (also known as category or target attribute).
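As a toy illustration (the attribute names and values below are my own, not from the post), a single record can be written in Python as a pair of an attribute set x and a class label y:

```python
# One record (instance/example) as an (x, y) tuple:
# x is the attribute set, y is the class label. Values are hypothetical.
record = (
    {"refund": "No", "marital_status": "Married", "taxable_income": 90000},  # x: attribute set
    "No",                                                                    # y: class label
)
x, y = record
print(x["taxable_income"], y)  # -> 90000 No
```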
Monday, June 26, 2017
Characteristics of Decision Tree Induction
- Decision tree induction is a nonparametric approach for building classification models. In other words, it does not require any prior assumptions regarding the type of probability distributions satisfied by the class and other attributes.
- Finding an optimal decision tree is an NP-complete problem. Many decision tree algorithms employ a heuristic-based approach to guide their search in the vast hypothesis space.
- Techniques developed for constructing decision trees are computationally inexpensive, making it possible to quickly construct models even when the training set size is very large. Furthermore, once a decision tree has been built, classifying a test record is extremely fast, with a worst-case complexity of O(w), where w is the maximum depth of the tree (a short traversal sketch follows this list).
- Decision trees, especially smaller-sized trees, are relatively easy to interpret. Their accuracy is also comparable to that of other classification techniques for many simple data sets.
- Decision trees provide an expressive representation for learning discrete-valued functions. However, they do not generalize well to certain types of Boolean problems. One notable example is the parity function, whose value is 0 (1) when there is an odd (even) number of Boolean attributes with the value True. Accurate modeling of such a function requires a full decision tree with 2^d nodes, where d is the number of Boolean attributes (a small demonstration follows this list).
- Decision tree algorithms are quite robust to the presence of noise, especially when methods for avoiding overfitting are employed.
- The presence of redundant attributes does not adversely affect the accuracy of decision trees. An attribute is redundant if it is strongly correlated with another attribute in the data. One of the two redundant attributes will not be used for splitting once the other attribute has been chosen. However, if the data set contains many irrelevant attributes, i.e., attributes that are not useful for the classification task, then some of the irrelevant attributes may be accidentally chosen during the tree-growing process, which results in a decision tree that is larger than necessary. Feature selection techniques can help to improve the accuracy of decision trees by eliminating the irrelevant attributes during preprocessing.
- Since most decision tree algorithms employ a top-down, recursive partitioning approach, the number of records becomes smaller as we traverse down the tree. At the leaf nodes, the number of records may be too small to make a statistically significant decision about the class representation of the nodes. This is known as the data fragmentation problem. One possible solution is to disallow further splitting when the number of records falls below a certain threshold (see the scikit-learn example after this list).
- A subtree can be replicated multiple times in a decision tree. This makes the decision tree more complex than necessary and perhaps more difficult to interpret. Such a situation can arise from decision tree implementations that rely on a single attribute test condition at each internal node. Since most of the decision tree algorithms use a divide-and-conquer partitioning strategy, the same test condition can be applied to different parts of the attribute space, thus leading to the subtree replication problem.
- The test conditions described so far involve using only a single attribute at a time. As a consequence, the tree-growing procedure can be viewed as the process of partitioning the attribute space into disjoint regions until each region contains records of the same class. The border between two neighboring regions of different classes is known as a decision boundary. Since the test condition involves only a single attribute, the decision boundaries are rectilinear; i.e., parallel to the “coordinate axes.” This limits the expressiveness of the decision tree representation for modeling complex relationships among continuous attributes.
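To make the O(w) classification cost concrete, here is a minimal sketch (mine, not from the post) of how a fitted tree classifies one record: the test record follows a single root-to-leaf path, so the cost is bounded by the tree depth w.

```python
# Minimal sketch of classifying a record with a fitted decision tree.
# The record follows one root-to-leaf path, so the cost is O(w),
# where w is the maximum depth of the tree.
class Node:
    def __init__(self, attribute=None, threshold=None, left=None, right=None, label=None):
        self.attribute = attribute  # index of the attribute tested at this node
        self.threshold = threshold  # split point for a continuous attribute
        self.left = left            # subtree for attribute value <= threshold
        self.right = right          # subtree for attribute value > threshold
        self.label = label          # class label if this node is a leaf

def classify(node, record):
    while node.label is None:       # walk down until a leaf is reached
        node = node.left if record[node.attribute] <= node.threshold else node.right
    return node.label

# Hypothetical two-level tree: split on attribute 0 at threshold 50.
root = Node(attribute=0, threshold=50, left=Node(label="No"), right=Node(label="Yes"))
print(classify(root, [42]))  # -> "No"
```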
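The parity example can also be checked empirically. The sketch below (my addition, assuming scikit-learn is available) labels every d-bit combination with its parity and fits an unpruned tree; since no region that fixes fewer than d bits is pure, the tree should end up with one leaf per input combination, i.e., 2^d leaves.

```python
# Fit an unpruned decision tree on the d-bit parity function and count its leaves.
from itertools import product
import numpy as np
from sklearn.tree import DecisionTreeClassifier

d = 4
X = np.array(list(product([0, 1], repeat=d)))  # all 2^d Boolean combinations
y = X.sum(axis=1) % 2                          # parity label of each combination

tree = DecisionTreeClassifier().fit(X, y)
print(tree.get_n_leaves(), 2 ** d)             # expect 2^d = 16 leaves
```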
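For the data fragmentation point, a threshold-based stopping rule is what parameters such as min_samples_split and min_samples_leaf implement in scikit-learn; the example below is a sketch under that assumption, not the post's own code.

```python
# Disallow further splitting when a node holds too few records.
from sklearn.datasets import load_iris
from sklearn.tree import DecisionTreeClassifier

X, y = load_iris(return_X_y=True)
tree = DecisionTreeClassifier(
    min_samples_split=20,  # do not split nodes with fewer than 20 records
    min_samples_leaf=10,   # never create a leaf with fewer than 10 records
).fit(X, y)
print(tree.get_n_leaves(), tree.get_depth())
```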
Thursday, June 22, 2017
The Success of Decision Trees
The success of decision trees is explained by several factors that make them quite attractive in practice:
- Decision trees are non-parametric. They can model arbitrarily complex relations between inputs and outputs, without any a priori assumption;
- Decision trees handle heterogeneous data (ordered or categorical variables, or a mix of both);
- Decision trees intrinsically implement feature selection, making them robust to irrelevant or noisy variables (at least to some extent);
- Decision trees are robust to outliers or errors in labels;
- Decision trees are easily interpretable, even for non-statistically oriented users.
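As a quick illustration of the last point (my example, assuming scikit-learn is available), a small fitted tree can be printed as nested if/else rules that a non-statistician can read:

```python
# Print a shallow decision tree as human-readable rules.
from sklearn.datasets import load_iris
from sklearn.tree import DecisionTreeClassifier, export_text

iris = load_iris()
tree = DecisionTreeClassifier(max_depth=2).fit(iris.data, iris.target)
print(export_text(tree, feature_names=list(iris.feature_names)))
```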
Wednesday, June 21, 2017
A Gentle Introduction to Random Forests
The random forest (Breiman, 2001) is an ensemble approach that can also be thought of as a form of nearest neighbor predictor. Ensembles are a divide-and-conquer approach used to improve performance. The main principle behind ensemble methods is that a group of “weak learners” can come together to form a “strong learner”. Each classifier, individually, is a “weak learner,” while all the classifiers taken together form a “strong learner.”
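A minimal sketch of this idea (my own illustration, not Breiman's code): train many trees, each on a bootstrap sample with a random subset of candidate features at every split, then combine their votes.

```python
# A tiny hand-rolled forest: bootstrap samples + randomized trees + majority vote.
import numpy as np
from sklearn.datasets import load_iris
from sklearn.tree import DecisionTreeClassifier

rng = np.random.default_rng(0)
X, y = load_iris(return_X_y=True)

trees = []
for _ in range(25):                                    # 25 "weak learners"
    idx = rng.integers(0, len(X), size=len(X))         # bootstrap sample (with replacement)
    clf = DecisionTreeClassifier(max_features="sqrt")  # random feature subset at each split
    trees.append(clf.fit(X[idx], y[idx]))

votes = np.array([t.predict(X) for t in trees])        # each tree votes on every record
majority = np.apply_along_axis(lambda v: np.bincount(v).argmax(), 0, votes)
print((majority == y).mean())                          # accuracy of the combined "strong learner"
```

In practice the same construction is available directly as sklearn.ensemble.RandomForestClassifier.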