DECISION TREES
Decision trees are tree-shaped data structures whose internal nodes and leaves represent operations on the dataset, used to classify data
- leaf → put the data in class c (the class of the leaf node)
- node → split the dataset and pass each resulting subset on to a child node
SPLITTING DATASET METHODS
INFORMATION THEORY BASED METHOD
based on the concept of entropy
- higher entropy → the attribute values have roughly the same probability
- lower entropy → the attribute values have very different probabilities (one value dominates)
the (weighted average) entropy of the child nodes is never higher than the entropy of the parent node
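For reference, a standard way to write the entropy of a dataset S, with p_c the fraction of elements belonging to class c:
$H(S) = -\sum_{c} p_c \log_2 p_c$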
INFORMATION GAIN
the difference between the entropy of the dataset and the weighted entropy of the subsets obtained from a threshold-based split
the objective is to maximize information gain in order to obtain the best split
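In formula form (standard definition), where the split partitions S into subsets S_1 and S_2:
$IG(S) = H(S) - \frac{|S_1|}{|S|} H(S_1) - \frac{|S_2|}{|S|} H(S_2)$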
CHOOSING THE RIGHT ATTRIBUTE FOR THE SPLIT
the attribute with the maximum IG is chosen
BUILD DECISION TREE WITH BINARY SPLITS
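A minimal sketch of the procedure, assuming numeric attributes, binary threshold splits, and a simple list-of-rows dataset; the helper names entropy, best_split and build_tree are illustrative, not from the notes:

import math
from collections import Counter

def entropy(labels):
    # H(S) = -sum_c p_c log2 p_c over the class proportions in `labels`
    n = len(labels)
    return -sum((k / n) * math.log2(k / n) for k in Counter(labels).values())

def best_split(rows, labels):
    # return (gain, attribute index, threshold) with maximum information gain
    base, n, best = entropy(labels), len(labels), None
    for a in range(len(rows[0])):
        for t in sorted({r[a] for r in rows}):
            left = [y for r, y in zip(rows, labels) if r[a] <= t]
            right = [y for r, y in zip(rows, labels) if r[a] > t]
            if not left or not right:
                continue
            gain = base - len(left) / n * entropy(left) - len(right) / n * entropy(right)
            if best is None or gain > best[0]:
                best = (gain, a, t)
    return best

def build_tree(rows, labels, depth=0, max_depth=3):
    # stop when the node is pure, no useful split exists, or max_depth is reached
    split = None if len(set(labels)) == 1 or depth == max_depth else best_split(rows, labels)
    if split is None or split[0] <= 0:
        return {"class": Counter(labels).most_common(1)[0][0]}  # leaf: majority class
    _, a, t = split
    left = [(r, y) for r, y in zip(rows, labels) if r[a] <= t]
    right = [(r, y) for r, y in zip(rows, labels) if r[a] > t]
    return {"attr": a, "threshold": t,
            "left": build_tree([r for r, _ in left], [y for _, y in left], depth + 1, max_depth),
            "right": build_tree([r for r, _ in right], [y for _, y in right], depth + 1, max_depth)}

Calling build_tree(rows, labels) returns a nested dict: internal nodes carry the chosen attribute and threshold, leaves carry the majority class of the subset that reached them.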
TRAINING SET ERROR
the number of disagreements between the predictions on the training set and the actual class labels
it is an optimistic estimate: a lower bound for the error rate of the predictions on unseen data
TEST SET ERROR
computed in the same way as the training set error, but on the test set
the value is more indicative of the expected behavior on unseen data
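Both errors can be written as the fraction of wrong predictions over the N examples of the respective set:
$\text{err} = \frac{1}{N}\sum_{i=1}^{N} \mathbb{1}[\hat{y}_i \neq y_i]$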
OVERFITTING
it happens when the hypothesis fits the training data too well
possible causes of overfitting are noise in the data or a training set that is not representative of the whole dataset.
One of the possible solutions to overfitting is decision tree pruning
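The notes do not name a specific pruning method; the sketch below shows reduced-error pruning, one common strategy, applied to the dict-based tree produced by build_tree above and a held-out validation set (the names predict, error and prune are illustrative):

from collections import Counter

def predict(node, row):
    # follow threshold tests until a leaf is reached
    while "class" not in node:
        node = node["left"] if row[node["attr"]] <= node["threshold"] else node["right"]
    return node["class"]

def error(node, rows, labels):
    # fraction of wrong predictions on the given (validation) data
    return sum(predict(node, r) != y for r, y in zip(rows, labels)) / len(labels)

def prune(node, rows, labels):
    # bottom-up: replace a subtree with a majority-class leaf when this does not
    # increase the error on the validation data that reaches the subtree
    if "class" in node or not labels:
        return node
    left = [(r, y) for r, y in zip(rows, labels) if r[node["attr"]] <= node["threshold"]]
    right = [(r, y) for r, y in zip(rows, labels) if r[node["attr"]] > node["threshold"]]
    node["left"] = prune(node["left"], [r for r, _ in left], [y for _, y in left])
    node["right"] = prune(node["right"], [r for r, _ in right], [y for _, y in right])
    leaf = {"class": Counter(labels).most_common(1)[0][0]}
    return leaf if error(leaf, rows, labels) <= error(node, rows, labels) else node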
CHOOSING ATTRIBUTE FOR SPLIT WITH THE HIGHEST PURITY
GINI INDEX
based on the probability of misclassifying a random element when labels are assigned at random according to the class frequencies
it is the sum, over the classes, of the probabilities of a wrong classification
the split is chosen by maximizing the Gini index reduction (i.e. minimizing the weighted Gini index of the resulting subsets)
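In formula form (standard definition), with p_c the frequency of class c in the node:
$Gini(S) = \sum_{c} p_c (1 - p_c) = 1 - \sum_{c} p_c^2$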
MISCLASSIFICATION ERROR
it is the complement to 1 of the frequency of the most common label (which is the accuracy at that node); the split is chosen in the same way as with the GINI INDEX
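In formula form, with p_c the frequency of class c in the node:
$Err(S) = 1 - \max_{c} p_c$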