DECISION TREES
Decision trees are tree-shaped data structures whose internal nodes and leaves represent operations on the dataset, used to classify data
- leaf → put the data in class c (the class of the leaf node)
- node → split the dataset and pass each resulting subset on to a child node
SPLITTING DATASET METHODS
INFORMATION THEORY BASED METHOD
based on the concept of entropy
- higher entropy → the attribute values have roughly the same probability
- lower entropy → the attribute values have very different probabilities (one value dominates)
the (weighted average) entropy of the child nodes is never higher than the entropy of the parent node
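For reference, a standard way to write the entropy of a dataset S, with p_c the fraction of elements belonging to class c:
$H(S) = -\sum_{c} p_c \log_2 p_c$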
INFORMATION GAIN
the difference between the entropy of the dataset and the weighted entropy of the subsets obtained from a threshold-based split
the objective is to maximize information gain in order to obtain the best split
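In formula form (standard definition), where the split partitions S into subsets S_1 and S_2:
$IG(S) = H(S) - \frac{|S_1|}{|S|} H(S_1) - \frac{|S_2|}{|S|} H(S_2)$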
CHOOSING THE RIGHT ATTRIBUTE FOR THE SPLIT
the attribute with the maximum IG is chosen
BUILD DECISION TREE WITH BINARY SPLITS
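A minimal sketch of the procedure, assuming numeric attributes, binary threshold splits, and a simple list-of-rows dataset; the helper names entropy, best_split and build_tree are illustrative, not from the notes:

import math
from collections import Counter

def entropy(labels):
    # H(S) = -sum_c p_c log2 p_c over the class proportions in `labels`
    n = len(labels)
    return -sum((k / n) * math.log2(k / n) for k in Counter(labels).values())

def best_split(rows, labels):
    # return (gain, attribute index, threshold) with maximum information gain
    base, n, best = entropy(labels), len(labels), None
    for a in range(len(rows[0])):
        for t in sorted({r[a] for r in rows}):
            left = [y for r, y in zip(rows, labels) if r[a] <= t]
            right = [y for r, y in zip(rows, labels) if r[a] > t]
            if not left or not right:
                continue
            gain = base - len(left) / n * entropy(left) - len(right) / n * entropy(right)
            if best is None or gain > best[0]:
                best = (gain, a, t)
    return best

def build_tree(rows, labels, depth=0, max_depth=3):
    # stop when the node is pure, no useful split exists, or max_depth is reached
    split = None if len(set(labels)) == 1 or depth == max_depth else best_split(rows, labels)
    if split is None or split[0] <= 0:
        return {"class": Counter(labels).most_common(1)[0][0]}  # leaf: majority class
    _, a, t = split
    left = [(r, y) for r, y in zip(rows, labels) if r[a] <= t]
    right = [(r, y) for r, y in zip(rows, labels) if r[a] > t]
    return {"attr": a, "threshold": t,
            "left": build_tree([r for r, _ in left], [y for _, y in left], depth + 1, max_depth),
            "right": build_tree([r for r, _ in right], [y for _, y in right], depth + 1, max_depth)}

Calling build_tree(rows, labels) returns a nested dict: internal nodes carry the chosen attribute and threshold, leaves carry the majority class of the subset that reached them.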
TRAINING SET ERROR
the number of disagreements between the predictions on the training set and the actual class labels
it is an optimistic estimate: a lower bound for the error rate of the predictions on unseen data
TEST SET ERROR
computed in the same way as the training set error, but on the test set
the value is more indicative of the expected behavior on unseen data
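Both errors can be written as the fraction of wrong predictions over the N examples of the respective set:
$\text{err} = \frac{1}{N}\sum_{i=1}^{N} \mathbb{1}[\hat{y}_i \neq y_i]$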
OVERFITTING
it happens when the hypothesis fits the training data too well
possible causes of overfitting are noise in the data or a training set that is not representative of the whole dataset.
One of the possible solutions to overfitting is decision tree pruning
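The notes do not name a specific pruning method; the sketch below shows reduced-error pruning, one common strategy, applied to the dict-based tree produced by build_tree above and a held-out validation set (the names predict, error and prune are illustrative):

from collections import Counter

def predict(node, row):
    # follow threshold tests until a leaf is reached
    while "class" not in node:
        node = node["left"] if row[node["attr"]] <= node["threshold"] else node["right"]
    return node["class"]

def error(node, rows, labels):
    # fraction of wrong predictions on the given (validation) data
    return sum(predict(node, r) != y for r, y in zip(rows, labels)) / len(labels)

def prune(node, rows, labels):
    # bottom-up: replace a subtree with a majority-class leaf when this does not
    # increase the error on the validation data that reaches the subtree
    if "class" in node or not labels:
        return node
    left = [(r, y) for r, y in zip(rows, labels) if r[node["attr"]] <= node["threshold"]]
    right = [(r, y) for r, y in zip(rows, labels) if r[node["attr"]] > node["threshold"]]
    node["left"] = prune(node["left"], [r for r, _ in left], [y for _, y in left])
    node["right"] = prune(node["right"], [r for r, _ in right], [y for _, y in right])
    leaf = {"class": Counter(labels).most_common(1)[0][0]}
    return leaf if error(leaf, rows, labels) <= error(node, rows, labels) else node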
CHOOSING ATTRIBUTE FOR SPLIT WITH THE HIGHEST PURITY
GINI INDEX
based on the probability of misclassifying a random element when labels are assigned at random according to the class frequencies
it is the sum, over the classes, of the probabilities of a wrong classification
the split is chosen by maximizing the Gini index reduction (i.e. minimizing the weighted Gini index of the resulting subsets)
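In formula form (standard definition), with p_c the frequency of class c in the node:
$Gini(S) = \sum_{c} p_c (1 - p_c) = 1 - \sum_{c} p_c^2$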
MISCLASSIFICATION ERROR
it is the complement to 1 of the frequency of the most common label (which is the accuracy at that node); the split is chosen in the same way as with the GINI INDEX
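In formula form, with p_c the frequency of class c in the node:
$Err(S) = 1 - \max_{c} p_c$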