MODEL SELECTION

To select the best model for a given problem, the following must be identified:

  • best classification model
  • best classification algorithm
  • best parameter configuration

EVALUATION

Evaluation is the process of identifying the best classification model. It is independent of the algorithm used to build the model.
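A minimal sketch of this idea, assuming each candidate model is just a callable that maps an input to a predicted label. The names `error_rate`, `select_best`, and the toy threshold classifiers are hypothetical and only serve the illustration. The evaluation code never looks at how a model was built, only at its predictions.

```python
def error_rate(model, samples):
    """Fraction of (x, label) pairs that the model misclassifies."""
    errors = sum(1 for x, label in samples if model(x) != label)
    return errors / len(samples)


def select_best(models, evaluation_set):
    """Return the name of the candidate with the lowest evaluation error."""
    return min(models, key=lambda name: error_rate(models[name], evaluation_set))


# Hypothetical candidates: two threshold classifiers on a single number.
candidates = {
    "threshold_3": lambda x: x > 3,
    "threshold_5": lambda x: x > 5,
}

# Toy labeled evaluation data (illustrative values only).
evaluation_set = [(1, False), (2, False), (4, False), (6, True), (7, True)]

best = select_best(candidates, evaluation_set)
```

Here `threshold_3` misclassifies the sample `(4, False)`, so `threshold_5` is selected.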

DATASET IN EVALUATION

Supervised (labeled) data are usually scarce, so the available dataset must be split into three parts:

  • train
  • evaluation
  • test
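The three-way split listed above can be sketched as follows. The function name `split_dataset` and the 60/20/20 proportions are assumptions chosen for illustration; the notes do not prescribe specific fractions.

```python
import random


def split_dataset(samples, train_frac=0.6, eval_frac=0.2, seed=0):
    """Shuffle the samples and split them into train, evaluation and test lists.

    The test set receives whatever remains after the first two parts.
    """
    if train_frac <= 0 or eval_frac <= 0 or train_frac + eval_frac >= 1:
        raise ValueError("fractions must be positive and sum to less than 1")
    shuffled = list(samples)
    random.Random(seed).shuffle(shuffled)  # fixed seed for reproducibility
    n_train = round(len(shuffled) * train_frac)
    n_eval = round(len(shuffled) * eval_frac)
    train = shuffled[:n_train]
    evaluation = shuffled[n_train:n_train + n_eval]
    test = shuffled[n_train + n_eval:]
    return train, evaluation, test


train, evaluation, test = split_dataset(range(100))
```

Shuffling before splitting avoids carrying any ordering of the original data into the three parts, and the parts never overlap.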

TEST SET ERROR AND RUN TIME RELATIONS

The relation between the training dataset and real data is subject to probabilistic variability, so the predicted run-time error is the test set error rate plus a confidence interval.

CONFIDENCE INTERVAL

The empirical error frequency, measured as the number of test errors over the test set size, differs from the true error frequency by noise that can be modeled with a normal distribution (provided the test set is large enough).
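The relation can be written out under the usual normal approximation to the binomial distribution. The symbols are assumptions introduced here and do not come from the notes: f is the empirical error frequency, p the true error frequency, N the test set size, c the confidence level, and z the matching quantile of the standard normal distribution.

```latex
\[
\frac{f - p}{\sqrt{p(1-p)/N}} \;\sim\; \mathcal{N}(0, 1)
\quad \text{(approximately, for large } N\text{)}
\]
\[
P\!\left(-z \le \frac{f - p}{\sqrt{p(1-p)/N}} \le z\right) = c
\]
```

Solving the inequality inside the probability for p gives the lower and upper ends of the interval around f.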

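The confidence interval is the range around the empirical error frequency that contains the true error frequency with a chosen probability, the confidence level. Its upper end is the pessimistic estimate of the error frequency.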

The boundaries of the interval on the normal curve depend on the desired confidence level: a higher confidence level gives a wider interval.
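As an illustration, the interval can be computed with the Wilson score formula, which is one common choice based on the normal approximation. The notes do not state which formula they use, so this is a sketch under that assumption. The function name `error_confidence_interval` and the example counts are hypothetical.

```python
from math import sqrt
from statistics import NormalDist


def error_confidence_interval(errors, n, confidence=0.95):
    """Wilson score interval for the true error frequency.

    errors: number of misclassified test samples
    n: test set size
    Returns (optimistic, pessimistic) bounds on the true error frequency.
    """
    if n <= 0 or not 0 <= errors <= n:
        raise ValueError("need n > 0 and 0 <= errors <= n")
    if not 0 < confidence < 1:
        raise ValueError("confidence must be strictly between 0 and 1")
    f = errors / n  # empirical error frequency
    # Two-sided quantile of the standard normal for the chosen confidence.
    z = NormalDist().inv_cdf(1 - (1 - confidence) / 2)
    center = f + z * z / (2 * n)
    spread = z * sqrt(f * (1 - f) / n + z * z / (4 * n * n))
    denom = 1 + z * z / n
    return (center - spread) / denom, (center + spread) / denom


# Illustrative numbers: 25 errors on a test set of 100 samples.
low_95, high_95 = error_confidence_interval(25, 100, confidence=0.95)
low_99, high_99 = error_confidence_interval(25, 100, confidence=0.99)
```

With these inputs the 99% interval contains the 95% interval, which matches the statement that the boundaries move outward as the confidence level grows.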
