SAMPLING

the process of reducing the size of a dataset by working on a sample of it, the reasons are:

  • the full dataset is too large to process with the available computing resources
  • the full dataset is impossible to obtain

a sample is useful only if it is representative, i.e. it has approximately the same properties of interest as the full dataset

TYPES

  • SIMPLE RANDOM

    random choice of an object according to a given probability distribution

  • WITH REPLACEMENT

    repeated independent simple random extractions, the extracted element stays in the population and can be drawn again

  • WITHOUT REPLACEMENT

    repeated extractions where each extracted element is removed from the population; in a small population a small subgroup can end up under-represented

  • STRATIFIED

    split the data into several partitions (strata) according to some criterion, then draw random samples from each partition
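
the sampling types above can be sketched with the Python standard library. This is a minimal illustration, not a reference implementation: the function names, the `fraction` parameter and the toy `data` list are assumptions made for the example, and uniform probabilities are assumed for the simple random draw.

```python
import random
from collections import defaultdict

def sample_with_replacement(population, n, rng):
    # n independent simple random draws; the same element may appear more than once
    return [rng.choice(population) for _ in range(n)]

def sample_without_replacement(population, n, rng):
    # each drawn element leaves the pool, so no element can repeat
    return rng.sample(population, n)

def stratified_sample(population, key, fraction, rng):
    # partition the data by key(item), then sample inside each partition
    strata = defaultdict(list)
    for item in population:
        strata[key(item)].append(item)
    sample = []
    for members in strata.values():
        # keep at least one element per stratum, never more than it contains
        k = min(len(members), max(1, round(fraction * len(members))))
        sample.extend(rng.sample(members, k))
    return sample

# toy dataset (hypothetical): 90 records of class "a", 10 of class "b"
rng = random.Random(0)
data = [("a", i) for i in range(90)] + [("b", i) for i in range(10)]

with_repl = sample_with_replacement(data, 10, rng)
without_repl = sample_without_replacement(data, 10, rng)
strat = stratified_sample(data, key=lambda item: item[0], fraction=0.1, rng=rng)
```

with `fraction=0.1` the stratified sample always holds 9 records of class "a" and 1 of class "b", while the two plain samples may contain no "b" record at all.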

SAMPLE SIZE

selecting the sample size is a tradeoff between data reduction and precision: a smaller sample is cheaper to process but loses more information; there are techniques to estimate a sample size that is still large enough to be meaningful

MISSING CLASSES

when sampling with replacement, the probability of drawing at least one element of each class depends on the class proportions and on the sample size, not on the size of the dataset
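
this probability can be computed directly. The sketch below assumes sampling with replacement and known class proportions that sum to 1; the function name `prob_all_classes_present` is hypothetical, and the inclusion-exclusion formula is one standard way to obtain the value (it enumerates all subsets of classes, so it is only practical for a small number of classes).

```python
from itertools import combinations

def prob_all_classes_present(proportions, n):
    """Probability that n draws with replacement hit every class at least once.

    proportions: class probabilities, assumed to sum to 1.
    n: sample size (number of draws).
    """
    k = len(proportions)
    total = 0.0
    for size in range(k + 1):
        for subset in combinations(proportions, size):
            # inclusion-exclusion term: (-1)^|S| * P(no draw falls in the classes of S)
            total += (-1) ** size * (1.0 - sum(subset)) ** n
    return total

# 10 equally likely classes (illustrative numbers)
equal = [0.1] * 10
p20 = prob_all_classes_present(equal, 20)
p60 = prob_all_classes_present(equal, 60)
```

the function takes only the proportions and the sample size as input, the dataset size never enters the computation; `p60` is larger than `p20` because a bigger sample is less likely to miss a class.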

this matters when a small dataset is used for cross-validation or train/test splits, because a partition may not contain enough data for every class
