DATA TYPES

| Data type | Subtype | Description | Examples | Descriptive statistics allowed | Domain |
|---|---|---|---|---|---|
| Categorical | nominal | Set of labels; the available information only allows one label to be distinguished from another. Operators: =, != | zip code, fiscal code | mode, entropy, contingency correlation | Discrete |
| Categorical | ordinal | Values can be ordered. Operators: <, <=, >, >= | non-numerical quality evaluations | median, percentiles, rank correlation | Discrete |
| Numerical | interval | Differences are meaningful: data can be added or subtracted. Operators: +, - | calendar dates, temperatures in °C and °F | mean, standard deviation | Continuous |
| Numerical | ratio | Has an unambiguous definition of 0, allowing all mathematical operations | temperatures in K | geometric mean, harmonic mean, percentage variation | Continuous |

INTERVAL DATA

  • used in statistical research
  • examples:
    • Temperature (in °C or °F)
    • Scores
    • Time (calendar dates)
    • IQ test

Interval vs Ratio

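Interval data does not preserve the ratios between values when the scale changes, because its zero point is arbitrary; ratio data does. For example, 20 °C is not "twice as hot" as 10 °C: the same two temperatures are 68 °F and 50 °F.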
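The difference can be checked numerically. The following is a minimal sketch; the function names and the sample temperatures are illustrative assumptions, not part of the notes. It compares the ratio of two temperatures before and after a scale change, once on an interval scale (Celsius to Fahrenheit) and once on a ratio scale (Kelvin to Rankine).

```python
import math

def celsius_to_fahrenheit(c):
    """Interval to interval: linear map with an offset (zero point moves)."""
    return c * 9 / 5 + 32

def kelvin_to_rankine(k):
    """Ratio to ratio: pure rescaling, zero stays at zero."""
    return k * 9 / 5

# Two sample temperatures (hypothetical values chosen for illustration).
cold_c, warm_c = 10.0, 20.0

# On an interval scale the ratio depends on the unit.
ratio_celsius = warm_c / cold_c
ratio_fahrenheit = celsius_to_fahrenheit(warm_c) / celsius_to_fahrenheit(cold_c)

# On a ratio scale the ratio is the same in every unit.
cold_k, warm_k = cold_c + 273.15, warm_c + 273.15
ratio_kelvin = warm_k / cold_k
ratio_rankine = kelvin_to_rankine(warm_k) / kelvin_to_rankine(cold_k)

print(ratio_celsius, ratio_fahrenheit)              # 2.0 1.36
print(math.isclose(ratio_kelvin, ratio_rankine))    # True
```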

ALLOWED TRANSFORMATIONS

| Data type | Transformation |
|---|---|
| nominal | any one-to-one correspondence |
| ordinal | any order-preserving transformation (any monotonic function) |
| interval | linear functions |
| ratio | any mathematical function, standardization, variation in percentage |
  • these transformations do not change the meaning of the attribute; they are used to standardize the data format.
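As a rough illustration of what "does not change the meaning" can look like in practice, the sketch below checks two cases: an order-preserving function keeps the ranking of ordinal values, and a linear function keeps the ratio between differences of interval values. The helper names `ranks` and `diff_ratio` and all sample values are assumptions made for this example.

```python
import math

def ranks(values):
    """Rank of each value (0 = smallest); tied values share the same rank."""
    order = sorted(values)
    return [order.index(v) for v in values]

def diff_ratio(values):
    """Ratio between two consecutive differences of a 3-element sequence."""
    return (values[1] - values[0]) / (values[2] - values[1])

# Ordinal data: hypothetical quality grades coded as integers.
grades = [1, 3, 2, 5]
increasing = [math.exp(g) for g in grades]        # order preserving
not_monotonic = [(g - 3) ** 2 for g in grades]    # not order preserving

print(ranks(increasing) == ranks(grades))      # True: ordering is kept
print(ranks(not_monotonic) == ranks(grades))   # False: ordering is lost

# Interval data: temperatures in Celsius, converted with a linear function.
celsius = [0.0, 10.0, 30.0]
fahrenheit = [c * 9 / 5 + 32 for c in celsius]

# Differences keep their proportions under a linear transformation.
print(math.isclose(diff_ratio(celsius), diff_ratio(fahrenheit)))  # True
```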

ASYMMETRIC ATTRIBUTES

  • attributes where only presence is relevant (non null value)
    • example: exams (only the exams a student has actually taken are of interest)
  • in particular, binary asymmetric attributes are relevant in the discovery of association rules

GENERAL CHARACTERISTICS OF DATA SETS

Dimensionality

  • the difference between having a small and a large number (hundreds, thousands, …) of attributes is qualitative as well as quantitative

Sparsity

  • a data set is sparse when it contains many zeros or nulls
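One common way to exploit sparsity is to store only the non-zero entries. The snippet below is a minimal sketch of that idea using a plain dictionary keyed by (row, column); the matrix values are made up for illustration, and real projects would more likely rely on a dedicated sparse-matrix library.

```python
# A small dense matrix where most entries are zero (hypothetical data).
dense = [
    [0, 0, 3, 0],
    [0, 0, 0, 0],
    [1, 0, 0, 2],
]

# Keep only the non-zero cells, indexed by (row, column).
sparse = {
    (r, c): value
    for r, row in enumerate(dense)
    for c, value in enumerate(row)
    if value != 0
}

# Fraction of cells that are zero.
n_cells = len(dense) * len(dense[0])
sparsity = 1 - len(sparse) / n_cells

print(sparse)     # {(0, 2): 3, (2, 0): 1, (2, 3): 2}
print(sparsity)   # 0.75
```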

Beware the nulls in disguise

  • a widespread bad habit is to store zero or some other special value when a piece of information is not available
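The effect of such disguised nulls can be seen in a small sketch. The sentinel values (0 and -999) and the sample ages below are assumptions for illustration; which values act as disguised nulls has to be learned from the data source.

```python
from statistics import mean

# Values assumed to mean "not available" in this hypothetical source.
SENTINELS = {0, -999}

raw_ages = [34, 0, 51, -999, 27]

# Make the missing values explicit instead of leaving them as numbers.
cleaned = [None if age in SENTINELS else age for age in raw_ages]
known = [age for age in cleaned if age is not None]

naive_mean = mean(raw_ages)   # distorted by the sentinel values
clean_mean = mean(known)      # computed on real observations only

print(cleaned)                # [34, None, 51, None, 27]
print(naive_mean < 0)         # True: a negative "average age"
print(round(clean_mean, 2))   # 37.33
```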

Resolution

  • the resolution (level of detail) of the data has a great influence on the results
  • the analysis of too detailed data can be affected by noise
  • the analysis of too general data can hide interesting patterns

The data is organized in records

  • Tables
  • Transactions
  • Data matrix
  • Sparse data matrix

DATA QUALITY

  • data from the source layer is often dirty and full of outliers due to noise (for example, web crawler activity mixed with human activity on websites)
    • there can be missing values, because some data was not collected
    • there can be duplicated values

DETECT OUTLIERS WITH DESCRIPTIVE STATISTICS

IQR = InterQuartile Range

IQR = Q3 - Q1
lower-boundary = Q1 - IQR * 1.5
upper-boundary = Q3 + IQR * 1.5

where Q1 is the first quartile and Q3 is the third quartile.

The outliers are the values that fall outside these boundaries.
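The rule above can be sketched in a few lines of Python. The function name `iqr_outliers` and the sample data are illustrative assumptions. Quartiles can be computed with several conventions; this sketch uses `statistics.quantiles` with `method="inclusive"`, and other conventions may give slightly different boundaries.

```python
from statistics import quantiles

def iqr_outliers(values, factor=1.5):
    """Return (lower, upper, outliers) using the IQR rule."""
    # quantiles(..., n=4) returns the three cut points Q1, Q2, Q3.
    q1, _, q3 = quantiles(values, n=4, method="inclusive")
    iqr = q3 - q1
    lower = q1 - factor * iqr
    upper = q3 + factor * iqr
    outliers = [v for v in values if v < lower or v > upper]
    return lower, upper, outliers

# Hypothetical measurements with one obviously anomalous value.
data = [10, 12, 12, 13, 12, 11, 14, 13, 15, 102]

lower, upper, outliers = iqr_outliers(data)
print(outliers)   # [102]
```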

HANDLING MISSING VALUES

| Strategy | Comment |
|---|---|
| ignore the missing values | extreme; generally not a good idea |
| insert all possible values, weighted by their probabilities | used in probabilistic learning; expensive |
| estimate the missing values | default choice |
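The notes do not say how the missing values should be estimated. As one simple possibility, the sketch below fills each gap with the median of the known values; the function name `impute_median` and the sample data are assumptions, and more refined estimators are possible.

```python
from statistics import median

def impute_median(values):
    """Replace None entries with the median of the known values."""
    known = [v for v in values if v is not None]
    fill = median(known)
    return [fill if v is None else v for v in values]

# Hypothetical measurements where None marks a missing value.
heights = [170.0, None, 165.0, 180.0, None]

# Strategy "ignore": simply drop the missing values.
ignored = [h for h in heights if h is not None]

# Strategy "estimate": fill the gaps with an estimate (here, the median).
estimated = impute_median(heights)

print(ignored)     # [170.0, 165.0, 180.0]
print(estimated)   # [170.0, 170.0, 165.0, 180.0, 170.0]
```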

DUPLICATED DATA

  • a major issue when merging data from different sources
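Duplicates that come from merging often differ only in formatting. The sketch below shows one way to catch them: normalize a key for each record and keep the first record per key. The record fields, the `normalize` helper, and the sample data are all hypothetical; real deduplication usually needs domain-specific matching rules.

```python
def normalize(record):
    """Build a comparison key that ignores case and surrounding spaces."""
    return (record["name"].strip().lower(), record["email"].strip().lower())

# Two hypothetical sources describing partly the same people.
source_a = [
    {"name": "Ada Lovelace", "email": "ada@example.com"},
    {"name": "Grace Hopper", "email": "grace@example.com"},
]
source_b = [
    {"name": "ada lovelace ", "email": "ADA@example.com"},
    {"name": "Alan Turing", "email": "alan@example.com"},
]

# Keep the first record seen for each normalized key.
merged = {}
for record in source_a + source_b:
    merged.setdefault(normalize(record), record)

deduplicated = list(merged.values())
print(len(source_a) + len(source_b), len(deduplicated))   # 4 3
```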

NEXT