DATA TYPES

	Data types	Description	Examples	Descriptive statistics allowed	Domain
Categorical	nominal	Set of labels; the available information only allows distinguishing one label from another. Operators: =, !=	zip code, fiscal code	mode, entropy, contingency correlation	Discrete
	ordinal	The available information allows ordering the labels. Operators: <, <=, >, >=	non-numerical quality evaluations	median, percentiles, rank correlation	Discrete
Numerical	interval	Differences are meaningful, so data can be added or subtracted. Operators: +, -	calendar dates, temperatures in °C and °F	mean, standard deviation	Continuous
	ratio	Has an unambiguous definition of 0, allowing all mathematical operations	temperatures in K	geometric mean, harmonic mean, percentage variation	Continuous

INTERVAL DATA

Used in statistical research.
Examples:
- Temperature (in °C or °F)
- Scores
- Time
- IQ test

Interval vs Ratio

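An interval scale does not preserve the ratio between values when the scale changes: 20 °C is numerically twice 10 °C, but the same two temperatures are 68 °F and 50 °F. A ratio scale does preserve it, because its zero is not arbitrary.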
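
A minimal sketch of the interval vs ratio distinction, using temperature conversions. The function names are hypothetical and chosen only for illustration; the conversion formulas are the standard ones.

```python
def celsius_to_fahrenheit(c):
    """Interval to interval: a linear function with an offset (zero moves)."""
    return c * 9 / 5 + 32


def kelvin_to_rankine(k):
    """Ratio to ratio: a pure rescaling (zero stays at zero)."""
    return k * 9 / 5


# Interval scale: the ratio between two values depends on the unit.
ratio_celsius = 20.0 / 10.0
ratio_fahrenheit = celsius_to_fahrenheit(20.0) / celsius_to_fahrenheit(10.0)  # 68 / 50

# Ratio scale: the ratio between two values survives the change of unit.
ratio_kelvin = 300.0 / 150.0
ratio_rankine = kelvin_to_rankine(300.0) / kelvin_to_rankine(150.0)  # 540 / 270

print("interval:", ratio_celsius, "vs", ratio_fahrenheit)
print("ratio:   ", ratio_kelvin, "vs", ratio_rankine)
```

The offset of 32 in the Celsius to Fahrenheit conversion is what breaks the ratio; differences between values are still comparable on both scales.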

ALLOWED TRANSFORMATIONS

data type	transformation
nominal	any one-to-one correspondence
ordinal	any order-preserving transformation (any monotonic function)
interval	linear functions
ratio	any mathematical function, standardization, variation in percentage

These transformations do not change the meaning of the attribute; they are used to standardize the data format.
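
A rough sketch of what "does not change the meaning" looks like for the first three data types. All values and the relabeling map are made up for illustration; the exponential is just one example of a strictly increasing function.

```python
import math

# Nominal: a one-to-one relabeling keeps equality and inequality intact.
zip_codes = ["40121", "40126", "40121"]
relabel = {"40121": "A", "40126": "B"}  # hypothetical one-to-one mapping
recoded = [relabel[z] for z in zip_codes]
equal_before = zip_codes[0] == zip_codes[2]
equal_after = recoded[0] == recoded[2]

# Ordinal: a strictly increasing function keeps the ordering intact.
ranks = [1, 3, 2, 5]  # hypothetical quality ranks
stretched = [math.exp(r) for r in ranks]
order_before = sorted(range(len(ranks)), key=ranks.__getitem__)
order_after = sorted(range(len(stretched)), key=stretched.__getitem__)

# Interval: a linear function keeps the ratio between differences intact.
celsius = [0.0, 10.0, 30.0]
fahrenheit = [c * 9 / 5 + 32 for c in celsius]
diff_ratio_c = (celsius[2] - celsius[1]) / (celsius[1] - celsius[0])
diff_ratio_f = (fahrenheit[2] - fahrenheit[1]) / (fahrenheit[1] - fahrenheit[0])

print(equal_before, equal_after)
print(order_before, order_after)
print(diff_ratio_c, diff_ratio_f)
```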

ASYMMETRIC ATTRIBUTES

Attributes where only presence (a non-null value) is relevant.
- example: exams

In particular, binary asymmetric attributes are relevant in the discovery of association rules.

GENERAL CHARACTERISTICS OF DATA SETS

Dimensionality

The difference between having a small or a large number (hundreds, thousands, …) of attributes is also qualitative, not only quantitative.

Sparsity

A data set is sparse when it contains many zeros or nulls.

Beware the nulls in disguise

A widespread bad habit is to store zero or some other special value when a piece of information is not available.
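
A small sketch of why disguised nulls are harmful and how they might be unmasked. The sentinel set and the `ages` list are hypothetical; in practice the placeholder values have to be discovered from the data documentation or from the value distribution.

```python
# Hypothetical placeholder values that a source system stores for "unknown".
SENTINELS = {0, -1, 9999}

ages = [34, 0, 27, 9999, 41, -1]

# Turn the disguised nulls into explicit nulls (None).
cleaned = [None if a in SENTINELS else a for a in ages]
known = [a for a in cleaned if a is not None]

naive_mean = sum(ages) / len(ages)    # polluted by the placeholder values
clean_mean = sum(known) / len(known)  # computed on real observations only

print(cleaned)
print(naive_mean, clean_mean)
```

This only works when the placeholder can never be a legitimate value; a zero that is sometimes real and sometimes "unknown" cannot be separated this way.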

Resolution

The resolution of the data has a great influence on the results:
- the analysis of too detailed data can be affected by noise
- the analysis of too general data can hide interesting patterns

The data is organized in records

Tables
Transactions
Data matrix
Sparse data matrix
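
As an illustration of the last item, a sparse data matrix can be stored by keeping only its non-zero entries. This is a minimal dictionary-based sketch with made-up values, not the layout of any specific library.

```python
# A small, mostly empty data matrix (rows = records, columns = attributes).
dense = [
    [0, 0, 3, 0],
    [0, 0, 0, 0],
    [1, 0, 0, 0],
]

# Sparse form: map (row, column) to the value, skipping every zero.
sparse = {
    (r, c): value
    for r, row in enumerate(dense)
    for c, value in enumerate(row)
    if value != 0
}


def get(matrix, r, c):
    """Read a cell from the sparse form; missing keys are implicit zeros."""
    return matrix.get((r, c), 0)


n_cells = len(dense) * len(dense[0])
sparsity = 1 - len(sparse) / n_cells  # fraction of cells that are zero

print(sparse)
print(get(sparse, 0, 2), get(sparse, 1, 1), sparsity)
```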

DATA QUALITY

Data from the source layer is often dirty and full of outliers due to noise (for example, web crawler activity mixed with human activity on websites).
- there can be missing values because some data was not collected
- there can be duplicated values

DETECT OUTLIERS WITH DESCRIPTIVE STATISTICS

IQR = InterQuartile Range

IQR = Q3 - Q1
lower-boundary = Q1 - IQR * 1.5
upper-boundary = Q3 + IQR * 1.5

where Q1 is the first quartile and Q3 is the third quartile.

The outliers are the values outside the boundaries.
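
The IQR rule can be sketched with the standard library as follows. The function name `iqr_outliers` and the sample data are hypothetical; the quartiles are computed with `statistics.quantiles` using the "inclusive" method, which is one of several common conventions, so other tools may give slightly different boundaries.

```python
import statistics


def iqr_outliers(values):
    """Return (outliers, lower_boundary, upper_boundary) using the 1.5 * IQR rule."""
    # quantiles(n=4) returns the three cut points Q1, Q2 (median), Q3.
    q1, _, q3 = statistics.quantiles(values, n=4, method="inclusive")
    iqr = q3 - q1
    lower = q1 - 1.5 * iqr
    upper = q3 + 1.5 * iqr
    outliers = [v for v in values if v < lower or v > upper]
    return outliers, lower, upper


data = [10, 12, 11, 13, 12, 14, 11, 95]  # made-up sample with one extreme value
outliers, lower, upper = iqr_outliers(data)
print(outliers, lower, upper)
```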

HANDLING MISSING VALUES

strategy	comment
ignore the values that are missing	extreme, generally not a good idea
insert all possible values weighted with probabilities	used in probabilistic learning, expensive
estimate the missing values	default choice
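
A minimal sketch of the "estimate" strategy. The notes do not say which estimator to use; filling with the median of the known values is an assumption made here only for illustration, and the function name and readings are hypothetical.

```python
import statistics


def impute_with_median(values):
    """Replace None entries with the median of the known entries."""
    known = [v for v in values if v is not None]
    fill = statistics.median(known)
    return [fill if v is None else v for v in values]


readings = [21.0, None, 19.5, 22.0, None, 20.5]  # None marks a missing value
filled = impute_with_median(readings)
print(filled)
```

The median is less sensitive to outliers than the mean, which is why it is used in this sketch; a model-based estimate from the other attributes is another option.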

DUPLICATED DATA

Duplicates are a major issue when merging data from different sources.
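
One possible way to catch duplicates while merging, assuming records can be matched on a normalized key. The record fields, the normalization rule, and the sample sources are all hypothetical; real sources often need fuzzier matching than this.

```python
def normalize(record):
    """Build a comparison key that ignores case and surrounding whitespace."""
    return (record["name"].strip().lower(), record["email"].strip().lower())


def merge_without_duplicates(*sources):
    """Concatenate the sources, keeping the first record seen for each key."""
    seen = set()
    merged = []
    for source in sources:
        for record in source:
            key = normalize(record)
            if key not in seen:
                seen.add(key)
                merged.append(record)
    return merged


source_a = [{"name": "Ada Lovelace", "email": "ada@example.com"}]
source_b = [
    {"name": " ada lovelace", "email": "ADA@example.com"},  # same person, different formatting
    {"name": "Alan Turing", "email": "alan@example.com"},
]

merged = merge_without_duplicates(source_a, source_b)
print(merged)
```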
