K-MEANS
algorithm for finding the best clustering scheme
PROCEDURE
flowchart TD
A[acquire the number of cluster k as an input of the user];
B[chose k random points as temprary centers];
C[find the nearest center and define \n the first temprary clustering scheme];
D[computes the centroid of the cluster];
A-->B
B-->C
C-->D
D-->|repeat until the best \n clustering scheme|A
DISTORTION (SUM OF SQUARE ERRORS SSE)
is the sum of all distances between an element of the dataset and it’s encoded/decoded output squared
it measures how much the clustering scheme change the dataset
so in order to have a minimal distortion of the dataset:
- the function must translate in the nearest center
- the gradient of the distortion function w.r.t. the center of the cluster must be so
- each center must be the centroid of the points it owns
CHOOSING STARTING POINT
choosing the starting point correctly is important, some possible choices are:
- select random starting points
- choose the starting point as far as possible from the previious ones
CHOOSING THE NUMBER OF CLUSTERS
choosing the number of clusters correctly is important, one possible strategy is to use a quantitive evaluation of the quality of the clustering scheme.
The best value to aim to is a compromise between the minimization of intra-cluster distances and the maximization of the inter-cluster distances
EMPTY CLUSTER
during the clustering some clusters can become empty, so in this case there are 2 choices:
- choose a new centroid away from the empty one
- choose a new centroid at random with the maximum SSE in order to split in half the cluster with the lowest quality
OUTLIERS
there can be points far away from the centroid, this points are a bad influence for the SSE, in some cases this points need to be removed
COMPLEXITY
the complexity of the algorithm is :
- as the number of iterations
- as the number of clusters
- as the number of data points
- as the number of dimensions
PROS
- kmeans is efficient nearly linear in the number of datapoints
CONS
- k-means cannot work in space where distance cannot be computed
- cannot work with nominal data
- requires the K parameter (it can be computed but it is a cost)
- it is very sensitive to outliers
- does not deal with noise
- does not deal properly with non convex clusters