K-MEANS

algorithm for finding the best clustering scheme

PROCEDURE

flowchart TD
	A[acquire the number of cluster k as an input of the user];
	B[chose k random points as temprary centers];
	C[find the nearest center and define \n the first temprary clustering scheme];
	D[computes the centroid of the cluster];
	A-->B
	B-->C
	C-->D
	D-->|repeat until the best \n clustering scheme|A

DISTORTION (SUM OF SQUARE ERRORS SSE)

is the sum of all distances between an element of the dataset and it’s encoded/decoded output squared

it measures how much the clustering scheme change the dataset

so in order to have a minimal distortion of the dataset:

  • the function must translate in the nearest center
  • the gradient of the distortion function w.r.t. the center of the cluster must be so
  • each center must be the centroid of the points it owns

CHOOSING STARTING POINT

choosing the starting point correctly is important, some possible choices are:

  • select random starting points
  • choose the starting point as far as possible from the previious ones

CHOOSING THE NUMBER OF CLUSTERS

choosing the number of clusters correctly is important, one possible strategy is to use a quantitive evaluation of the quality of the clustering scheme.

The best value to aim to is a compromise between the minimization of intra-cluster distances and the maximization of the inter-cluster distances

EMPTY CLUSTER

during the clustering some clusters can become empty, so in this case there are 2 choices:

  • choose a new centroid away from the empty one
  • choose a new centroid at random with the maximum SSE in order to split in half the cluster with the lowest quality

OUTLIERS

there can be points far away from the centroid, this points are a bad influence for the SSE, in some cases this points need to be removed

COMPLEXITY

the complexity of the algorithm is :

  • as the number of iterations
  • as the number of clusters
  • as the number of data points
  • as the number of dimensions

PROS

  • kmeans is efficient nearly linear in the number of datapoints

CONS

  • k-means cannot work in space where distance cannot be computed
  • cannot work with nominal data
  • requires the K parameter (it can be computed but it is a cost)
  • it is very sensitive to outliers
  • does not deal with noise
  • does not deal properly with non convex clusters

PREVIOUS NEXT