K-Means clustering method

A type of partitioning algorithm used to cluster data. We construct a partitioned of a database **D** of **n** objects into a set of **k** clusters.

Given ** k**,

- Partition the data into
*k*nonempty subsets - Compute seed points as the centroids of the clusters of the current partition. The centroid is the mean point (center) of the cluster
- Assign each object to the cluster with the nearest seed point
- Return to the 2
^{nd}step and stop when no more assignments occur.

## Strengths

- Relatively efficient (
*O(tkn)*, where*n*is # of objects,*k*is # of clusters and*t*is # of iterations. - Often terminates at a local optimum

## Weaknesses

- Applicable only when
*mean*is definable - We need to specify
*k*in advance - Unable to handle noisy data and outliers.
- Not suitable to discover clusters with non-convex shapes.

## Variations

The k-means method can vary in:

- Selection of intial
*k*means - Dissimilarity calculations
- Strategies to calculate cluster means

### Handling categorical data

We can handle categorical data with by:

- Replacing
*means of clusters*with*modes* - Using new dissimilarity measures to deal with categorical objects
- Using a
*frequency*based method to update modes of clusters - A mixture of categorical and numerical data (k-prototype method)

page revision: 3, last edited: 29 Apr 2008 03:28