K-Means clustering method

A type of partitioning algorithm used to cluster data. We construct a partitioned of a database D of n objects into a set of k clusters.

Given k,

  • Partition the data into k nonempty subsets
  • Compute seed points as the centroids of the clusters of the current partition. The centroid is the mean point (center) of the cluster
  • Assign each object to the cluster with the nearest seed point
  • Return to the 2nd step and stop when no more assignments occur.

Strengths

  • Relatively efficient (O(tkn), where n is # of objects, k is # of clusters and t is # of iterations.
  • Often terminates at a local optimum

Weaknesses

  • Applicable only when mean is definable
  • We need to specify k in advance
  • Unable to handle noisy data and outliers.
  • Not suitable to discover clusters with non-convex shapes.

Variations

The k-means method can vary in:

  • Selection of intial k means
  • Dissimilarity calculations
  • Strategies to calculate cluster means

Handling categorical data

We can handle categorical data with by:

  • Replacing means of clusters with modes
  • Using new dissimilarity measures to deal with categorical objects
  • Using a frequency based method to update modes of clusters
  • A mixture of categorical and numerical data (k-prototype method)
Unless otherwise stated, the content of this page is licensed under Creative Commons Attribution-ShareAlike 3.0 License