K-Means clustering method
A type of partitioning algorithm used to cluster data. We construct a partitioned of a database D of n objects into a set of k clusters.
Given k,
- Partition the data into k nonempty subsets
- Compute seed points as the centroids of the clusters of the current partition. The centroid is the mean point (center) of the cluster
- Assign each object to the cluster with the nearest seed point
- Return to the 2nd step and stop when no more assignments occur.
Strengths
- Relatively efficient (O(tkn), where n is # of objects, k is # of clusters and t is # of iterations.
- Often terminates at a local optimum
Weaknesses
- Applicable only when mean is definable
- We need to specify k in advance
- Unable to handle noisy data and outliers.
- Not suitable to discover clusters with non-convex shapes.
Variations
The k-means method can vary in:
- Selection of intial k means
- Dissimilarity calculations
- Strategies to calculate cluster means
Handling categorical data
We can handle categorical data with by:
- Replacing means of clusters with modes
- Using new dissimilarity measures to deal with categorical objects
- Using a frequency based method to update modes of clusters
- A mixture of categorical and numerical data (k-prototype method)
page revision: 3, last edited: 29 Apr 2008 03:28