Clustering

Introduction

Clustering deals with finding structure in some set of unlabeled data. It is: “the process of organizing objects into groups whose members are similar in some way” [1]. A cluster is therefore a collection of objects which are "similar" in someway, and are "dissimilar" to the objects belonging to other clusters.


clustering.gif

Figure: Spatial clustering[1]


Goals of clustering

The goal of clustering is to determine some (implied) grouping in a set of unlabeled data. It can be shown that there is no absolute "best" criterion which would be independent of the final aim of the clustering. As a consequence it is the user which must supply the criterion (in such a way as to satisfy some underlying motive).


Methods of clustering

There are a number of techniques that have been extensively studied.

Partitioning algorithms

These algorithms construct varios partitions and then evaluate them by some criterion.
The basic concept is to construct a partition of a dataset D of n objects into a set of k clusters.
Given a k, find a partition of k clusters that optimizes the chosen partitioning criterion.

  • Global optimal: exhaustively enumerate all partitions.
  • Heuristic methods: k-means and k-medoids algorithms
  • k-means: each cluster is represented by the center of the cluster
  • k-medoids: aka PAM (Partition around medoids) each cluster is representedby one of the objects in the cluster.

Hiearchy algorithms

These create a hiearchical decomposition of the data set using some criterion. There are two approaches.

  • Agglomerative: (aka "Bottom-up") merges clusters iteratively.
  • Divisive: (aka "top-down") split a cluster iteratively.

Clustering methods (notably agglomerative ones) do not scale well. These is a time complexity of at least O(n2). Integration of hierarchical clustering with distance-based method is also possible.

Density-based

Based on connectivity and density functions.

Grid-based

Based on a multiple-level granularity structure.

Model-based

A model is hypothesized for each of the clusters. The best of that model is then applied to establish the clusters.


Bibliography
1. Cluster tutorial
2. Osmar R. Zaïane: "Principles of Knowledge Discovery in Databases - Chapter 8: Data Clustering".
Unless otherwise stated, the content of this page is licensed under Creative Commons Attribution-ShareAlike 3.0 License