Clustering

## Introduction

Clustering deals with finding structure in some set of unlabeled data. It is: “the process of organizing objects into groups whose members are similar in some way” . A cluster is therefore a collection of objects which are "similar" in someway, and are "dissimilar" to the objects belonging to other clusters. Figure: Spatial clustering

## Goals of clustering

The goal of clustering is to determine some (implied) grouping in a set of unlabeled data. It can be shown that there is no absolute "best" criterion which would be independent of the final aim of the clustering. As a consequence it is the user which must supply the criterion (in such a way as to satisfy some underlying motive).

## Methods of clustering

There are a number of techniques that have been extensively studied.

### Partitioning algorithms

These algorithms construct varios partitions and then evaluate them by some criterion.
The basic concept is to construct a partition of a dataset D of n objects into a set of k clusters.
Given a k, find a partition of k clusters that optimizes the chosen partitioning criterion.

• Global optimal: exhaustively enumerate all partitions.
• Heuristic methods: k-means and k-medoids algorithms
• k-means: each cluster is represented by the center of the cluster
• k-medoids: aka PAM (Partition around medoids) each cluster is representedby one of the objects in the cluster.

### Hiearchy algorithms

These create a hiearchical decomposition of the data set using some criterion. There are two approaches.

• Agglomerative: (aka "Bottom-up") merges clusters iteratively.
• Divisive: (aka "top-down") split a cluster iteratively.

Clustering methods (notably agglomerative ones) do not scale well. These is a time complexity of at least O(n2). Integration of hierarchical clustering with distance-based method is also possible.

### Density-based

Based on connectivity and density functions.

### Grid-based

Based on a multiple-level granularity structure.

### Model-based

A model is hypothesized for each of the clusters. The best of that model is then applied to establish the clusters.

Bibliography
1. Cluster tutorial
2. Osmar R. Zaïane: "Principles of Knowledge Discovery in Databases - Chapter 8: Data Clustering".
page revision: 14, last edited: 29 Apr 2008 03:56