## Introduction

*Clustering* deals with finding structure in some set of unlabeled data. It is: “the process of organizing objects into groups whose members are similar in some way” [1]. A cluster is therefore a collection of objects which are "similar" in someway, and are "dissimilar" to the objects belonging to other clusters.

Figure: Spatial clustering[1]

## Goals of clustering

The goal of clustering is to determine some (implied) grouping in a set of unlabeled data. It can be shown that there is no absolute "best" criterion which would be independent of the final aim of the clustering. As a consequence it is the user which must supply the criterion (in such a way as to satisfy some underlying motive).

## Methods of clustering

There are a number of techniques that have been extensively studied.

### Partitioning algorithms

These algorithms construct varios partitions and then evaluate them by some criterion.

The basic concept is to construct a partition of a dataset **D** of **n** objects into a set of **k** clusters.

Given a *k*, find a partition of *k clusters* that optimizes the chosen partitioning criterion.

- Global optimal: exhaustively enumerate all partitions.
- Heuristic methods: k-means and k-medoids algorithms
- k-means: each cluster is represented by the center of the cluster
- k-medoids: aka PAM (Partition around medoids) each cluster is representedby one of the objects in the cluster.

### Hiearchy algorithms

These create a hiearchical decomposition of the data set using some criterion. There are two approaches.

- Agglomerative: (aka "Bottom-up") merges clusters iteratively.
- Divisive: (aka "top-down") split a cluster iteratively.

Clustering methods (notably agglomerative ones) do not scale well. These is a time complexity of *at least* **O(n ^{2})**. Integration of hierarchical clustering with distance-based method is also possible.

### Density-based

Based on connectivity and density functions.

### Grid-based

Based on a multiple-level granularity structure.

### Model-based

A model is hypothesized for each of the clusters. The best of that model is then applied to establish the clusters.

*Principles of Knowledge Discovery in Databases - Chapter 8: Data Clustering*".