Clustering

Objectives

  • Understand the goal of clustering as an unsupervised learning method.
  • Learn how k-means and hierarchical clustering assign observations to groups.
  • Explore linkage methods and distance metrics used in clustering.
  • Evaluate clustering results using cluster validity metrics.
  • Visualize clustering output using dendrograms and heatmaps.

Motivation for Clustering

When plotting two numerical variables against each other, ask:

  • Are all observations part of one homogeneous group?
  • Or do some form distinct subgroups that behave more similarly to each other?

Key points:

  • Clustering is an unsupervised learning method.
  • Each data point is assigned to a group (cluster).
  • Points within the same cluster should be similar to one another.
  • Points in different clusters should be relatively dissimilar.

Heterogeneous Populations

  • Most populations are heterogeneous.
  • Benefits of identifying subpopulations:
    • Deeper EDA and sanity checks (for classification problems)
    • Unique trends may require unique models
    • Better decision-making in downstream tasks

k-Means Clustering

Metric for Cluster Fit

Assuming observations are already assigned to clusters, there is a trade-off between:

  • Within-cluster variability
  • Between-cluster variability

Within-cluster metric:

  • Let \(K\) be the number of clusters.
  • Let \(C_k\) be the set of observations assigned to cluster \(k\), for \(k = 1, \dots, K\).
  • Define within-cluster variability \(W(C_k)\) as: \[ W(C_k) = \frac{1}{|C_k|} \sum_{i,j \in C_k} d(x_i, x_j)^2 \] where \(|C_k|\) is the number of observations in cluster \(k\) and \(d(x_i, x_j)\) is the Euclidean distance between two points in the same cluster.
  • The goal is to minimize the total within-cluster variability, \(\sum_{k=1}^{K} W(C_k)\); a sketch of computing this quantity follows below.
  • However, total within-cluster variability can always be driven down by increasing \(K\) (it reaches zero when every observation is its own cluster), so the key question is: how many clusters are actually present?
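As a concrete reference, here is a minimal Python sketch (NumPy and SciPy; the data matrix `X` and label vector `labels` are hypothetical inputs) that computes this quantity for a given assignment:

```python
import numpy as np
from scipy.spatial.distance import pdist

def total_within_cluster_variability(X, labels):
    """Sum of W(C_k) over clusters, where W(C_k) averages the squared
    pairwise Euclidean distances within cluster k."""
    total = 0.0
    for k in np.unique(labels):
        members = X[labels == k]
        if len(members) > 1:
            # pdist returns each unordered pair once; the factor of 2
            # matches a sum over ordered pairs (i, j).
            total += 2.0 * np.sum(pdist(members) ** 2) / len(members)
    return total
```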

Challenges in Minimizing \(\sum_k W(C_k)\)

  • The number of possible cluster assignments is enormous (roughly \(K^n\) ways to assign \(n\) observations to \(K\) clusters) and grows explosively with both \(K\) and \(n\).
  • k-means clustering may converge to a local minimum rather than the global optimum.
  • The number of clusters \(K\) must be specified upfront to initialize the algorithm.

k-Means Algorithm

  1. Randomly assign observations to cluster groups.
  2. Reassignment:
    1. Compute the centroid of each group (i.e., the mean of the assigned observations).
    2. Reassign observations to the nearest centroid.
  3. Repeat step 2 until the cluster assignments stop changing (i.e., within-cluster variability has converged).
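A from-scratch sketch of these steps (NumPy only; names are illustrative, and empty clusters are not handled for brevity):

```python
import numpy as np

def kmeans(X, K, max_iter=100, seed=0):
    rng = np.random.default_rng(seed)
    # Step 1: randomly assign each observation to one of K clusters.
    labels = rng.integers(K, size=len(X))
    for _ in range(max_iter):
        # Step 2a: centroid of each group = mean of its assigned observations.
        centroids = np.array([X[labels == k].mean(axis=0) for k in range(K)])
        # Step 2b: reassign each observation to its nearest centroid.
        dists = np.linalg.norm(X[:, None, :] - centroids[None, :, :], axis=2)
        new_labels = dists.argmin(axis=1)
        # Step 3: stop once assignments (and hence W) no longer change.
        if np.array_equal(new_labels, labels):
            break
        labels = new_labels
    return labels, centroids
```

In practice, `sklearn.cluster.KMeans` adds multiple random restarts (`n_init`) to reduce the risk of stopping at a poor local minimum.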

Defining “Close”

Distance metrics:

  • Euclidean distance
  • Mahalanobis distance
  • Manhattan distance
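All three are available in SciPy; a small sketch with made-up points:

```python
import numpy as np
from scipy.spatial.distance import euclidean, cityblock, mahalanobis

x, y = np.array([1.0, 2.0]), np.array([4.0, 6.0])

print(euclidean(x, y))   # sqrt(3^2 + 4^2) = 5.0
print(cityblock(x, y))   # |1-4| + |2-6| = 7.0 (Manhattan distance)

# Mahalanobis scales by the inverse covariance of the data; with an
# identity matrix it reduces to Euclidean distance.
VI = np.eye(2)
print(mahalanobis(x, y, VI))  # 5.0
```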

Hierarchical Clustering

  • Does not require specifying \(K\) in advance.
  • Begins with each observation as its own cluster.
  • Fuses observations to create progressively coarser clusters.
  • Produces a dendrogram to visualize the clustering process.

Linkage

  • A distance (dissimilarity) metric is used between observations.
  • Clusters can fuse:
    • Two points
    • One cluster and one point
    • Two clusters
  • The dissimilarity metric between clusters is called linkage.

Common Linkage Types

  • Complete linkage uses the maximum pairwise distance.
  • Single linkage uses the minimum pairwise distance.
  • Average linkage uses the mean of all pairwise distances.
  • Centroid linkage uses distance between cluster centroids.

Complete Linkage

  • Maximal inter-cluster dissimilarity
  • Compute all pairwise distances between cluster A and cluster B
  • Record the largest pairwise distance between the clusters
  • Merge clusters with the smallest maximum distance (i.e., smallest dissimilarity)

Single Linkage

  • Minimal inter-cluster dissimilarity
  • Compute all pairwise distances between cluster A and cluster B
  • Record the smallest pairwise distance between the clusters
  • Merge clusters with the smallest minimum distance (i.e., smallest dissimilarity)
  • Can produce trailing, chain-like dendrograms in which single observations are fused one at a time onto an existing cluster.

Average Linkage

  • Mean inter-cluster dissimilarity
  • Compute all pairwise distances between cluster A and cluster B
  • Record the mean distance between clusters
  • Repeat for every pair of clusters
  • Merge clusters with the smallest average

Centroid Linkage

  • Dissimilarity between cluster centroids
  • Compute the centroid (variable-wise mean) of each cluster
  • Merge clusters whose centroids are closest (i.e., have the smallest distance)
  • Can produce inversions, where a fusion occurs at a lower height than an earlier fusion, making the dendrogram harder to interpret
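All four linkages are implemented in SciPy's hierarchical clustering routines; a sketch on simulated two-group data (names and parameters illustrative):

```python
import numpy as np
from scipy.cluster.hierarchy import linkage, dendrogram, fcluster

rng = np.random.default_rng(0)
X = np.vstack([rng.normal(0, 0.5, (20, 2)),
               rng.normal(3, 0.5, (20, 2))])

for method in ["complete", "single", "average", "centroid"]:
    Z = linkage(X, method=method)                    # Euclidean distances
    labels = fcluster(Z, t=2, criterion="maxclust")  # cut the tree at 2 clusters
    print(method, np.bincount(labels)[1:])           # cluster sizes

# dendrogram(Z) draws the tree for visual inspection of any fitted linkage.
```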

Considerations for Either Tool

Key Considerations

  • Which dissimilarity metric? (distance or correlation)
  • Should variables be standardized?
  • Which linkage type? (for hierarchical)
  • How many clusters?

Correlation vs. Distance

  • Observations may be far apart in distance yet highly correlated (i.e., have similar profiles).
  • Example (shoppers as observations, items purchased as variables):
    • A distance metric separates heavy shoppers from light shoppers (overall purchase volume).
    • A correlation metric groups shoppers who buy similar mixes of items, regardless of volume; see the sketch below.
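A tiny numeric illustration of the difference, using hypothetical purchase counts for two shoppers:

```python
import numpy as np

light = np.array([1, 0, 2, 1])     # light shopper's item counts
heavy = np.array([10, 0, 20, 10])  # heavy shopper, same buying pattern

print(np.linalg.norm(light - heavy))        # large Euclidean distance (~22)
print(1 - np.corrcoef(light, heavy)[0, 1])  # 0: identical buying pattern
```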

Standardization

  • Without standardization: variables with large scales dominate clustering.
  • With standardization: each variable contributes equally.
  • Recommendation: Standardize unless there is a strong reason not to.
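A one-line z-score standardization with SciPy (illustrative data):

```python
import numpy as np
from scipy.stats import zscore

rng = np.random.default_rng(0)
X = np.column_stack([rng.normal(0, 1, 100),      # small-scale variable
                     rng.normal(0, 1000, 100)])  # would dominate distances

X_std = zscore(X, axis=0)  # each column now has mean 0 and std. dev. 1
print(X_std.std(axis=0))   # [1. 1.]
```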

Choice of Linkage

  • Results vary depending on the linkage method.
  • Complete and average linkage:
    • Commonly used
    • Produce easier-to-read dendrograms
  • Single linkage:
    • Useful for identifying outliers
    • May form loosely connected clusters (i.e., observations within a cluster may be less similar than desired).

Choosing the Number of Clusters

  • Apply clustering with several values of \(K\).
  • Compute cluster validity metrics.
  • Plot each metric against cluster size to guide the decision.
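A minimal sketch of this loop, assuming scikit-learn's k-means and the silhouette statistic on simulated three-group data:

```python
import numpy as np
from sklearn.cluster import KMeans
from sklearn.metrics import silhouette_score

rng = np.random.default_rng(0)
X = np.vstack([rng.normal(m, 0.5, (30, 2)) for m in (0, 3, 6)])

for K in range(2, 7):
    labels = KMeans(n_clusters=K, n_init=10, random_state=0).fit_predict(X)
    print(K, round(silhouette_score(X, labels), 3))
# Plotting these values against K should show a peak at K = 3 here.
```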

Cluster Validity Metrics

Internal Cluster Validity

  • No ground truth available
  • Estimate within-cluster vs. between-cluster variation

Common Metrics

  • Silhouette statistic
  • Dunn index
  • Cubic clustering criterion (SAS)

Silhouette Statistic

For each observation:

  • Calculate the average dissimilarity to the other points in its own cluster (“within”).
  • Calculate the minimum, over all other clusters, of the average dissimilarity to the points in that cluster (“between”).
  • The silhouette value for observation \(i\) is: \[ \text{Silhouette}_i = \frac{\text{between} - \text{within}}{\max(\text{between}, \text{within})} \]
  • Averaging the silhouette values across observations summarizes overall clustering quality.
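A direct translation of this definition into Python (assumes every cluster has at least two members; `X` and `labels` are hypothetical inputs):

```python
import numpy as np
from scipy.spatial.distance import cdist

def silhouette_values(X, labels):
    s = np.empty(len(X))
    for i in range(len(X)):
        own = labels[i]
        same = X[labels == own]
        # Average dissimilarity to the *other* members of i's own cluster
        # (the self-distance is zero, so divide by |C_k| - 1).
        within = cdist(X[i:i+1], same).sum() / (len(same) - 1)
        # Minimum average dissimilarity to any other cluster.
        between = min(cdist(X[i:i+1], X[labels == c]).mean()
                      for c in np.unique(labels) if c != own)
        s[i] = (between - within) / max(between, within)
    return s
```

`sklearn.metrics.silhouette_samples` computes the same per-observation values, and `silhouette_score` returns their average.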

Dunn Index

  • Ratio: \[ \frac{\text{minimum distance between points in different clusters}}{\text{maximum distance between points within the same cluster}} \]
  • Higher Dunn index indicates stronger cluster separation.
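scikit-learn does not ship a Dunn index, so here is a small from-scratch version (assuming every cluster has at least two points):

```python
import numpy as np
from scipy.spatial.distance import cdist, pdist

def dunn_index(X, labels):
    clusters = np.unique(labels)
    # Numerator: smallest distance between points in different clusters.
    min_between = min(cdist(X[labels == a], X[labels == b]).min()
                      for i, a in enumerate(clusters)
                      for b in clusters[i + 1:])
    # Denominator: largest within-cluster distance (cluster "diameter").
    max_within = max(pdist(X[labels == c]).max() for c in clusters)
    return min_between / max_within
```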

Additional Metrics

  • Davies-Bouldin index (DB)
  • Maulik-Bandyopadhyay index (MB)
  • Saitta score function (SF)

The Clustering Challenge

  • Results depend on choices made (distance metric, standardization, linkage type).
  • No single method is perfect for determining the number of clusters.
  • Avoid relying on any single clustering result as absolute truth.
  • Examine multiple clustering solutions to identify general patterns.
  • Simulation studies show clustering performance depends on cluster separation, noise, and structure.
  • Issues like sub-clusters or overlapping groups can lead to incorrect \(k\) estimates or misleading assignments.

Heatmaps

Motivation

  • Dendrograms and cluster assignments do not show the underlying data.
  • Heatmaps provide a visual check and help interpret clusters.

Heatmap Details

  • Plot of raw or standardized numeric variables
  • Can represent:
    • Data matrices of observations
    • Summary matrices (e.g., means, correlations)
  • Grid of rectangles colored by numeric values

Heatmaps with Clustering

  • Hierarchical clustering can be applied to rows and/or columns.
  • Helps investigate patterns among:
    • Observations (rows)
    • Variables (columns)
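Seaborn's `clustermap` combines both ideas in one figure, clustering rows and columns and coloring the (optionally standardized) values; a minimal sketch on random data:

```python
import numpy as np
import pandas as pd
import seaborn as sns

rng = np.random.default_rng(0)
df = pd.DataFrame(rng.normal(size=(30, 6)),
                  columns=[f"var{i}" for i in range(6)])

# Hierarchical clustering on both rows and columns; z_score=1
# standardizes each column before plotting.
sns.clustermap(df, z_score=1, method="average",
               metric="euclidean", cmap="vlag")
```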

Clustering in Practice

  • Goal: identify groups that are closer together (distance or correlation).
  • Definition of ‘observation’ varies depending on context.

Possible uses:

  • Clustering correlation statistics
  • Clustering variables (i.e., treating variables as the observations by transposing the data matrix).
    • Variables often cluster based on strong positive correlations.
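A sketch of clustering variables through a correlation-based dissimilarity (hypothetical data; using \(1 - r\) makes strongly positively correlated variables “close”):

```python
import numpy as np
from scipy.cluster.hierarchy import linkage, dendrogram
from scipy.spatial.distance import squareform

rng = np.random.default_rng(1)
X = rng.normal(size=(100, 8))        # rows = observations, cols = variables

corr = np.corrcoef(X, rowvar=False)  # 8 x 8 correlation matrix
dist = 1.0 - corr                    # dissimilarity: high r -> small distance
np.fill_diagonal(dist, 0.0)

Z = linkage(squareform(dist, checks=False), method="average")
dendrogram(Z, labels=[f"var{i}" for i in range(8)])
```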