Clustering
Objectives
- Understand the goal of clustering as an unsupervised learning method.
- Learn how k-means and hierarchical clustering assign observations to groups.
- Explore linkage methods and distance metrics used in clustering.
- Evaluate clustering results using cluster validity metrics.
- Visualize clustering output using dendrograms and heatmaps.
Motivation for Clustering
When plotting two numerical variables:
- Are all observations part of one homogeneous group?
- Or do some form distinct subgroups that behave more similarly to each other?
Key points:
- Clustering is an unsupervised learning method.
- Each data point is assigned to a group (cluster).
- Points within the same cluster are similar to one another.
- Points in different clusters are dissimilar from one another.
Heterogeneous Populations
- Most populations are heterogeneous.
- Benefits of identifying subpopulations:
  - Deeper EDA and sanity checks (for classification problems)
  - Unique trends may require unique models
  - Better decision-making in downstream tasks
k-Means Clustering
Metric for Cluster Fit
Assuming observations are already assigned to clusters, there is a trade-off between:
- Within-cluster variability
- Between-cluster variability
Within-cluster metric:
- Let \(K\) be the number of clusters.
- Let \(C_k\) be the set of observations assigned to cluster \(k\), for \(k = 1, \dots, K\).
- Define within-cluster variability \(W(C_k)\) as: \[ W(C_k) = \frac{1}{|C_k|} \sum_{i,j \in C_k} d(x_i, x_j)^2 \] where \(|C_k|\) is the number of observations in cluster \(k\) and \(d(x_i, x_j)\) is the Euclidean distance between two points in the same cluster.
- The goal is to minimize the total within-cluster variability, \(\sum_{k=1}^{K} W(C_k)\).
- However, within-cluster variability can always be reduced by increasing \(K\) (it is zero when every observation is its own cluster), so the key question is: how many clusters are actually present?
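As a concrete sketch (the data matrix and labels below are hypothetical), \(W(C_k)\) can be computed directly with NumPy and SciPy:

```python
import numpy as np
from scipy.spatial.distance import pdist, squareform

def within_cluster_variability(X, labels):
    """W(C_k) for each cluster: sum of squared pairwise Euclidean
    distances among the cluster's observations, scaled by 1/|C_k|.
    The full distance matrix counts each unordered pair twice,
    matching the sum over i, j in C_k."""
    W = {}
    for k in np.unique(labels):
        members = X[labels == k]
        d = squareform(pdist(members))  # pairwise Euclidean distances
        W[k] = (d ** 2).sum() / len(members)
    return W

# Hypothetical example: two tight groups of three points each
X = np.array([[0.0, 0.0], [0.1, 0.2], [0.2, 0.1],
              [5.0, 5.0], [5.1, 5.2], [4.9, 5.1]])
labels = np.array([0, 0, 0, 1, 1, 1])
print(within_cluster_variability(X, labels))
```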
Challenges in Minimizing \(W(C_k)\)
- There are a large number of possible cluster assignments.
- The number of possible assignments grows rapidly with \(K\) (the number of clusters) and \(n\) (the sample size).
- k-means clustering may converge to a local minimum rather than the global best solution.
- The number of clusters \(K\) must be specified upfront to initialize the algorithm.
k-Means Algorithm
1. Randomly assign observations to cluster groups.
2. Reassignment:
   - Compute the centroid of each group (i.e., the mean of the assigned observations).
   - Reassign each observation to the nearest centroid.
3. Repeat step 2 until within-cluster variability converges.
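A minimal NumPy sketch of these steps (illustrative only: it assumes no cluster becomes empty during reassignment, and in practice a library routine such as sklearn.cluster.KMeans would be used):

```python
import numpy as np

def kmeans(X, k, max_iter=100, seed=0):
    rng = np.random.default_rng(seed)
    # Step 1: randomly assign observations to k cluster groups
    labels = rng.integers(0, k, size=len(X))
    for _ in range(max_iter):
        # Step 2a: centroid of each group (mean of assigned observations)
        centroids = np.array([X[labels == j].mean(axis=0) for j in range(k)])
        # Step 2b: reassign each observation to the nearest centroid
        dists = np.linalg.norm(X[:, None, :] - centroids[None, :, :], axis=2)
        new_labels = dists.argmin(axis=1)
        if np.array_equal(new_labels, labels):
            break  # assignments stable: within-cluster variability converged
        labels = new_labels
    return labels, centroids
```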
Defining “Close”
Distance metrics:
- Euclidean distance
- Mahalanobis distance
- Manhattan distance
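All three are available in scipy.spatial.distance; a brief sketch with hypothetical vectors (Manhattan distance is called cityblock there):

```python
import numpy as np
from scipy.spatial.distance import euclidean, cityblock, mahalanobis

x, y = np.array([1.0, 2.0]), np.array([4.0, 6.0])
print(euclidean(x, y))   # straight-line distance: 5.0
print(cityblock(x, y))   # Manhattan distance: 3 + 4 = 7.0

# Mahalanobis distance accounts for the covariance of the data,
# so it needs the inverse covariance matrix of a reference sample
sample = np.random.default_rng(0).normal(size=(200, 2))
VI = np.linalg.inv(np.cov(sample, rowvar=False))
print(mahalanobis(x, y, VI))
```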
Hierarchical Clustering
- Does not require specifying \(K\) in advance.
- Begins with each observation as its own cluster.
- Fuses observations to create progressively coarser clusters.
- Produces a dendrogram to visualize the clustering process.
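A short sketch with SciPy (the two simulated subgroups are hypothetical):

```python
import numpy as np
import matplotlib.pyplot as plt
from scipy.cluster.hierarchy import linkage, dendrogram

rng = np.random.default_rng(0)
X = np.vstack([rng.normal(0, 0.5, size=(10, 2)),   # one subgroup
               rng.normal(4, 0.5, size=(10, 2))])  # another subgroup

Z = linkage(X, method="complete", metric="euclidean")
dendrogram(Z)  # fusion heights show when observations/clusters merge
plt.ylabel("Dissimilarity at fusion")
plt.show()
```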
Linkage
- A distance (dissimilarity) metric is used between observations.
- Clusters can fuse:
  - Two points
  - One cluster and one point
  - Two clusters
- The dissimilarity metric between clusters is called linkage.
Common Linkage Types
- Complete linkage uses the maximum pairwise distance.
- Single linkage uses the minimum pairwise distance.
- Average linkage uses the mean of all pairwise distances.
- Centroid linkage uses distance between cluster centroids.
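In SciPy these correspond to the method argument of scipy.cluster.hierarchy.linkage; a quick comparison on hypothetical data (note that centroid linkage requires Euclidean distance, the default metric):

```python
import numpy as np
from scipy.cluster.hierarchy import linkage

X = np.random.default_rng(1).normal(size=(20, 2))

# Same data, four linkage rules; the fusion heights (third column of Z) differ
for method in ["complete", "single", "average", "centroid"]:
    Z = linkage(X, method=method)
    print(method, "final fusion height:", round(Z[-1, 2], 3))
```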
Complete Linkage
- Maximal inter-cluster dissimilarity
- Compute all pairwise distances between cluster A and cluster B
- Record the largest pairwise distance between the clusters
- Merge clusters with the smallest maximum distance (i.e., smallest dissimilarity)
Single Linkage
- Minimal inter-cluster dissimilarity
- Compute all pairwise distances between cluster A and cluster B
- Record the smallest pairwise distance between the clusters
- Merge clusters with the smallest minimum distance (i.e., smallest dissimilarity)
- Can produce trailing dendrograms in which many single observations fuse onto existing clusters one at a time (chaining)
Average Linkage
- Mean inter-cluster dissimilarity
- Compute all pairwise distances between cluster A and cluster B
- Record the mean of these pairwise distances
- Repeat for all available pairs of clusters
- Merge the pair of clusters with the smallest average distance
Centroid Linkage
- Inter-cluster dissimilarity defined as the distance between cluster centroids
- Compute the centroid of each cluster
- Merge clusters whose centroids are closest (i.e., have the smallest distance)
Considerations for Either Tool
Key Considerations
- Which dissimilarity metric? (distance or correlation)
- Should variables be standardized?
- Which linkage type? (for hierarchical)
- How many clusters?
Correlation vs. Distance
- Observations may be far apart by distance but close in terms of correlation.
- Example (shopper data):
  - Distance metric: separates heavy vs. light shoppers
  - Correlation metric: groups shoppers buying similar items
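One common construction (an assumption here; the section does not fix a formula) treats \(1 - \text{correlation}\) between observations as the dissimilarity, which scipy.spatial.distance.pdist supports directly:

```python
import numpy as np
from scipy.spatial.distance import pdist
from scipy.cluster.hierarchy import linkage

rng = np.random.default_rng(0)
baskets = rng.poisson(5, size=(8, 12)).astype(float)  # hypothetical shoppers x items

# "correlation" gives 1 - Pearson correlation between rows, so shoppers
# with similar buying patterns are close even if their volumes differ
D = pdist(baskets, metric="correlation")
Z = linkage(D, method="average")
```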
Standardization
- Without standardization: variables with large scales dominate clustering.
- With standardization: each variable contributes equally.
- Recommendation: Standardize unless there is a strong reason not to.
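A minimal sketch of z-score standardization before clustering (the variables are hypothetical; scikit-learn's StandardScaler does the same):

```python
import numpy as np

rng = np.random.default_rng(0)
income = rng.normal(60_000, 15_000, size=100)  # dollars: large scale
age = rng.normal(40, 12, size=100)             # years: small scale
X = np.column_stack([income, age])

# z-score each column so no single variable dominates the distances
X_std = (X - X.mean(axis=0)) / X.std(axis=0)
```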
Choice of Linkage
- Results vary depending on the linkage method.
- Complete and average linkage:
  - Commonly used
  - Produce easier-to-read dendrograms
- Single linkage:
  - Useful for identifying outliers
  - May form loosely connected clusters (i.e., observations within a cluster may be less similar than desired)
Choosing the Number of Clusters
- Apply clustering with different values of \(K\).
- Compute cluster validity metrics.
- Plot each metric against cluster size to guide the decision.
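A sketch of this workflow using scikit-learn, pairing k-means with the average silhouette as the validity metric (the simulated data are hypothetical):

```python
import numpy as np
import matplotlib.pyplot as plt
from sklearn.cluster import KMeans
from sklearn.metrics import silhouette_score

rng = np.random.default_rng(0)
X = np.vstack([rng.normal(c, 0.6, size=(30, 2)) for c in (0, 4, 8)])

ks = range(2, 9)
scores = [silhouette_score(X, KMeans(n_clusters=k, n_init=10,
                                     random_state=0).fit_predict(X))
          for k in ks]

plt.plot(list(ks), scores, marker="o")
plt.xlabel("Number of clusters")
plt.ylabel("Average silhouette")
plt.show()
```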
Cluster Validity Metrics
Internal Cluster Validity
- No ground truth available
- Estimate within-cluster vs. between-cluster variation
Common Metrics
- Silhouette statistic
- Dunn index
- Cubic clustering criterion (SAS)
Silhouette Statistic
For each observation \(i\):
- Calculate the average dissimilarity between \(i\) and the other observations within its cluster ("within").
- Calculate the minimum, across the other clusters, of the average dissimilarity between \(i\) and that cluster's observations ("between").
- Silhouette value for observation \(i\): \[ \text{Silhouette}_i = \frac{\text{between} - \text{within}}{\max(\text{between}, \text{within})} \]
- Average silhouette values summarize overall clustering quality.
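scikit-learn computes these per-observation values with silhouette_samples; a brief sketch on hypothetical data:

```python
import numpy as np
from sklearn.cluster import KMeans
from sklearn.metrics import silhouette_samples

rng = np.random.default_rng(0)
X = np.vstack([rng.normal(0, 0.5, size=(25, 2)),
               rng.normal(3, 0.5, size=(25, 2))])
labels = KMeans(n_clusters=2, n_init=10, random_state=0).fit_predict(X)

s = silhouette_samples(X, labels)  # one value per observation, in [-1, 1]
print("mean silhouette:", s.mean())
print("worst-fitting observations:", np.argsort(s)[:3])
```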
Dunn Index
- Ratio: \[ \frac{\text{minimum distance between points in different clusters}}{\text{maximum distance between points within the same cluster}} \]
- Higher Dunn index indicates stronger cluster separation.
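A minimal implementation of this ratio (assuming a data matrix X and integer labels; not optimized for large data):

```python
import numpy as np
from scipy.spatial.distance import cdist, pdist

def dunn_index(X, labels):
    clusters = [X[labels == k] for k in np.unique(labels)]
    # Numerator: smallest distance between points in different clusters
    min_between = min(cdist(a, b).min()
                      for i, a in enumerate(clusters)
                      for b in clusters[i + 1:])
    # Denominator: largest distance between points in the same cluster
    max_within = max(pdist(c).max() for c in clusters if len(c) > 1)
    return min_between / max_within
```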
Additional Metrics
- Davies-Bouldin index (DB)
- Maulik-Bandyopadhyay index (MB)
- Saitta score function (SF)
The Clustering Challenge
- Results depend on choices made (distance metric, standardization, linkage type).
- No single method is perfect for determining the number of clusters.
- Avoid relying on any single clustering result as absolute truth.
- Examine multiple clustering solutions to identify general patterns.
- Simulation studies show clustering performance depends on cluster separation, noise, and structure.
- Issues like sub-clusters or overlapping groups can lead to incorrect \(k\) estimates or misleading assignments.
Heatmaps
Motivation
- Dendrograms and cluster assignments do not show the underlying data.
- Heatmaps provide a visual check and help interpret clusters.
Heatmap Details
- Plot of raw or standardized numeric variables
- Can represent:
  - Data matrices of observations
  - Summary matrices (e.g., means, correlations)
- Grid of rectangles colored by numeric values
Heatmaps with Clustering
- Hierarchical clustering can be applied to rows and/or columns.
- Helps investigate patterns among:
- Observations (rows)
- Variables (columns)
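seaborn's clustermap applies hierarchical clustering to rows and columns and draws the reordered heatmap in one call; a sketch on hypothetical data:

```python
import numpy as np
import seaborn as sns

rng = np.random.default_rng(0)
data = rng.normal(size=(20, 8))  # hypothetical observations x variables

# Cluster both rows and columns; z_score=1 standardizes each column first
sns.clustermap(data, method="average", metric="euclidean",
               z_score=1, cmap="vlag")
```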
Clustering in Practice
- Goal: identify groups that are closer together (distance or correlation).
- Definition of ‘observation’ varies depending on context.
Possible uses:
- Clustering correlation statistics
- Clustering variables (i.e., treating variables as the observations by transposing the data matrix); see the sketch after this list.
- Variables often cluster based on strong positive correlations.
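A sketch of variable clustering via the transposed data matrix, using correlation-based dissimilarity (the data and the induced correlation are hypothetical):

```python
import numpy as np
from scipy.spatial.distance import pdist
from scipy.cluster.hierarchy import linkage, fcluster

rng = np.random.default_rng(0)
X = rng.normal(size=(100, 6))                 # observations x variables
X[:, 3] = X[:, 0] + rng.normal(0, 0.1, 100)   # variables 0 and 3 correlate strongly

# Transpose so variables play the role of observations, then cluster
# on 1 - correlation; strongly correlated variables end up together
D = pdist(X.T, metric="correlation")
Z = linkage(D, method="average")
print(fcluster(Z, t=2, criterion="maxclust"))  # variable group labels
```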