Principal Component Analysis

Objectives

  • Introduce Principal Component Analysis (PCA) as a tool for unsupervised data analysis and dimensionality reduction.
  • Describe the difference between supervised and unsupervised analysis.
  • Explain the motivation for reducing data dimensionality.
  • Explore practical applications of PCA.
  • Review the technical foundations of PCA.
  • Introduce nonlinear dimensionality reduction methods (t-SNE, MDS).

What Is PCA?

Supervised vs. Unsupervised Learning

  • Supervised learning: Both predictors and the response variable are provided. Models aim to:
    • Explain relationships.
    • Perform hypothesis testing and build confidence intervals.
    • Make predictions.
  • Unsupervised learning: Only predictors are provided.
    • No response variable is used.
    • Goals vary; interpretation is often subjective.
    • Not used for prediction, nor for the kind of explanation that linear and logistic regression provide.
    • Typically more challenging than supervised methods.
    • No standard or easy way to validate results on future data.
    • Common applications include:
      • Exploratory data analysis (EDA) to support supervised analysis.
      • Identifying subgroups or patterns.
      • Improving computational efficiency.
      • Simplifying downstream prediction tasks.

Unsupervised Tools

Common techniques include:

  • Data reduction:
    • Principal Component Analysis (PCA)
    • t-SNE
    • Multidimensional Scaling (MDS)
  • Clustering:
    • Hierarchical
    • k-means clustering

PCA and Data Reduction

Motivation for Data Reduction

  • When there are too many variables, it becomes difficult to analyze, visualize, or model the data.
  • The goal of PCA is to create a smaller set of variables that preserves as much information as possible.
  • What counts as “information” depends on the algorithm.
    • For PCA, information is defined in terms of variability.

Principal Components

  • PCA creates new variables \(Z_1, Z_2, \dots, Z_p\) called principal components.
  • Each component \(Z_j\) is a linear combination of the original \(X_i\) variables.
    • Each principal component uses different weights (coefficients): \[ \begin{aligned} Z_1 &= \phi_{11}X_1 + \phi_{12}X_2 + \dots + \phi_{1p}X_p \\ Z_2 &= \phi_{21}X_1 + \phi_{22}X_2 + \dots + \phi_{2p}X_p \\ &\;\;\vdots \\ Z_p &= \phi_{p1}X_1 + \phi_{p2}X_2 + \dots + \phi_{pp}X_p \end{aligned} \]
  • PCA always produces the same number of principal components as the original number of variables.
    • So where is the “reduction”?
      • Do we keep only a few?
      • If so, how many? Which ones? Based on what criterion?
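
As a minimal numeric sketch of this idea (made-up data and a hypothetical loading vector, not weights actually computed by PCA), each observation's score on \(Z_1\) is just a weighted sum of its centered variable values:

```python
import numpy as np

# Toy data: n = 5 observations on p = 3 variables (hypothetical values).
X = np.array([[2.0, 1.0, 4.0],
              [3.0, 2.0, 5.0],
              [4.0, 1.5, 6.0],
              [5.0, 3.0, 7.0],
              [6.0, 2.5, 8.0]])

# Hypothetical loading vector (phi_11, phi_12, phi_13), roughly unit length
# as PCA loadings are; chosen here only for illustration.
phi_1 = np.array([0.70, 0.10, 0.70])

# Center each variable, then form Z_1 as the weighted sum of the columns.
X_centered = X - X.mean(axis=0)
Z1 = X_centered @ phi_1   # one score per observation
print(Z1)
```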

Properties of Principal Components

  • Uncorrelated with each other
  • Ordered by variance: \(\text{Var}(Z_1) \ge \text{Var}(Z_2) \ge \dots \ge \text{Var}(Z_p)\)
  • Total variance is preserved: \[ \sum_{i=1}^p \text{Var}(X_i) = \sum_{i=1}^p \text{Var}(Z_i) \]

Reducing Dimensions

  • Because components are ordered by variance:
    • Later components often have very low variance—some essentially zero.
    • These components carry little information and can be dropped.
    • We keep only the first \(k\) components where \(k < p\), and \(p\) is the original number of variables.
  • Total variability is approximately preserved: \[ \sum_{i=1}^p \text{Var}(X_i) \approx \sum_{i=1}^k \text{Var}(Z_i), \quad \text{where } k < p \]

Performing PCA

From Data to Components

  • PCA requires:
    • Centering (and possibly scaling) the data:
      • Centering means subtracting the mean from each variable.
    • Computing either:
      • The covariance matrix (if the data are unscaled)
      • The correlation matrix (if the data are standardized; this is the standard operating procedure)
        • Standardizing (scaling) transforms the data to z-scores, allowing PCA to focus on relative variation rather than absolute scales.
  • From the matrix, extract:
    • Eigenvectors: the weights (loadings) used to compute \(Z_i\)
    • Eigenvalues: the variance of each \(Z_i\)
    • Each eigenvalue has an associated eigenvector. There are \(p\) of each.

Matrix View

  • The eigenvector for \(Z_1\) is \((\phi_{11}, \phi_{12}, \dots, \phi_{1p})\).
  • The eigenvector matrix contains all loading coefficients. Eigenvectors are arranged columnwise.
    • Transpose the loading matrix and multiply it by the centered (or standardized) data matrix, with variables in rows, to obtain \(p\) rows of component scores (one per principal component).
  • Eigenvalues quantify the variance of each principal component. They are often collected into a diagonal matrix (written \(\Lambda\) in the technical details below).
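
A minimal NumPy sketch of the steps above, assuming standardized data so the correlation matrix is decomposed; the data are randomly generated purely for illustration:

```python
import numpy as np

rng = np.random.default_rng(0)
X = rng.normal(size=(100, 4)) @ rng.normal(size=(4, 4))   # toy data: n = 100, p = 4

# Standardize (center and scale), so PCA is based on the correlation matrix.
Z = (X - X.mean(axis=0)) / X.std(axis=0, ddof=1)

R = np.corrcoef(X, rowvar=False)          # p x p correlation matrix

# Eigenvectors = loadings, eigenvalues = variances of the components.
eigenvalues, eigenvectors = np.linalg.eigh(R)
order = np.argsort(eigenvalues)[::-1]     # eigh returns ascending order; flip it
eigenvalues, eigenvectors = eigenvalues[order], eigenvectors[:, order]

# Component scores: standardized data times the loading matrix
# (equivalently, the transposed loading matrix times the data with variables in rows).
scores = Z @ eigenvectors                 # n x p, one column per principal component

print(eigenvalues)                        # Var(Z_1) >= ... >= Var(Z_p)
print(scores.var(axis=0, ddof=1))         # matches the eigenvalues
print(np.round(np.corrcoef(scores, rowvar=False), 8))  # identity: components are uncorrelated
```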

Scree Plots and Component Selection

  • Scree plots:
    • Plot eigenvalues in decreasing order.
    • Plot variance proportions: \(\dfrac{\lambda_i}{\sum_{j=1}^p \lambda_j}\), where \(\lambda_i\) is the \(i\)th eigenvalue.
    • Plot the cumulative proportion of variance.
  • Strategies for reduction:
    • Keep enough principal components (PCs) to explain approximately 80–90% of the variance (cutoff may be arbitrary).
    • Keep the first 3–4 PCs for visualization in EDA.
    • Look for an “elbow” in the scree plot, where additional components contribute little to the explained variance.
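
A sketch of how a scree plot and a cumulative-variance cutoff might be computed with NumPy and matplotlib (the toy data and the 90% threshold are illustrative):

```python
import numpy as np
import matplotlib.pyplot as plt

rng = np.random.default_rng(1)
X = rng.normal(size=(200, 6)) @ rng.normal(size=(6, 6))   # toy correlated data

R = np.corrcoef(X, rowvar=False)
eigenvalues = np.sort(np.linalg.eigvalsh(R))[::-1]        # decreasing order

prop = eigenvalues / eigenvalues.sum()      # proportion of variance per PC
cum_prop = np.cumsum(prop)                  # cumulative proportion

# Scree plot: variance proportions in decreasing order, plus the cumulative curve.
ks = np.arange(1, len(eigenvalues) + 1)
plt.plot(ks, prop, "o-", label="proportion of variance")
plt.plot(ks, cum_prop, "s--", label="cumulative proportion")
plt.axhline(0.9, color="gray", linestyle=":", label="90% cutoff")
plt.xlabel("principal component")
plt.ylabel("proportion of variance explained")
plt.legend()
plt.show()

# Keep the smallest k whose cumulative proportion reaches ~90%.
k = int(np.searchsorted(cum_prop, 0.90) + 1)
print("keep k =", k)
```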

Applications of PCA

Simplify Regression and Classification

  • Apply PCA to the predictors only.
  • The resulting predictors (PCs) are uncorrelated—no multicollinearity.
  • Reduces dimensionality (i.e., fewer predictors, more compact model).
  • Important notes:
    • PCA is unsupervised!
    • The PCs may not align better with the response variable than the original predictors.
    • PCA should be viewed as a preprocessing step, not a predictive tool.
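
A sketch of how this preprocessing step is commonly wired up with scikit-learn, assuming that library is available; the dataset and the choice of five components are purely illustrative:

```python
from sklearn.datasets import load_diabetes
from sklearn.decomposition import PCA
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import cross_val_score
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

X, y = load_diabetes(return_X_y=True)

# PCA is fit on the predictors only; the response never enters the decomposition.
# Keeping 5 components is an arbitrary illustrative choice.
model = make_pipeline(StandardScaler(), PCA(n_components=5), LinearRegression())

scores = cross_val_score(model, X, y, cv=5, scoring="r2")
print(scores.mean())
```

Wrapping the scaler, PCA, and regression in one pipeline keeps the PCA fit inside each cross-validation fold, so the preprocessing does not leak information across folds.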

PCA for EDA in Classification

Purpose

  • Gain insight into whether the predictors might work well—this is not for model fitting.
  • Consider using feature selection instead:
    • If the project requires using original predictors (e.g., for interpretability)
    • When the number of predictors is very large

Strategy

  • Plot PCs to visualize group separation.
  • If clear separation exists, the predictors may be promising.
  • If not:
    • Predictors may lack signal or be irrelevant.
    • Alternatively, nonlinear methods may be needed:
      • Try nonlinear data reduction or increase model complexity.
  • Use this as a sanity check: Is good prediction accuracy likely from these variables?
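
A minimal sketch of this kind of sanity check using scikit-learn and matplotlib; the iris data stand in for whatever predictors and class labels are at hand:

```python
import matplotlib.pyplot as plt
from sklearn.datasets import load_iris
from sklearn.decomposition import PCA
from sklearn.preprocessing import StandardScaler

X, y = load_iris(return_X_y=True)

# Standardize, then project onto the first two PCs for visualization only.
Z = PCA(n_components=2).fit_transform(StandardScaler().fit_transform(X))

# Color points by class label; visible separation suggests the predictors carry signal.
plt.scatter(Z[:, 0], Z[:, 1], c=y, cmap="viridis", s=20)
plt.xlabel("PC1")
plt.ylabel("PC2")
plt.show()
```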

Image Compression

  • Treat pixels as the variables (for example, each image in a collection, or each row of a single image, is an observation).
  • PCA compresses the data: each image is stored as a small number of component scores instead of all of its pixel values.
  • Reconstruct an approximation of the image from that limited number of components.
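
A small sketch using scikit-learn's digits data, where each 8x8 image is flattened into 64 pixel variables; keeping 10 components is an arbitrary illustrative choice:

```python
import numpy as np
from sklearn.datasets import load_digits
from sklearn.decomposition import PCA

# Each 8x8 digit image is flattened into 64 pixel variables.
X = load_digits().data                      # shape (1797, 64)

k = 10                                      # illustrative number of components to keep
pca = PCA(n_components=k).fit(X)

# Compress to k scores per image, then reconstruct an approximation.
scores = pca.transform(X)                   # shape (1797, k)
X_hat = pca.inverse_transform(scores)       # shape (1797, 64)

first_image = X_hat[0].reshape(8, 8)        # approximate reconstruction of the first digit
print(np.mean((X - X_hat) ** 2))            # average reconstruction error
```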

Common Pitfalls

  • Never include the response variable in PCA.
  • Interpreting component coefficients can be challenging.
    • Strategies exist, such as examining the eigenvectors to summarize key contributions.
  • PCA is sensitive to outliers.
  • Categorical variables can cause misleading results, especially with arbitrary numeric coding.
  • PCA preserves only variance:
    • Allows visualization of the “global” structure (e.g., linear separation)
    • But proximity in PC space does not guarantee proximity in the original space

Nonlinear Data Reduction

Beyond Linear PCA

  • Linear PCA maps data to a linear plane:
    • Reduces the dimensionality of the data
    • Projects observations onto the “closest” plane or hyperplane
  • Nonlinear methods allow curved manifolds:
    • Project points onto lower-dimensional, nonlinear surfaces
    • Require careful selection of parameters and algorithms
    • Often used strictly for EDA, not as inputs for predictive models

Common Nonlinear Reduction Techniques

  • Kernel PCA
    • Uses kernel functions (e.g., polynomial, radial basis function (RBF))
    • Performs PCA in a transformed feature space
    • Can be used in predictive models
  • Multidimensional Scaling (MDS)
    • Preserves distances or similarities between points
    • Typically used for 2D/3D visualizations
    • Not directly used in predictive models
  • t-SNE (t-Distributed Stochastic Neighbor Embedding)
    • Nonparametric
    • Captures local structure using probability models (i.e., models distances between points probabilistically)
      • Seeks a lower-dimensional map that preserves the distribution of pairwise distances
    • Not used in predictive models
    • Results can vary based on hyperparameters like perplexity
    • Sometimes PCA is applied first to reduce computation time
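
A rough sketch of these three methods with scikit-learn, assuming that library is available; the digits data, the subsample size, and the kernel and perplexity settings are all illustrative choices:

```python
import matplotlib.pyplot as plt
from sklearn.datasets import load_digits
from sklearn.decomposition import KernelPCA, PCA
from sklearn.manifold import MDS, TSNE

X, y = load_digits(return_X_y=True)
X, y = X[:500], y[:500]                     # subsample to keep MDS and t-SNE fast

# Kernel PCA with an RBF kernel: PCA carried out in a transformed feature space.
# gamma (the kernel width) is an illustrative setting; results depend on it.
Z_kpca = KernelPCA(n_components=2, kernel="rbf", gamma=1e-3).fit_transform(X)

# t-SNE: PCA applied first (here to 30 dimensions) to cut computation time;
# results depend on hyperparameters such as perplexity.
X_reduced = PCA(n_components=30, random_state=0).fit_transform(X)
Z_tsne = TSNE(n_components=2, perplexity=30, random_state=0).fit_transform(X_reduced)

# MDS: preserves pairwise distances; typically used only for 2-D/3-D visualization.
Z_mds = MDS(n_components=2, random_state=0).fit_transform(X)

for Z, title in [(Z_kpca, "Kernel PCA"), (Z_tsne, "t-SNE"), (Z_mds, "MDS")]:
    plt.figure()
    plt.scatter(Z[:, 0], Z[:, 1], c=y, s=10)
    plt.title(title)
plt.show()
```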

PCA: Technical Details

Matrix Decomposition

  • Assume multiple variables follow a multivariate distribution.
  • Principal components are:
    • Uncorrelated
    • Have variances equal to their corresponding eigenvalues
    • Computed using eigenvectors
      • These eigenvectors provide the coefficients for creating the principal components as linear combinations of the original variables.
  • Decompose \(\Sigma\) or \(R\): \[\Sigma = \phi \Lambda \phi^\top \] where:
    • \(\phi\) = matrix of eigenvectors
    • \(\Lambda\) = diagonal matrix of eigenvalues
  • Key property: the eigenvector matrix is orthonormal, so its transpose is its inverse: \[ \phi^\top \phi = \phi \phi^\top = I \]
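
A quick NumPy check of this decomposition on a toy covariance matrix (randomly generated data, purely for illustration):

```python
import numpy as np

rng = np.random.default_rng(2)
X = rng.normal(size=(500, 4)) @ rng.normal(size=(4, 4))   # toy correlated data

Sigma = np.cov(X, rowvar=False)                 # p x p covariance matrix

eigenvalues, Phi = np.linalg.eigh(Sigma)        # eigenvalues and eigenvector matrix phi
Lam = np.diag(eigenvalues)                      # diagonal matrix Lambda

# Sigma = phi Lambda phi^T, and phi is orthonormal: phi^T phi = phi phi^T = I.
print(np.allclose(Sigma, Phi @ Lam @ Phi.T))
print(np.allclose(Phi.T @ Phi, np.eye(4)))
print(np.allclose(Phi @ Phi.T, np.eye(4)))
```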