Classification

Objectives

  • Understand the distinction between regression and classification.
  • Learn about classification boundaries.
  • Explore classification methods:
    • k-Nearest Neighbors (KNN)
    • Linear and Quadratic Discriminant Analysis (LDA/QDA)
  • Evaluate classification performance using error metrics.

Regression vs. Classification

Main distinction: The response variable in classification is categorical.

  • Binary: Two outcomes (e.g., Yes/No)
  • Multiclass: More than two categories

Why Not Use MLR?

  • Coding the categories as numbers and fitting MLR does not work:
    • The fitted line does not model the trend in the response well.
    • The discrete response violates the MLR assumptions (e.g., normally distributed errors with constant variance).

Predictive Models for Classification

Classification models work in two steps:

  1. Predict the probability that a categorical response will occur, given a set of explanatory variables: \(P(Y = \text{Default} \mid X)\)
  2. Convert that probability into a classification decision based on a threshold.

Comparison

  • For continuous responses: \(Y = f(X)\)
  • For categorical responses: \(P(Y = \text{Default} \mid X) = f(X)\)
    • The model predicts probabilities, not the class itself.
    • Predictions stay between 0 and 1 (unlike MLR, which can go below 0 or above 1).

Why Predict Probabilities?

  • Predicted probabilities give us finer resolution than a simple yes/no.
    • Example: A weather forecast predicting 60% chance of rain is more informative than just “yes” or “no”.
  • Subtle relationships between predictors and the response lead to variability in predicted probabilities, creating a gray area.
    • For instance, for two different observations \(x_1\) and \(x_2\):
      • \(P(Y = \text{Default} \mid X = x_1) = 0.82\)
      • \(P(Y = \text{Default} \mid X = x_2) = 0.52\)
    • These two predictions might both get classified as “Default”, even though the model sees them very differently.

Classification as a Decision Rule

  • Classification forces a definitive decision based on the predicted probability: \[ Y = \begin{cases} \text{Default}, & \text{if } f(X) \geq \text{threshold} \\ \text{Not}, & \text{otherwise} \end{cases} \]
  • This means classification doesn’t distinguish between a predicted probability of 0.52 and 0.82 if the threshold is 0.5.
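A minimal sketch of this decision rule in Python, to make the loss of resolution concrete. The function name `classify` and the label strings are illustrative choices, not part of any particular library.

```python
def classify(prob_default, threshold=0.5):
    """Turn a predicted probability of 'Default' into a class label."""
    return "Default" if prob_default >= threshold else "Not"


# The two probabilities from the text receive the same label at threshold 0.5.
for p in (0.82, 0.52):
    print(p, "->", classify(p))

# Raising the threshold (here to a hypothetical 0.6) separates them.
print(0.52, "->", classify(0.52, threshold=0.6))
```

Once the label is assigned, the information that one case was far more certain than the other is gone, which is why the probabilities themselves are worth keeping.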

Considerations When Classifying

  • The cost of getting the prediction wrong
  • The prevalence of the classes (how often each class occurs in the data)

Important Realizations

  • Assessing how well a model performs involves two separate questions:
    • How well does the model predict the true probability of the response?
    • How well does your classification rule behave given the costs and context of your problem?
  • Metrics commonly reported in software:
    • Often assess classification decisions rather than predicted probabilities
    • Are calculated without explicit consideration of the cost of mistakes
    • Are often used to assess model fit, even though they may not reflect probability accuracy
  • It’s up to you to assess whether your decision rule fits your specific context, especially when the costs of different types of errors are not equal.

Terminology

  • Positive class (+): The outcome for which you want to predict the probability
  • Negative class (−): The other possible outcome (in a binary response)
  • These are mathematically arbitrary but can be practically important depending on the question you’re answering.

Classification Boundaries

Parametric Models

These models were designed from the start to produce predicted probabilities:

  • Logistic Regression
  • Linear Discriminant Analysis (LDA)
  • Quadratic Discriminant Analysis (QDA)
  • Naive Bayes

Classification Rule

To convert a predictive probability model into a classification rule, apply a threshold (commonly 0.5): \[ \hat{P}(Y = + \mid X) \geq 0.5 \Rightarrow \text{Classify as } + \]

  • The threshold can be adjusted to account for the cost of making a mistake.

Nonparametric Models

These models originally focused on making classification decisions (not probabilities):

  • Decision Trees
  • Random Forests
  • k-Nearest Neighbors (KNN)

To help standardize functions and workflows, software retroactively added predicted probabilities.

KNN Classification Rule

  • Record the response values of the \(k\) nearest neighbors to \(x_{\text{new}}\)
  • Estimate: \(\hat{P}(Y = + \mid x_{\text{new}}) = \frac{\text{\# of } + \text{ neighbors}}{k}\)
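The two KNN steps above can be sketched in plain Python as follows. This is an illustration under assumptions: Euclidean distance is used (the notes do not name a distance), ties are not handled specially, and the training data are made up.

```python
import math


def knn_positive_probability(train_X, train_y, x_new, k):
    """Estimate P(Y = + | x_new) as the share of '+' labels among the k nearest points."""
    # Pair each training label with its Euclidean distance to x_new.
    distances = [(math.dist(x, x_new), y) for x, y in zip(train_X, train_y)]
    # Sort by distance only and keep the labels of the k closest points.
    distances.sort(key=lambda pair: pair[0])
    nearest_labels = [y for _, y in distances[:k]]
    return nearest_labels.count("+") / k


# Hypothetical training data with two numeric predictors.
train_X = [(1.0, 1.0), (1.5, 2.0), (3.0, 4.0), (5.0, 7.0), (3.5, 5.0), (4.5, 5.0)]
train_y = ["-", "-", "+", "+", "+", "-"]

p_hat = knn_positive_probability(train_X, train_y, x_new=(3.0, 4.5), k=3)
print(round(p_hat, 3))  # two of the three nearest neighbors are '+'
```

Note that with `k` neighbors the estimate can only take the values 0, 1/k, 2/k, ..., 1, so small `k` gives coarse probabilities.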

Classification Boundary

  • Every model, once converted to a classification rule, defines a classification boundary.
  • Boundaries help visualize:
    • Parametric vs. nonparametric behavior
    • Simpler vs. more complex models
    • Sensitivity to changes in thresholds

Discriminant Analysis (LDA & QDA)

Overview

  • Discriminant analysis is a parametric approach for classification.
  • Applies only to numeric predictors.
  • Assumes predictors follow a multivariate normal (MVN) distribution within each class.

Common Types

  • Linear Discriminant Analysis (LDA):
    • Decision boundaries are linear (hyperplanes).
  • Quadratic Discriminant Analysis (QDA):
    • Decision boundaries are quadratic (conic sections when there are two predictors).

Assumptions

In discriminant analysis, the assumptions are on the predictors (not the error terms as in MLR).

  • \(X_+ = (X_1, X_2, \dots, X_p)\) is a set of \(p\) numeric predictors for the positive response class.
  • \(X_- = (X_1, X_2, \dots, X_p)\) is a set of the same \(p\) numeric predictors for the negative response class.

LDA

  • \(X_+ \sim MVN(\mu_+, \Sigma), \quad X_- \sim MVN(\mu_-, \Sigma)\)
  • Equal variance-covariance matrices (same \(\Sigma\)), i.e., the predictors in each group should have the same variance and correlation.
  • Allows different means, i.e., the predictor sets can have different mean vectors for each response. This is desirable for separation between the classes.

QDA

  • \(X_+ \sim MVN(\mu_+, \Sigma_+), \quad X_- \sim MVN(\mu_-, \Sigma_-)\)
  • Covariance matrices differ by class (relaxes assumptions, i.e., allows different \(\Sigma\)s).
  • More flexibility, but more parameters to estimate.
  • Also allows different means.

Technical Insights

  • LDA/QDA uses Bayes’ theorem to compute its predicted probabilities.

Bayes’ Theorem & LDA/QDA

Discriminant analysis (LDA/QDA) uses Bayes’ theorem to calculate predicted probabilities:

Bayes’ theorem for binary classification: \[ P(+ \mid X) = \frac{P(X \mid +)P(+)}{P(X \mid +)P(+) + P(X \mid -)P(-)} \]

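Bayes’ theorem under the MVN assumptions, shown for LDA (here \(MVN(\mu, \Sigma)\) stands for the MVN density evaluated at \(X\); QDA replaces \(\Sigma\) with \(\Sigma_+\) and \(\Sigma_-\)): \[ P(+ \mid X) = \frac{MVN(\mu_+, \Sigma)P(+)}{MVN(\mu_+, \Sigma)P(+) + MVN(\mu_-, \Sigma)P(-)} \]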

  • \(P(+)\) and \(P(-)\) are prior probabilities (i.e., class prevalence).
  • A threshold of 0.5 yields a linear (LDA) or quadratic (QDA) classification boundary.
  • Outliers and class imbalance affect boundaries.
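To make the formula concrete, here is a simplified sketch with a single numeric predictor, so the MVN density reduces to a univariate normal. All parameter values are made up for illustration; in practice the means, standard deviations, and priors would be estimated from training data.

```python
import math


def normal_pdf(x, mean, sd):
    """Density of a univariate normal distribution at x."""
    z = (x - mean) / sd
    return math.exp(-0.5 * z * z) / (sd * math.sqrt(2 * math.pi))


def posterior_positive(x, mean_pos, mean_neg, sd_pos, sd_neg, prior_pos):
    """P(+ | x) from Bayes' theorem with normal class-conditional densities."""
    numerator = normal_pdf(x, mean_pos, sd_pos) * prior_pos
    denominator = numerator + normal_pdf(x, mean_neg, sd_neg) * (1 - prior_pos)
    return numerator / denominator


# Equal standard deviations mimic LDA; unequal ones would mimic QDA.
# With equal priors, the midpoint between the class means is the 0.5 boundary.
print(posterior_positive(1.0, mean_pos=2.0, mean_neg=0.0, sd_pos=1.0, sd_neg=1.0, prior_pos=0.5))

# With a rarer positive class, the same x is now less likely to be positive,
# so the 0.5 boundary moves toward the positive-class mean.
print(posterior_positive(1.0, mean_pos=2.0, mean_neg=0.0, sd_pos=1.0, sd_neg=1.0, prior_pos=0.2))
```

The second call illustrates how unequal priors shift the boundary even when the class densities are unchanged.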

Key Insights

Bayes’ formula is fully specified using:

  • Prior probabilities
  • Predictor value \(X\)
  • Class-specific means and covariance matrices
  • All estimated from the training data

When using a threshold (e.g., 0.5), LDA’s boundary mathematically reduces to a linear boundary.
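The reduction can be sketched with the standard log-odds argument. Writing \(\pi_+ = P(+)\) and \(\pi_- = P(-)\) (shorthand introduced here), a threshold of 0.5 corresponds to \(P(+ \mid X) = P(- \mid X)\), i.e., log-odds of zero. Under the LDA assumption of a shared \(\Sigma\), the normalizing constants of the two MVN densities cancel:

\[ \log\frac{P(+ \mid X)}{P(- \mid X)} = \log\frac{\pi_+}{\pi_-} - \tfrac{1}{2}(X - \mu_+)^\top \Sigma^{-1} (X - \mu_+) + \tfrac{1}{2}(X - \mu_-)^\top \Sigma^{-1} (X - \mu_-) \]

Expanding both quadratic forms, the \(X^\top \Sigma^{-1} X\) terms also cancel, leaving an expression that is linear in \(X\):

\[ \log\frac{P(+ \mid X)}{P(- \mid X)} = X^\top \Sigma^{-1}(\mu_+ - \mu_-) - \tfrac{1}{2}\left(\mu_+^\top \Sigma^{-1}\mu_+ - \mu_-^\top \Sigma^{-1}\mu_-\right) + \log\frac{\pi_+}{\pi_-} \]

Setting this to zero defines a hyperplane with \(p\) slope coefficients and one intercept. Under QDA, \(\Sigma_+ \neq \Sigma_-\), so the quadratic terms do not cancel and the boundary is quadratic in \(X\).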

Additional Notes on Classification Boundaries

  • Outliers affect sample means and variances \(\rightarrow\) can distort the boundary.
  • Prior probabilities affect boundary placement, especially when priors are unequal.
  • Bayes’ theorem generalizes easily to more than two classes.

Complexity of QDA

  • More parameters: each class gets its own covariance matrix
  • The more predictors, the more parameters:
    • In binary classification with \(p\) predictors:
      • LDA estimates \(p + 1\) parameters for its boundary (an intercept plus one coefficient per predictor). The number of parameters increases at the same rate as the number of predictors.
      • QDA estimates \(\frac{p(p + 3)}{2} + 1\) parameters for its boundary. The number of parameters grows much faster than the number of predictors.
    • Greater risk of overfitting, especially with small sample sizes.
    • Overfitting is mitigated with larger datasets.
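A short sketch that tabulates the two parameter counts given above for a few values of \(p\). The function names are illustrative only.

```python
def lda_boundary_params(p):
    """Parameter count for the LDA boundary with p predictors: p + 1."""
    return p + 1


def qda_boundary_params(p):
    """Parameter count for the QDA boundary with p predictors: p(p + 3)/2 + 1."""
    # p * (p + 3) is always even, so integer division is exact.
    return p * (p + 3) // 2 + 1


print(f"{'p':>4} {'LDA':>6} {'QDA':>6}")
for p in (2, 5, 10, 50):
    print(f"{p:>4} {lda_boundary_params(p):>6} {qda_boundary_params(p):>6}")
```

For \(p = 2\) the QDA count of 6 matches the terms of a general quadratic in two variables: two linear terms, two squared terms, one cross term, and an intercept.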

Pros and Cons of LDA/QDA

Pros

  • Handles correlated predictors (as long as the correlation isn’t perfect)
  • Parametric model that works when logistic regression fails due to complete separation of classes in training data
  • Optimal when assumptions are met

Cons

  • Sensitive to outliers
  • Cannot handle categorical predictors due to the MVN assumption
  • No built-in feature selection
  • Suffers from the curse of dimensionality
    • Poor performance with many irrelevant predictors

Error Metrics for Classification

Two Types of Metrics

  1. Scoring Rules: Evaluate how close the predicted probabilities are to the true outcomes
  2. Decision Rule Metrics: Assess the accuracy of classification decisions made using a threshold

Scoring Rules

These assess the probability estimates of your model, not just the final classification.

Log Loss (a.k.a. Cross-Entropy)

  • Penalizes incorrect, confident predictions more heavily
  • Also increases when predicted probabilities are noncommittal (e.g., near 0.5) even though the true class is clear-cut
  • Smaller when predicted probabilities are closer to the truth
    • e.g., predicted probability = 0.9 when the true class is 1
  • Formula: \(\text{Log Loss} = -\frac{1}{n} \sum_{i=1}^{n} \left[ y_i \log(\hat{p}_i) + (1 - y_i) \log(1 - \hat{p}_i) \right]\)
    • \(y_i = 1\) if the true class is positive
    • \(y_i = 0\) if the true class is negative
    • \(\hat{p}_i\) = predicted probability of the positive class
  • It is a strictly proper scoring rule: its expected value is minimized by predicting the true probabilities
  • Used in logistic regression, classification trees, and many other algorithms
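The log loss formula above can be sketched directly in Python. Clipping the probabilities away from exactly 0 and 1 is an added assumption (a common safeguard against `log(0)`), not something stated in the notes.

```python
import math


def log_loss(y_true, p_hat, eps=1e-15):
    """Average negative log-likelihood of 0/1 outcomes under predicted probabilities."""
    total = 0.0
    for y, p in zip(y_true, p_hat):
        # Assumption: clip p so that log(0) never occurs.
        p = min(max(p, eps), 1 - eps)
        total += y * math.log(p) + (1 - y) * math.log(1 - p)
    return -total / len(y_true)


# One observation whose true class is positive (y = 1), scored three ways.
for p in (0.9, 0.5, 0.1):
    print(p, round(log_loss([1], [p]), 4))
```

The loss grows slowly as the prediction drifts from 0.9 to 0.5 and then sharply as it approaches a confident wrong answer, which is the penalty pattern described in the bullets.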

Brier Score

  • Measures the mean squared difference between predicted probabilities and actual outcomes
  • Formula: \(\text{Brier} = \frac{1}{n} \sum_{i=1}^{n} (\hat{p}_i - y_i)^2\)
    • Like an MSE for probabilities
    • Intuitive and interpretable
    • Can be decomposed into different sources of error (e.g., calibration and refinement)
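A minimal sketch of the Brier score formula, using made-up outcomes and probabilities.

```python
def brier_score(y_true, p_hat):
    """Mean squared difference between predicted probabilities and 0/1 outcomes."""
    return sum((p - y) ** 2 for y, p in zip(y_true, p_hat)) / len(y_true)


# Hypothetical outcomes (1 = positive) and predicted probabilities.
y_true = [1, 0, 1, 0]
p_hat = [0.9, 0.2, 0.6, 0.4]
print(brier_score(y_true, p_hat))

# A model that always predicts 0.5 scores 0.25 regardless of the outcomes.
print(brier_score(y_true, [0.5] * 4))
```

The constant-0.5 case gives a useful reference point: a score well below 0.25 suggests the probabilities carry real information on roughly balanced data.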

When to Use Scoring Rules

Scoring rules are ideal for:

  • Comparing models in feature selection
  • Tuning parameters (e.g., \(k\) in KNN, penalty terms in regularized models, or pruning in trees)
  • Estimating regression coefficients in logistic regression

They help ensure that:

  • Predicted probabilities are well-calibrated.
  • Your model aligns well with the true underlying probabilities.

Decision Rule Metrics

These metrics assess how well a model performs after applying a classification threshold to the predicted probabilities.

They are derived from the confusion matrix, which compares predicted vs. actual classifications.

Confusion Matrix

|                    | Actual Positive (+)     | Actual Negative (–)     |
|--------------------|-------------------------|-------------------------|
| Predicted Positive | correctly predicted +   | incorrectly predicted + |
| Predicted Negative | incorrectly predicted – | correctly predicted –   |

This matrix summarizes:

  • How often the model predicted correctly
  • How often the model made mistakes
  • The types of mistakes

Most software labels these cells as:

  • True Positive (TP), False Positive (FP)
  • False Negative (FN), True Negative (TN)

These labels are convenient shorthand but should be interpreted in context.

The threshold (e.g., 0.5) determines how probabilities are turned into class predictions, which in turn affects all metrics derived from this table.

The confusion matrix:

  • Cross-tabulates how many predictions were correct and how many were misclassified under the chosen threshold.
  • Changes whenever the threshold changes.
  • Can be used to compare models or tune parameters for techniques that do not produce a predicted probability.
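As an illustration, the four cells of the confusion matrix can be counted from predicted probabilities and a threshold as sketched below. The data are made up, and outcomes are assumed to be coded 1 for the positive class and 0 for the negative class.

```python
def confusion_counts(y_true, p_hat, threshold=0.5):
    """Count TP, FP, FN, TN after thresholding predicted probabilities."""
    counts = {"TP": 0, "FP": 0, "FN": 0, "TN": 0}
    for y, p in zip(y_true, p_hat):
        predicted_positive = p >= threshold
        if predicted_positive and y == 1:
            counts["TP"] += 1
        elif predicted_positive and y == 0:
            counts["FP"] += 1
        elif not predicted_positive and y == 1:
            counts["FN"] += 1
        else:
            counts["TN"] += 1
    return counts


# Hypothetical outcomes and predicted probabilities.
y_true = [1, 1, 1, 0, 0, 0, 0, 0]
p_hat = [0.9, 0.6, 0.4, 0.7, 0.3, 0.2, 0.55, 0.1]

print(confusion_counts(y_true, p_hat, threshold=0.5))
print(confusion_counts(y_true, p_hat, threshold=0.65))
```

Running it with two thresholds shows the same probabilities producing two different tables, which is the threshold dependence described above.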

Common Metrics

Misclassification Rate (MCR)

The proportion of all observations where the model’s classification was wrong (i.e., the global error rate): \[ \text{MCR} = \frac{\text{\# incorrect predictions}}{n} = \frac{\text{FP} + \text{FN}}{n} \]

Accuracy

The proportion of all observations where the model’s classification was correct: \[ \text{Accuracy} = \frac{\text{\# correct predictions}}{n} = \frac{\text{TP} + \text{TN}}{n} = 1 - \text{MCR} \]

Accuracy is an unconditional metric: it doesn’t distinguish between types of errors or account for class imbalance.

Conditional Metrics

These metrics are conditioned on either the known true class or the predicted class, so each one evaluates accuracy within a subset of observations. They help when accuracy alone is misleading.

Sensitivity (Recall / True Positive Rate)

“Given that the true class is positive, how often did we predict positive?”

  • Focuses on the true + group
  • Measures how well the model identifies actual positives
    \[ \text{Sensitivity} = \frac{\text{correctly predicted +}}{\text{total true +}} = \frac{\text{TP}}{\text{TP} + \text{FN}} \]

Specificity (True Negative Rate)

“Given that the true class is negative, how often did we predict negative?”

  • Focuses on the true – group
  • Measures how well the model avoids false positives
    \[ \text{Specificity} = \frac{\text{correctly predicted –}}{\text{total true –}} = \frac{\text{TN}}{\text{TN} + \text{FP}} \]

Positive Predictive Value (PPV / Precision)

“Given that we predicted positive, how often was that correct?”

  • Focuses on the predicted + group
  • Measures trustworthiness of a positive prediction
    \[ \text{PPV} = \frac{\text{correctly predicted +}}{\text{total predicted +}} = \frac{\text{TP}}{\text{TP} + \text{FP}} \]

Negative Predictive Value (NPV)

“Given that we predicted negative, how often was that correct?”

  • Focuses on the predicted – group
  • Measures trustworthiness of a negative prediction
    \[ \text{NPV} = \frac{\text{correctly predicted –}}{\text{total predicted –}} = \frac{\text{TN}}{\text{TN} + \text{FN}} \]
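The unconditional and conditional metrics defined so far can be computed from the four confusion-matrix counts as sketched below. The counts are hypothetical, and the sketch assumes every denominator is nonzero (for example, that at least one positive prediction was made).

```python
def classification_metrics(tp, fp, fn, tn):
    """Compute common decision-rule metrics from confusion-matrix counts."""
    n = tp + fp + fn + tn
    return {
        "accuracy": (tp + tn) / n,
        "mcr": (fp + fn) / n,
        "sensitivity": tp / (tp + fn),  # conditioned on true +
        "specificity": tn / (tn + fp),  # conditioned on true -
        "ppv": tp / (tp + fp),          # conditioned on predicted +
        "npv": tn / (tn + fn),          # conditioned on predicted -
    }


# Hypothetical counts: TP = 2, FP = 2, FN = 1, TN = 3.
for name, value in classification_metrics(tp=2, fp=2, fn=1, tn=3).items():
    print(f"{name:12s} {value:.3f}")
```

Each conditional metric uses a different row or column of the confusion matrix as its denominator, which is why they can disagree with one another and with accuracy.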

Why So Many Metrics?

Different metrics highlight different aspects of model performance and practical priorities, which can be especially helpful with imbalanced datasets.

Example: Predicting whether a patient needs high-risk surgery

  • positive class = needs surgery
  • negative class = does not need surgery

Suppose:

  • Sensitivity = 90% → correctly detects 90% of patients who need surgery
  • Specificity = 90% → correctly identifies 90% of those who don’t need surgery
  • PPV = 47% → only 47% of patients predicted to need surgery actually do
  • NPV = 98.9% → almost all patients predicted not to need it truly don’t

Doctors need PPV to decide: “How much trust should I put in a positive prediction?”
Even with high sensitivity and specificity, prevalence affects PPV and NPV.
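The link between prevalence and PPV/NPV follows from Bayes' theorem and can be sketched as below. The notes do not state the prevalence behind the surgery example; a value of about 9% is assumed here because it reproduces the quoted PPV and NPV to the stated precision.

```python
def predictive_values(sensitivity, specificity, prevalence):
    """Return (PPV, NPV) implied by sensitivity, specificity, and prevalence."""
    # Expected share of the population in each confusion-matrix cell.
    tp = sensitivity * prevalence
    fn = (1 - sensitivity) * prevalence
    tn = specificity * (1 - prevalence)
    fp = (1 - specificity) * (1 - prevalence)
    return tp / (tp + fp), tn / (tn + fn)


# Assumed prevalence of 9% with 90% sensitivity and 90% specificity.
ppv, npv = predictive_values(0.90, 0.90, prevalence=0.09)
print(f"PPV = {ppv:.1%}, NPV = {npv:.1%}")

# The same test in a population where half the patients need surgery.
ppv, npv = predictive_values(0.90, 0.90, prevalence=0.50)
print(f"PPV = {ppv:.1%}, NPV = {npv:.1%}")
```

Sensitivity and specificity are identical in both calls; only the prevalence changes, and the PPV moves from under 50% to 90%.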

Prevalence and Thresholding

  • Prevalence: The base rate of the + class in the population
    • e.g., 1 in 10 people need surgery → prevalence = 10%
  • Sensitivity and specificity do not depend on prevalence
  • PPV and NPV do depend on prevalence

Changing the Threshold

  • Thresholding affects the confusion matrix and all related metrics.
  • A lower threshold → more positive predictions:
    • Increases sensitivity, reduces specificity
  • A higher threshold → fewer positive predictions:
    • Increases specificity, reduces sensitivity
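A small sketch of this trade-off, sweeping three thresholds over made-up predicted probabilities (outcomes coded 1 for positive, 0 for negative).

```python
def sensitivity_specificity(y_true, p_hat, threshold):
    """Return (sensitivity, specificity) for a given threshold."""
    pairs = list(zip(y_true, p_hat))
    tp = sum(1 for y, p in pairs if y == 1 and p >= threshold)
    fn = sum(1 for y, p in pairs if y == 1 and p < threshold)
    tn = sum(1 for y, p in pairs if y == 0 and p < threshold)
    fp = sum(1 for y, p in pairs if y == 0 and p >= threshold)
    return tp / (tp + fn), tn / (tn + fp)


# Hypothetical data: 4 true positives and 6 true negatives.
y_true = [1, 1, 1, 1, 0, 0, 0, 0, 0, 0]
p_hat = [0.9, 0.7, 0.45, 0.35, 0.6, 0.4, 0.3, 0.2, 0.15, 0.05]

for t in (0.3, 0.5, 0.8):
    sens, spec = sensitivity_specificity(y_true, p_hat, t)
    print(f"threshold={t:.1f}  sensitivity={sens:.2f}  specificity={spec:.2f}")
```

As the threshold rises, sensitivity can only stay the same or fall, while specificity can only stay the same or rise.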

Choose a threshold based on the costs of the different types of errors.
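One common way to do this, sketched here as an illustration rather than as the method these notes prescribe, is to minimize expected cost. Assuming correct decisions cost nothing and the predicted probabilities are well calibrated, predicting positive is cheaper whenever \(p \cdot C_{FN} \geq (1 - p) \cdot C_{FP}\), which rearranges to \(p \geq C_{FP} / (C_{FP} + C_{FN})\). The cost values below are hypothetical.

```python
def cost_based_threshold(cost_fp, cost_fn):
    """Threshold that minimizes expected cost, assuming correct decisions cost 0."""
    return cost_fp / (cost_fp + cost_fn)


def expected_cost(p, predict_positive, cost_fp, cost_fn):
    """Expected cost of a decision when the true probability of + is p."""
    # Predicting + is wrong with probability 1 - p; predicting - is wrong with probability p.
    return (1 - p) * cost_fp if predict_positive else p * cost_fn


# Equal costs recover the usual 0.5 threshold.
print(cost_based_threshold(cost_fp=1, cost_fn=1))

# If a false negative is four times as costly as a false positive,
# the threshold drops so that more cases are flagged as positive.
print(cost_based_threshold(cost_fp=1, cost_fn=4))
```

At the resulting threshold the two decisions have equal expected cost, which is what makes it the break-even point.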

Summary and Takeaways

  • Scoring rules assess probability estimates.
  • Decision rule metrics evaluate classifications.
  • Confusion matrices and derived metrics:
    • Depend on threshold
    • Must reflect real-world context and costs

Errors are not always equal. In many applications, the cost of a false positive is different from a false negative.

  • Focus first on thorough EDA and building a model with good predicted probability estimates.
  • Then determine the threshold that best meets the needs of the problem.