Multiple Logistic Regression

Objectives

  • Introduce multiple logistic regression as a tool for classification and statistical inference.
  • Explore the benefits of using multiple predictors, including adjusting for confounders.
  • Understand how to interpret coefficients and interactions.
  • Learn how to assess model fit and compare alternative models.
  • Review strategies for feature selection and model complexity.
  • Apply a general modeling workflow in practical settings.

Motivations for Multiple Logistic Regression

Simple vs. Multiple Logistic Regression

  • Simple Logistic Regression:
    • Assumes there are no confounding variables.
    • Estimates the effect of a single predictor without accounting for other factors.
  • Multiple Logistic Regression:
    • Adjusts for additional variables while estimating the effect of one predictor.
    • Allows for both numerical and categorical predictors.

Simpson’s Paradox

  • An example where a confounding variable distorts the observed relationship between predictors and the response.
  • Crane/Eagle – Math/Physics Example:
    • Math students are more likely to pass than physics students.
    • Most Eagle students were in math, while Crane had an even mix.
    • A single 2×2 table masks this confounding structure.
    • Logistic regression enables comparisons between schools while holding department fixed.
    • Can also test for interaction effects (e.g., school × department).

Prediction vs. Explanation

  • Identifying important health risk factors is an explanatory modeling goal—a statistical inference problem.
    • The goal is not just to detect which variables matter, but to quantify their effects while accounting for other factors.
  • In contrast, clinicians may also be interested in predicting the probability of disease for individual patients.
    • The emphasis is on accurate predictions, not necessarily understanding the individual role of each variable.

Case Study Examples

  • CAD Study: Predicting coronary artery disease using sex, age, and ECG results.
    • Questions: Is sex still a risk factor after adjusting for age and ECG? Are there interaction effects?
  • Titanic: Predicting survival based on ticket class, age, and sex.
    • Questions: Which factors had the largest impact on survival? Do the effects depend on each other?

The Multiple Logistic Regression Model

For multiple predictors \(X = (X_1, X_2, \ldots, X_p)\), the general logistic regression model is: \[ \log\left( \frac{p(X)}{1 - p(X)} \right) = \beta_0 + \beta_1 X_1 + \beta_2 X_2 + \cdots + \beta_p X_p \]

This model expresses the log odds of the outcome as a linear combination of predictors.

Interpretation

  • Additive models:
    • Each predictor appears once in the model.
    • \(e^{\hat{\beta}}\) is interpreted as an odds ratio.
    • Interpretation assumes we are “holding all other variables fixed.”
  • Complex models:
    • Include interaction terms and/or polynomial terms.
    • Odds ratios are still interpretable but require careful consideration of the regression formula.
    • Effects plots are commonly used for visual interpretation in these cases.

Odds Ratios in Complex Models

When a model includes interaction terms, odds ratios must be interpreted in the context of the full model. Here is an example:

Model:
\[ \log\left( \frac{p}{1 - p} \right) = \beta_0 + \beta_1 \cdot \text{Age} + \beta_2 \cdot \text{Smoking} + \beta_3 \cdot (\text{Age} \times \text{Smoking}) \]

  • Age: continuous
  • Smoking: binary (1 = smoker, 0 = non-smoker)

Interpretation:

  • For non-smokers (Smoking = 0):
    Odds ratio for a 1-year increase in age is \(\exp(\beta_1)\)
  • For smokers (Smoking = 1):
    Odds ratio for a 1-year increase in age is \(\exp(\beta_1 + \beta_3)\)

Key idea: Odds ratios remain interpretable but must account for interaction terms and the full model formula.

Classification Boundaries

  • Additive models:
    • Produce linear decision boundaries (when predictors are numeric)
    • Similar to LDA but without assuming normality of predictors
  • Complex models:
    • Can produce nonlinear boundaries, depending on the terms included:
      • Categorical × continuous: separate linear boundary per category
      • Continuous × continuous or polynomial: curved or non-parallel boundaries
      • Categorical × categorical: nonlinear effects, but not meaningful to describe in terms of a boundary shape

Feature Selection

Penalized Logistic Regression

  • Use glmnet with family = "binomial" to fit logistic regression models.
  • Useful for both regularization and feature selection.

Stepwise Selection

  • Common approaches include forward, backward, and stepwise procedures.
  • Best subset selection is also possible.
  • Model comparison metrics:
    • AIC (Akaike Information Criterion)
    • Log loss or Brier Score, evaluated on a validation set or via K-fold cross-validation

Separability Problem

  • When training data is perfectly separated:
    • This signals strong, influential predictors.
    • However, the maximum likelihood estimate (MLE), i.e., the regression coefficient that minimizes the log loss, becomes \(\hat{\beta} = \infty\).
    • Software may issue warnings or fail to converge.
    • Predicted probabilities become exactly 0 or 1.
      • No uncertainty estimate, confidence intervals, or inference is possible.
  • Solutions:
    • For explanation: use penalization or Firth’s logistic regression
    • For prediction: consider Firth’s logistic, penalized logistic (e.g., glmnet), or alternative models (e.g., LDA)

Complex Logistic Models

Modeling Strategy

  • When including higher-order interaction terms:
    • Include the corresponding lower-order interactions
      • e.g., if modeling a 3-way interaction, also include all 2-way interactions
    • Use ANOVA-style tests to assess overall significance at each level of complexity
    • Remove non-significant higher-order interactions and reassess model fit
    • Compare models using:
      • Hosmer-Lemeshow test
      • Validation or cross-validation metrics

Coefficient Interpretation

  • Use a similar strategy as in multiple linear regression:
    • Identify the appropriate contrast to combine regression coefficients
      • Allows estimation of specific comparisons or trends from significant interactions
    • Exponentiate combined coefficients to obtain odds ratio interpretations
  • In more complex models:
    • Interpretation becomes challenging
    • Use effects plots to visualize and communicate model predictions

Effects Plots

  • Display predicted probabilities across 1–3 variables
  • Hold other variables fixed:
    • Categorical: reference level (depends on software)
    • Continuous: mean or median (depends on software)

Note: Effects plots are not part of EDA. They visualize predictions from the fitted model, not the raw data. Their appearance reflects the model structure, not the underlying variable relationships.

General Workflow

The overall workflow for multiple logistic regression closely parallels that of multiple linear regression.

Modeling Steps

  1. Define primary question(s) and predictor(s)
  2. Identify potential confounders and covariates
    • These are additional variables you want to account for
  3. Perform EDA and data cleaning
    • Use summary statistics and plots
    • Explore possible interactions
  4. Fit candidate models
    • Guided by findings from EDA
    • Use ANOVA tests or feature selection to choose an appropriate model complexity
  5. Assess model fit
    • Use the Hosmer-Lemeshow test
    • Evaluate performance on a validation set or using K-fold cross-validation
  6. Interpret the final model
    • For additive models: interpret coefficients as odds ratios
    • For complex models with interactions:
      • Use appropriate contrasts to combine coefficients
      • Use effects plots for interpretation and communication

Classification Considerations

  • Try multiple classification methods (e.g., non-parametric approaches) and compare error metrics
  • Choose a classification threshold based on context
    • For example, consider the cost of false positives vs. false negatives
  • Validate both the model and the threshold using:
    • An independent test set
    • An entirely new dataset, if available
  • Set up a monitoring strategy:
    • Regularly check prediction accuracy
    • Adjust the model if data collection methods or population characteristics change