Multiple Linear Regression (MLR) Revisited

Objectives

Understand regression problems and distinguish them from classification problems.
Identify when to use explanatory versus predictive modeling.
Compare parametric and nonparametric regression models and their trade-offs.
Review multiple linear regression (MLR) and its applications.
Evaluate key assumptions of MLR and correctly interpret model coefficients.
Explore techniques to increase model complexity when needed.
Detect and address multicollinearity in regression models.

Key Terms

The response variable (dependent variable) is what we aim to explain or predict.
Explanatory variables (independent variables) are the factors that may influence the response.
Regression: The response is a continuous numeric variable (e.g., predicting car mileage).
Classification: The response variable is categorical and can have two (binary) or more than two levels (e.g., predicting where a car was made: North America, Asia, or Europe).

Example: Regression Problem - MPG Dataset

A common regression problem is predicting a car’s miles per gallon (MPG) using various features:

Response Variable: mpg (miles per gallon)
Explanatory Variables: cylinders, horsepower, origin (3 levels), year, etc.

Regression Modeling Workflow

When analyzing a regression problem, we typically follow these steps:

Exploratory Data Analysis (EDA)
- Visualize relationships between the response and each explanatory variable.
- Identify potential trends, outliers, and transformations.
Key Questions to Consider
- Which variables explain mpg, and how strong are these relationships?
- Are there specific hypotheses to test (e.g., does a car’s origin significantly impact mpg)?
- Can we use this model to predict the mpg of a new car?

Regression Problems

A regression problem occurs when the response variable is continuous. The goal can either be explanation (understanding relationships) or prediction (making accurate future estimates).

To Explain or To Predict?

Explaining Relationships
- Hypothesis testing and confidence intervals
- Identifying relationships between response and predictors
- Adjusting for confounding (lurking) variables
Predicting Future Outcomes
- Focused on accuracy, not interpretation
- More complex models can be used

To determine the appropriate approach, ask what is the primary goal—explanation, prediction, or both?

Parametric vs. Nonparametric Regression Tools

General Regression Model Structure

The regression model follows:

\[ Y = f(X) + \varepsilon \] where:

$X = (x_1, x_2, ..., x_p)$ represents the explanatory variables.
$f(X)$ is the deterministic component.
$\varepsilon$ is the random error.

Key Terminology

Population function: $f$ (true but unknown function in the population).
Sample function: $\hat{f}$ (estimate of $f$ using observed data).

Estimating $f$: Parametric vs. Nonparametric Approaches

There are two primary approaches for estimating $f$:

Parametric
Nonparametric

The choice depends on how the response and predictor variables relate mathematically.

For example:

A numeric predictor like horsepower may follow a quadratic function.
A categorical predictor like cylinders may follow a piecewise function.

Parametric Approach

Requires specifying the functional form of $f(X)$.
Assumes additional conditions about the error term.
Allows hypothesis testing.
Works well when the model is correctly specified.
Simpler models can be more interpretable.

Nonparametric Approach

No need to specify $f(X)$.
Fully data-driven and flexible.
Can capture complex relationships.
More difficult to interpret.
May require larger datasets for reliable estimation.
Does not support traditional hypothesis testing frameworks.

Tradeoffs: Interpretability vs. Flexibility

Approach	Pros	Cons
Parametric	Interpretable Allows hypothesis testing Works well for small datasets Easier for high $p$, low $n$ settings	Cannot model complex relationships well Sensitive to misspecification
Nonparametric	Adapts to complex patterns Less restrictive assumptions Often performs better for large datasets	Requires large datasets No hypothesis testing Harder to interpret

Challenges with Parametric Models

Specifying highly complex forms of $f(X)$ is difficult.
For large datasets, nonparametric models often perform as well or better.
Additional diagnostic checks are required, such as:
- Checking model assumptions (linearity, normality, independence).
- Addressing multicollinearity and influential points.

A Review of Multiple Linear Regression (MLR)

Model Structure

\[ Y = \beta_0 + \beta_1 X_1 + \dots + \beta_p X_p + \varepsilon \]

where:

$Y$ is the response variable.
$X_1, X_2, ..., X_p$ are the explanatory variables.
$\varepsilon$ is the random error term.

Assumptions of MLR

Linearity: The relationship between predictors and response is linear.
Independence: Errors are independent.
Homoscedasticity: Errors have constant variance.
Normality: Errors follow a normal distribution.

Residual Diagnostic Checks

These checks help assess whether assumptions hold:

Residual vs. Fitted Plot – Examines constant variance (homoscedasticity).
Q-Q Plot and/or Histogram – Assesses whether residuals follow a normal distribution.
Variance Inflation Factor (VIF) – Identifies multicollinearity among predictors.
Data Collection Process Review – Ensures independence.
Violations often occur in repeated measures on the same subject or in time series data.

Basic Interpretations

When each predictor is included in the model only once, interpretations follow these general rules:

Continuous predictors: A one-unit increase in $X$ results in a change of $\beta$ in $Y$, holding all other variables constant.
Categorical variables: Represented using dummy variables.

Example: Suppose $X_1$ is categorical with three levels (A, B, C), while $X_2$ and $X_3$ are continuous. The model:

\[ Y = \beta_0 + \beta_1\text{LevelB} + \beta_2\text{LevelC} + \beta_3X_2 + \beta_4X_3 \]

$\beta_1$: Difference in mean response between group B and reference group A, holding other variables constant.
$\beta_2$: Difference in mean response between group C and reference group A, holding other variables constant.
The reference group A is not specifically listed in the model.

Why “holding all other variables constant” is important
A key advantage of MLR is its ability to control for confounding variables.

Example: Sex discrimination in pay

If we suspect a gender-based wage gap, we should control for factors like position level, education, and job role.
A more precise interpretation:
“A male with the same IQ and education level (instead of generically ‘holding all other variables constant’) is estimated to earn $28,463 more than a female counterpart.”

Adding Model Complexity

Transformations

When to transform variables:

Non-constant variance → Transform $Y$.
Nonlinear trends in residual plots → Transform $X$ or add polynomial or interaction terms.

Pros and Cons of Transformations

Helps satisfy model assumptions.
Allows reliable statistical inference on regression coefficients.
Log transformations retain interpretability.
More complex transformations may hinder interpretation (less of a concern for predictive models).

Example: Polynomial regression

Code

lm(y ~ poly(x1, 3), data=mydata) # poly(predictor, degree)

Ways to Increase Model Complexity

Modeling nonlinear trends: Introducing higher-order terms (with multiple coefficients) for a single predictor (e.g., $X^2$ or $X^3$).
Multiple linear trends by category: Using interaction terms to model relationships (between multiple predictors) that vary by group.
Adding predictors one at a time: Results in an additive model where the intercept changes, but the slope remains the same.
Adding interactions: Results in a non-additive model where both the intercept and slope change depending on another variable.

Interactions in MLR

If $X_1$ is numeric and $X_2$ is categorical with two levels:

Reference group (where $X_2=0$): $Y = \beta_0 + \beta_1 X_1$
Non-reference group (where $X_2=1$): $Y = (\beta_0 + \beta_2) + (\beta_1 + \beta_3)X_1$

Hypothesis Testing for Interactions

To test for the significance of an interaction term and generate a confidence interval around the non-reference category, we use a contrast test:

\[ H_0: c_0\beta_0 + c_1\beta_1 + c_2\beta_2 + c_3\beta_3 = 0 \]

Multicollinearity

Multicollinearity occurs when two or more explanatory variables are highly correlated, leading to:

Difficulty in holding other variables fixed.
Increased uncertainty in coefficient estimates (wider intervals, larger p-values).
Drastic changes in conclusions when variables are added or removed.

Variance Inflation Factor (VIF)

VIF measures how much a predictor is correlated with other predictors:

Higher correlation between predictors results in a higher VIF and larger standard errors ($SE(\beta)$).
$SE(\beta)$ is a function of the relationship between the variability of the response, predictors, and the VIF.
VIF $\approx$ 1: No collinearity.
$5 < \text{VIF} < 10$: Mild collinearity, investigate further.
VIF > 10: Severe collinearity, investigate adjustments.

High VIFs are not always concerning.

Expected with polynomial terms and interaction terms.
Common for categorical variables with with more than two levels.
If interpreting only one coefficient, and it has a low VIF, multicollinearity may not be an issue (i.e., if you are just accounting for the other variables).

Addressing Multicollinearity

Scenario 1: Predefined Research Questions

Start with a hypothesis and a plan to test it.
Fit model and check assumptions.
Check for multicollinearity using VIFs and graphics. Consider secondary analysis if multicollinearity is a concern.
Report findings including and excluding variables with multicollinearity concerns.
Possible solutions:
- Aggregate correlated variables into a single variable.
- Use data reduction strategies such as Principal Component Analysis (PCA).

Scenario 2: Exploratory Model Building

Used when the goal is an interpretable model, but no predefined hypothesis exists.
Identify key predictors.
Use model selection techniques to minimize collinearity.

Scenario 3: Predictive Modeling

Multicollinearity is not a concern when the goal is pure prediction.
It only affects coefficient standard errors (SE) and the hypothesis testing framework.
The focus is on model performance, not coefficient interpretation.