Objectives
- Understand regression problems and distinguish them from classification problems.
- Identify when to use explanatory versus predictive modeling.
- Compare parametric and nonparametric regression models and their trade-offs.
- Review multiple linear regression (MLR) and its applications.
- Evaluate key assumptions of MLR and correctly interpret model coefficients.
- Explore techniques to increase model complexity when needed.
- Detect and address multicollinearity in regression models.
Key Terms
- The response variable (dependent variable) is what we aim to explain or predict.
- Explanatory variables (independent variables) are the factors that may influence the response.
- Regression: The response is a continuous numeric variable (e.g., predicting car mileage).
- Classification: The response variable is categorical and can have two (binary) or more than two levels (e.g., predicting where a car was made: North America, Asia, or Europe).
Example: Regression Problem - MPG Dataset
A common regression problem is predicting a car’s miles per gallon (MPG) using various features:
- Response Variable: mpg (miles per gallon)
- Explanatory Variables: cylinders, horsepower, origin (3 levels), year, etc.
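A minimal Python sketch of how this dataset might be loaded and typed. The file name auto_mpg.csv is an assumption; any copy of the classic Auto MPG data with these column names would work:

```python
import pandas as pd

# Hypothetical file name; the classic Auto MPG data is widely available as a CSV.
cars = pd.read_csv("auto_mpg.csv")

# Response: mpg. Explanatory variables: a mix of numeric and categorical columns.
print(cars[["mpg", "cylinders", "horsepower", "origin", "year"]].head())

# origin is categorical (North America / Asia / Europe), so make that explicit.
cars["origin"] = cars["origin"].astype("category")
```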
Regression Modeling Workflow
When analyzing a regression problem, we typically follow these steps:
- Exploratory Data Analysis (EDA)
  - Visualize relationships between the response and each explanatory variable.
  - Identify potential trends, outliers, and transformations.
- Key Questions to Consider
  - Which variables explain mpg, and how strong are these relationships?
  - Are there specific hypotheses to test (e.g., does a car's origin significantly impact mpg)?
  - Can we use this model to predict the mpg of a new car?
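Continuing the sketch above, a few quick EDA plots (this assumes the hypothetical cars DataFrame from the previous block):

```python
import matplotlib.pyplot as plt

# Scatter mpg against each numeric predictor to spot trends, outliers,
# and candidate transformations (e.g., curvature in horsepower).
numeric_predictors = ["cylinders", "horsepower", "year"]
fig, axes = plt.subplots(1, len(numeric_predictors), figsize=(12, 4))
for ax, col in zip(axes, numeric_predictors):
    ax.scatter(cars[col], cars["mpg"], alpha=0.5)
    ax.set_xlabel(col)
    ax.set_ylabel("mpg")
fig.tight_layout()
plt.show()

# For the categorical predictor, compare mpg distributions across origins.
cars.boxplot(column="mpg", by="origin")
plt.show()
```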
Regression Problems
A regression problem occurs when the response variable is continuous. The goal can either be explanation (understanding relationships) or prediction (making accurate future estimates).
To Explain or To Predict?
- Explaining Relationships
  - Hypothesis testing and confidence intervals
  - Identifying relationships between response and predictors
  - Adjusting for confounding (lurking) variables
- Predicting Future Outcomes
  - Focused on accuracy, not interpretation
  - More complex models can be used
To determine the appropriate approach, ask: is the primary goal explanation, prediction, or both?
A Review of Multiple Linear Regression (MLR)
Model Structure
\[
Y = \beta_0 + \beta_1 X_1 + \dots + \beta_p X_p + \varepsilon
\]
where:
- \(Y\) is the response variable.
- \(X_1, X_2, \dots, X_p\) are the explanatory variables.
- \(\beta_0, \beta_1, \dots, \beta_p\) are the unknown regression coefficients.
- \(\varepsilon\) is the random error term.
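As an illustration, here is how such a model might be fit with statsmodels' formula interface, reusing the hypothetical cars DataFrame from the MPG example:

```python
import statsmodels.formula.api as smf

# Fit the MLR; C() expands origin into dummy variables against a
# reference level, so no manual encoding is needed.
model = smf.ols("mpg ~ horsepower + year + C(origin)", data=cars).fit()
print(model.summary())  # estimates, standard errors, t-tests, R^2
```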
Assumptions of MLR
- Linearity: The relationship between predictors and response is linear.
- Independence: Errors are independent.
- Homoscedasticity: Errors have constant variance.
- Normality: Errors follow a normal distribution.
Residual Diagnostic Checks
These checks help assess whether assumptions hold:
- Residual vs. Fitted Plot – Examines constant variance (homoscedasticity).
- Q-Q Plot and/or Histogram – Assesses whether residuals follow a normal distribution.
- Variance Inflation Factor (VIF) – Identifies multicollinearity among predictors.
- Data Collection Process Review – Ensures independence.
  - Violations often occur in repeated measures on the same subject or in time series data.
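The first two checks can be produced directly from a fitted statsmodels model (continuing the sketch above, where model is the OLS fit from the previous block):

```python
import matplotlib.pyplot as plt
import statsmodels.api as sm

fitted = model.fittedvalues
resid = model.resid

# Residuals vs. fitted: a funnel shape suggests non-constant variance;
# curvature suggests a missed nonlinear trend.
plt.scatter(fitted, resid, alpha=0.5)
plt.axhline(0, linestyle="--", color="gray")
plt.xlabel("Fitted values")
plt.ylabel("Residuals")
plt.show()

# Q-Q plot: points should track the reference line if residuals are
# approximately normal.
sm.qqplot(resid, line="s")
plt.show()
```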
Basic Interpretations
When each predictor is included in the model only once, interpretations follow these general rules:
- Continuous predictors: A one-unit increase in \(X_j\) is associated with an estimated change of \(\beta_j\) in the mean of \(Y\), holding all other variables constant.
- Categorical variables: Represented using dummy variables.
Example: Suppose \(X_1\) is categorical with three levels (A, B, C), while \(X_2\) and \(X_3\) are continuous. The model:
\[
Y = \beta_0 + \beta_1\text{LevelB} + \beta_2\text{LevelC} + \beta_3X_2 + \beta_4X_3
\]
- \(\beta_1\): Difference in mean response between group B and reference group A, holding other variables constant.
- \(\beta_2\): Difference in mean response between group C and reference group A, holding other variables constant.
- The reference level A does not appear as its own term; it is absorbed into the intercept \(\beta_0\).
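A small illustration with toy data (the values are made up purely to show the dummy coding; patsy names the dummies C(X1)[T.B] and C(X1)[T.C]):

```python
import pandas as pd
import statsmodels.formula.api as smf

# Toy data mirroring the model above: three-level X1, continuous X2 and X3.
df = pd.DataFrame({
    "Y":  [5.1, 6.3, 7.8, 6.0, 8.2, 9.1, 7.4, 8.8],
    "X1": ["A", "A", "B", "B", "C", "C", "A", "B"],
    "X2": [1.0, 2.0, 1.5, 2.5, 1.2, 2.2, 1.8, 2.1],
    "X3": [0.3, 0.1, 0.4, 0.2, 0.5, 0.6, 0.2, 0.3],
})

fit = smf.ols("Y ~ C(X1) + X2 + X3", data=df).fit()
# C(X1)[T.B] and C(X1)[T.C] are the dummy coefficients: estimated
# differences from reference level A, holding X2 and X3 constant.
print(fit.params)
```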
Why “holding all other variables constant” is important
A key advantage of MLR is its ability to control for confounding variables.
Example: Sex discrimination in pay
- If we suspect a gender-based wage gap, we should control for factors like position level, education, and job role.
- A more precise interpretation:
“A male with the same education and position level (instead of generically ‘holding all other variables constant’) is estimated to earn $28,463 more than a female counterpart.”
Adding Model Complexity
Ways to Increase Model Complexity
- Modeling nonlinear trends: Introducing higher-order terms (with multiple coefficients) for a single predictor (e.g., \(X^2\) or \(X^3\)).
- Multiple linear trends by category: Using interaction terms to model relationships (between multiple predictors) that vary by group.
- Adding a predictor additively: Results in an additive model where the intercept changes (e.g., shifts by group), but the slope remains the same.
- Adding interactions: Results in a non-additive model where both the intercept and slope change depending on another variable.
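A sketch of how these pieces look in statsmodels' formula syntax, again reusing the hypothetical cars data:

```python
import statsmodels.formula.api as smf

# In the formula interface (patsy):
#   I(horsepower ** 2)     adds a quadratic term for one predictor
#   horsepower * C(origin) adds both main effects plus their interaction
poly_fit = smf.ols("mpg ~ horsepower + I(horsepower ** 2)", data=cars).fit()
inter_fit = smf.ols("mpg ~ horsepower * C(origin)", data=cars).fit()

# The interaction fit allows a separate intercept and slope shift for
# each origin group relative to the reference level.
print(inter_fit.params)
```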
Interactions in MLR
If \(X_1\) is numeric and \(X_2\) is categorical with two levels, the interaction model is
\[
Y = \beta_0 + \beta_1 X_1 + \beta_2 X_2 + \beta_3 X_1 X_2 + \varepsilon
\]
- Reference group (where \(X_2 = 0\)): \(Y = \beta_0 + \beta_1 X_1\)
- Non-reference group (where \(X_2 = 1\)): \(Y = (\beta_0 + \beta_2) + (\beta_1 + \beta_3) X_1\)
Hypothesis Testing for Interactions
To test the significance of a combination of coefficients, such as the non-reference group's slope \(\beta_1 + \beta_3\), and to build a confidence interval around it, we use a contrast test:
\[
H_0: c_0\beta_0 + c_1\beta_1 + c_2\beta_2 + c_3\beta_3 = 0
\]
For example, choosing \(c = (0, 1, 0, 1)\) tests \(H_0: \beta_1 + \beta_3 = 0\), i.e., whether the slope in the non-reference group is zero.
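One way to carry out such a contrast in practice is statsmodels' t_test, shown here on simulated data; the parameter names and their order come from patsy, so the sketch prints them and builds the contrast vector by name:

```python
import numpy as np
import pandas as pd
import statsmodels.formula.api as smf

# Simulated data: numeric X1, binary X2, with a true interaction.
rng = np.random.default_rng(0)
n = 100
X1 = rng.uniform(0, 10, n)
X2 = rng.integers(0, 2, n)
Y = 1.0 + 0.5 * X1 + 2.0 * X2 + 0.8 * X1 * X2 + rng.normal(0, 1, n)
df = pd.DataFrame({"Y": Y, "X1": X1, "X2": X2})

fit = smf.ols("Y ~ X1 * C(X2)", data=df).fit()
print(fit.params.index.tolist())  # patsy's parameter names and order

# Contrast picking out beta1 + beta3 (the non-reference slope); t_test
# reports the estimate, the test, and a confidence interval.
c = [1.0 if name in ("X1", "X1:C(X2)[T.1]") else 0.0
     for name in fit.params.index]
print(fit.t_test(c))
```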
Multicollinearity
Multicollinearity occurs when two or more explanatory variables are highly correlated, leading to:
- Difficulty in holding other variables fixed.
- Increased uncertainty in coefficient estimates (wider intervals, larger p-values).
- Drastic changes in conclusions when variables are added or removed.
Variance Inflation Factor (VIF)
VIF measures how strongly a predictor is linearly related to the other predictors. For predictor \(j\),
\[
\text{VIF}_j = \frac{1}{1 - R_j^2},
\]
where \(R_j^2\) is the \(R^2\) from regressing \(X_j\) on all of the other predictors.
- Higher correlation between predictors results in a higher VIF and larger standard errors \(SE(\hat{\beta}_j)\).
- Specifically, \(\operatorname{Var}(\hat{\beta}_j) = \frac{\sigma^2}{(n-1)\operatorname{Var}(X_j)} \cdot \text{VIF}_j\): the variability of the response, the spread of the predictor, and the VIF together determine \(SE(\hat{\beta}_j)\).
- VIF \(\approx\) 1: No collinearity; values below 5 are generally not a concern.
- \(5 < \text{VIF} < 10\): Mild collinearity, investigate further.
- VIF > 10: Severe collinearity, investigate adjustments.
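A sketch of computing VIFs with statsmodels, continuing the hypothetical mpg example:

```python
import pandas as pd
import statsmodels.api as sm
from statsmodels.stats.outliers_influence import variance_inflation_factor

# An intercept column is included so each VIF is computed around the
# variable means, as in the regression itself.
X = sm.add_constant(cars[["cylinders", "horsepower", "year"]].dropna())
vifs = pd.Series(
    [variance_inflation_factor(X.values, i) for i in range(X.shape[1])],
    index=X.columns,
)
print(vifs.drop("const"))  # flag values above roughly 5 or 10
```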
High VIFs are not always concerning.
- Expected with polynomial terms and interaction terms.
- Common for categorical variables with more than two levels.
- If you are interpreting only one coefficient and it has a low VIF, multicollinearity may not be an issue (i.e., the other variables are included simply to account for them).
Addressing Multicollinearity
Scenario 1: Predefined Research Questions
- Start with a hypothesis and a plan to test it.
- Fit the model and check assumptions.
- Check for multicollinearity using VIFs and graphics. Consider a secondary analysis if multicollinearity is a concern.
- Report findings both including and excluding the variables with multicollinearity concerns.
- Possible solutions:
  - Aggregate correlated variables into a single variable.
  - Use data reduction strategies such as Principal Component Analysis (PCA).
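A minimal sketch of the PCA option, assuming a block of correlated engine-related columns in the mpg data (the column names are illustrative):

```python
from sklearn.decomposition import PCA
from sklearn.preprocessing import StandardScaler

# Collapse a block of correlated predictors into one score; standardize
# first so that scale alone does not dominate the component.
block = cars[["cylinders", "displacement", "horsepower", "weight"]].dropna()
scaled = StandardScaler().fit_transform(block)
pc1 = PCA(n_components=1).fit_transform(scaled)[:, 0]
# pc1 can replace the correlated columns as a single "engine size" score.
```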
Scenario 2: Exploratory Model Building
- Used when the goal is an interpretable model, but no predefined hypothesis exists.
- Identify key predictors.
- Use model selection techniques to minimize collinearity.
Scenario 3: Predictive Modeling
- Multicollinearity is not a concern when the goal is pure prediction.
- It only affects coefficient standard errors (SE) and the hypothesis testing framework.
- The focus is on model performance, not coefficient interpretation.
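To make the contrast concrete, a hedged sketch of a purely predictive check: the model is judged by held-out error rather than by coefficient inference (reusing the hypothetical cars data):

```python
import numpy as np
import statsmodels.formula.api as smf
from sklearn.model_selection import train_test_split

# Hold out a test set and judge the model by prediction error alone.
train, test = train_test_split(cars.dropna(), test_size=0.25, random_state=0)
fit = smf.ols("mpg ~ horsepower + year + C(origin)", data=train).fit()
pred = fit.predict(test)
rmse = np.sqrt(np.mean((test["mpg"] - pred) ** 2))
# Collinear predictors may inflate coefficient SEs, but held-out RMSE
# remains a valid measure of predictive quality.
print(f"Held-out RMSE: {rmse:.2f}")
```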