Multiple Linear Regression – Henderson's Statistics Notes for Data Science

Objectives

Learn how to obtain a multiple linear regression (MLR) model.
Understand the meaning of regression coefficients.
Identify the types of data structures that can be analyzed with multiple regression.
Visualize and interpret multidimensional relationships among qualitative and quantitative variables.
Determine when and how to include interaction terms in a model.

Adding a Third Variable

When adding $X_2$ to the model, the relationship between $X_1$ and $Y$ may:
- Remain unchanged, indicating that $X_2$ is unnecessary.
- Become stronger, weaker, or change direction.
- Vary across different values of $X_2$, suggesting an interaction effect.
Typically, explanatory variables $X_1$ and $X_2$ are correlated.
- Ideally, there should be no correlation.
- When $X_1$ and $X_2$ are correlated, adding $X_2$ to the model changes the estimated coefficient of $X_1$, and therefore the estimated response.
- Variables should only be added to improve model explanation, not simply to increase fit.
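The way a correlated second predictor shifts the coefficient of the first can be seen in a small numerical sketch. The data below are invented for illustration (they do not come from these notes), and NumPy's `np.linalg.lstsq` is used as one convenient way to obtain least-squares estimates; the helper `ols` is a hypothetical name.

```python
import numpy as np

# Hypothetical data: x2 tends to rise with x1, so the predictors are correlated.
x1 = np.array([1.0, 2.0, 3.0, 4.0, 5.0, 6.0])
x2 = np.array([2.0, 1.0, 4.0, 3.0, 6.0, 5.0])
y = 1.0 + 2.0 * x1 + 3.0 * x2  # response built from known coefficients, no noise


def ols(columns, response):
    """Least-squares coefficients for an intercept plus the given predictor columns."""
    design = np.column_stack([np.ones(len(response))] + list(columns))
    coef, *_ = np.linalg.lstsq(design, response, rcond=None)
    return coef


slr_coef = ols([x1], y)       # y regressed on x1 only
mlr_coef = ols([x1, x2], y)   # y regressed on x1 and x2

print("SLR slope for x1:", round(float(slr_coef[1]), 3))
print("MLR slope for x1:", round(float(mlr_coef[1]), 3))
```

In this made-up data set the slope for `x1` is roughly 4.49 when `x2` is left out and 2 once `x2` is included, because the simple regression attributes part of the effect of `x2` to `x1`.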

Exploratory Approaches to Adding a Third Variable

Scatterplots:
- Plot each pair of variables: $(X_1, Y)$, $(X_2, Y)$, and $(X_1, X_2)$.
- Identify potential linear relationships.
- Each explanatory variable should have a strong linear relationship with the response variable; the explanatory variables themselves should be at most weakly correlated with each other.
Pearson correlation:
- Calculate correlation between all pairs of variables.
- Correlation between explanatory variables should ideally be weaker than each variable’s correlation with the response variable.
- If the correlation between explanatory variables is too strong, they may be redundant, indicating multicollinearity.
Multicollinearity:
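- Occurs when explanatory variables are highly correlated with one another, so they do not carry independent information. This inflates the standard errors of the coefficient estimates and makes individual coefficients hard to interpret.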
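As a rough sketch of the pairwise correlation check, the snippet below computes Pearson correlations for every pair of variables in plain Python. The data and the helper name `pearson` are invented for illustration.

```python
from math import sqrt


def pearson(a, b):
    """Pearson correlation coefficient between two equal-length sequences."""
    n = len(a)
    mean_a = sum(a) / n
    mean_b = sum(b) / n
    s_ab = sum((u - mean_a) * (v - mean_b) for u, v in zip(a, b))
    s_aa = sum((u - mean_a) ** 2 for u in a)
    s_bb = sum((v - mean_b) ** 2 for v in b)
    return s_ab / sqrt(s_aa * s_bb)


# Hypothetical measurements of two predictors and a response.
data = {
    "x1": [1.0, 2.0, 3.0, 4.0, 5.0, 6.0],
    "x2": [2.0, 1.0, 4.0, 3.0, 6.0, 5.0],
    "y": [9.1, 8.2, 19.0, 17.9, 29.2, 27.8],
}

names = list(data)
for i, first in enumerate(names):
    for second in names[i + 1:]:
        r = pearson(data[first], data[second])
        print(f"corr({first}, {second}) = {r:.3f}")
```

A predictor-predictor correlation that rivals or exceeds the predictor-response correlations is a signal to look more closely for redundancy.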

Regression Equations with Two Predictors

Population equation: $\mu(Y \mid X_1, X_2) = \beta_0 + \beta_1 X_1 + \beta_2 X_2$
Interpreting slopes:
- The coefficient of an explanatory variable in an MLR model does not usually equal its coefficient in a simple linear regression (SLR), except when $X_2$ is uncorrelated with $X_1$.
- In MLR, the coefficient represents the effect of changing the value of a predictor while holding all other variables constant.
- In SLR, the coefficient represents the effect of changing the value of a predictor without accounting for other variables.
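The special case in which the SLR and MLR coefficients agree can be checked numerically. The sketch below uses a made-up balanced design in which the two predictors have zero sample correlation; the data and variable names are assumptions for illustration only.

```python
import numpy as np

# Hypothetical balanced design: x1 and x2 have zero sample correlation.
x1 = np.array([-1.0, -1.0, 1.0, 1.0])
x2 = np.array([-1.0, 1.0, -1.0, 1.0])
y = np.array([1.0, 4.0, 6.0, 12.0])

ones = np.ones_like(y)

# SLR: y on x1 only.
slr, *_ = np.linalg.lstsq(np.column_stack([ones, x1]), y, rcond=None)
# MLR: y on x1 and x2.
mlr, *_ = np.linalg.lstsq(np.column_stack([ones, x1, x2]), y, rcond=None)

print("correlation of x1 and x2:", np.corrcoef(x1, x2)[0, 1])
print("SLR slope for x1:", slr[1])
print("MLR slope for x1:", mlr[1])
```

With uncorrelated predictors, adding `x2` leaves the estimated slope for `x1` unchanged; with correlated predictors the two estimates would generally differ.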

Benefits of Adding Variables

Improves response prediction.
Increases the proportion of variance explained by the model.
Provides a more realistic description when a single explanatory variable is inadequate to explain the response.

Incorporating Categorical (Dummy) Explanatory Variables

Best practice for a two-category variable:
- Code one category as 0 (the reference group) and the other as 1 (the comparison group).
- The reference category’s mean corresponds to the intercept.
- The coefficient of the indicator variable represents the average difference between the two groups.
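A minimal sketch of the 0/1 coding, using invented responses for two groups: regressing the response on the indicator alone gives an intercept equal to the reference-group mean and a slope equal to the difference in group means.

```python
# Hypothetical responses for the two categories.
reference = [10.0, 12.0, 14.0]    # coded 0
comparison = [15.0, 17.0, 19.0]   # coded 1

x = [0] * len(reference) + [1] * len(comparison)
y = reference + comparison

# Simple least-squares slope and intercept for y on the indicator x.
mean_x = sum(x) / len(x)
mean_y = sum(y) / len(y)
s_xy = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
s_xx = sum((a - mean_x) ** 2 for a in x)
slope = s_xy / s_xx
intercept = mean_y - slope * mean_x

mean_reference = sum(reference) / len(reference)
mean_comparison = sum(comparison) / len(comparison)

print("intercept:", intercept, "reference mean:", mean_reference)
print("slope:", slope, "difference in means:", mean_comparison - mean_reference)
```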

Approaches to Handling a Third Variable with Two Categories

SLR: Both categories are combined.
Parallel lines model: Different intercepts, same slope.
Interaction model: Different intercepts and slopes.

Examples of Multiple Linear Regression Models

The following equations illustrate several ways that explanatory variables can enter a multiple linear regression model.

Two predictors, additive model:
\[ \mu(Y \mid X_1, X_2) = \beta_0 + \beta_1 X_1 + \beta_2 X_2 \] The mean of $Y$ is modeled as a linear function of $X_1$ and $X_2$, each contributing additively.
One predictor with a quadratic term:
\[ \mu(Y \mid X_1) = \beta_0 + \beta_1 X_1 + \beta_2 X_1^2 \] Includes a squared term for $X_1$ to capture curvature in the relationship.
Two predictors with an interaction term:
\[ \mu(Y \mid X_1, X_2) = \beta_0 + \beta_1 X_1 + \beta_2 X_2 + \beta_3 (X_1 X_2) \] Allows the effect of one predictor to depend on the value of the other.
Logarithmic transformation of predictors:
\[ \mu(Y \mid X_1, X_2) = \beta_0 + \beta_1 \log(X_1) + \beta_2 \log(X_2) \] A fixed proportional change in $X_1$ or $X_2$ (for example, a doubling) is associated with a fixed additive change in the mean of $Y$.
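All four models are linear in the coefficients; they differ only in which columns appear in the design matrix. The sketch below builds those columns with NumPy for a few invented predictor values (the array names are illustrative assumptions).

```python
import numpy as np

# Hypothetical, strictly positive predictor values (needed for the log model).
x1 = np.array([1.0, 2.0, 3.0, 4.0])
x2 = np.array([2.0, 3.0, 5.0, 7.0])
ones = np.ones_like(x1)  # intercept column

# Each design matrix has one column per coefficient in the model.
additive = np.column_stack([ones, x1, x2])               # b0, b1, b2
quadratic = np.column_stack([ones, x1, x1 ** 2])         # b0, b1, b2
interaction = np.column_stack([ones, x1, x2, x1 * x2])   # b0, b1, b2, b3
log_terms = np.column_stack([ones, np.log(x1), np.log(x2)])

for name, design in [("additive", additive), ("quadratic", quadratic),
                     ("interaction", interaction), ("log", log_terms)]:
    print(name, design.shape)
```

Any of these matrices can be passed to a least-squares routine in the same way, which is why curvature, interaction, and transformed predictors all fit within multiple linear regression.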

Assumption of Constant Variance

Constant variance assumption: $\text{Var}\{Y \mid X_1, X_2\} = \sigma^2$
The variance of $Y$ remains the same across all values of $X_1$ and $X_2$.

Interpretation of Regression Coefficients

The regression surface of an MLR model with two explanatory variables is planar:
- $\beta_0$ represents the height of the plane when both predictors are zero.
- $\beta_1$ represents the slope along $X_1$, holding $X_2$ constant.
- $\beta_2$ represents the slope along $X_2$, holding $X_1$ constant.
The effect of an explanatory variable is the change in mean response associated with a one-unit increase in that variable, while keeping all other explanatory variables fixed:
- Effect of $X_1$: $\mu(Y \mid X_1 + 1, X_2) - \mu(Y \mid X_1, X_2) = \beta_1$
- Effect of $X_2$: $\mu(Y \mid X_1, X_2 + 1) - \mu(Y \mid X_1, X_2) = \beta_2$
- The coefficient of each explanatory variable measures its effect at fixed values of the other.
- In the planar model, the effect of each explanatory variable is the same at all levels of the other explanatory variable.
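The constant-effect property of the planar model can be illustrated with a few lines of arithmetic. The coefficient values below are hypothetical and chosen only to make the point.

```python
# Hypothetical coefficients for mu(Y | X1, X2) = B0 + B1*X1 + B2*X2.
B0, B1, B2 = 10.0, 2.0, -3.0


def mean_response(x1, x2):
    """Mean of Y at the given predictor values under the additive model."""
    return B0 + B1 * x1 + B2 * x2


def effect_of_x1(x1, x2):
    """Change in mean response for a one-unit increase in X1, holding X2 fixed."""
    return mean_response(x1 + 1, x2) - mean_response(x1, x2)


# The effect equals B1 regardless of where X1 starts or where X2 is held.
for x1 in (0.0, 5.0):
    for x2 in (0.0, 4.0, 9.0):
        print(f"x1={x1}, x2={x2}: effect of x1 = {effect_of_x1(x1, x2)}")
```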

Illustration of Regression Plane

**Interpretation of regression coefficients in an additive multiple linear regression model with two explanatory variables.** The planar surface represents the fitted model $\hat{Y} = \beta_0 + \beta_1 X_1 + \beta_2 X_2$. Each bell curve shows the distribution of $Y$ at a fixed $X_1$ and $X_2$ value. The slope $\beta_1$ is the change in mean response per unit increase in $X_1$, holding $X_2$ constant (illustrated by the gray line through the back row of means). The slope $\beta_2$ is the change in mean response per unit increase in $X_2$, holding $X_1$ constant (illustrated by the purple connector between two bells at the same $X_1$).

Parallel Lines Regression Model

Indicator (dummy) variable:
- Represents two levels of a categorical explanatory variable.
- Takes values 0 (reference group, attribute absent) or 1 (comparison group, attribute present).
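- Reversing the coding (swapping which level is 0 and which is 1) gives the same fitted values; only the sign of the indicator's coefficient and the value of the intercept change.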
Regression model for an indicator variable:
- When $\text{pred}_2 = 0$:
  \[ \mu(Y \mid X_1, \text{pred}_2 = 0) = \beta_0 + \beta_1 X_1 + \beta_2(0) = \beta_0 + \beta_1 X_1 \]
- When $\text{pred}_2 = 1$:
  \[ \mu(Y \mid X_1, \text{pred}_2 = 1) = \beta_0 + \beta_1 X_1 + \beta_2(1) = (\beta_0 + \beta_2) + \beta_1 X_1 \]
- Interpretation:
  - Slope: $\beta_1$ (same for both categories).
  - Intercept: Adjusted by $\beta_2$ between the two groups.
  - Intercept where $\text{pred}_2 = 0$ is $\beta_0$, and intercept where $\text{pred}_2 = 1$ is $\beta_0 + \beta_2$.
  - The two lines are separated by a constant vertical distance of $\beta_2$.
  - The coefficient of the indicator variable is the difference between the mean response for the indicated category (1) and the reference category (0), at fixed values of the other explanatory variables.

Indicator Variables for Categorical Variables with 3+ Categories

For a categorical variable with $k$ levels, $k - 1$ indicator variables are needed.
The reference category has no indicator variable.
A shorthand notation capitalizes the categorical variable name to represent the set of indicator variables (e.g., $\mu\{\text{response} \mid \text{pred}_1, \text{PRED}_2\}$).
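A small sketch of the $k - 1$ coding for a three-level variable follows. The variable `region`, its levels, and the helper `indicator_columns` are all hypothetical.

```python
def indicator_columns(values, reference):
    """Return one 0/1 column per non-reference level of a categorical variable."""
    levels = sorted(set(values) - {reference})
    return {level: [1 if v == level else 0 for v in values] for level in levels}


# Hypothetical categorical variable with k = 3 levels.
region = ["north", "south", "west", "south", "north"]
columns = indicator_columns(region, reference="north")

for level, column in columns.items():
    print(level, column)
```

Observations in the reference level ("north" here) have 0 in every indicator column, so their mean is absorbed by the intercept.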

Product Term for Interaction

Interaction occurs when the effect of one explanatory variable depends on another.
An interaction term is the product of two explanatory variables.
Example: Two-level indicator variable:
- General model: $\mu(Y \mid X_1, X_2) = \beta_0 + \beta_1 X_1 + \beta_2 X_2 + \beta_3 (X_1 \times X_2)$
- When $X_2 = 0$: $\mu(Y \mid X_1, X_2 = 0) = \beta_0 + \beta_1 X_1$
- When $X_2 = 1$: $\mu(Y \mid X_1, X_2 = 1) = \beta_0 + \beta_2 + (\beta_1 + \beta_3)X_1$
  Here, $(\beta_1 + \beta_3)$ is the slope for $X_1$ when $X_2 = 1$.
- Interpretation (separate slopes model):
  - $\beta_1$: Slope at the reference level.
  - $\beta_3$: Difference in slopes between groups.
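To see the separate slopes model at work, the sketch below fits an interaction model to invented data in which the two groups follow exactly known lines. NumPy's `np.linalg.lstsq` is used as one possible least-squares routine; the data and names are assumptions.

```python
import numpy as np

# Hypothetical data: group 0 follows y = 1 + 2x, group 1 follows y = 4 + 5x.
x = np.array([0.0, 1.0, 2.0, 3.0, 0.0, 1.0, 2.0, 3.0])
group = np.array([0.0, 0.0, 0.0, 0.0, 1.0, 1.0, 1.0, 1.0])
y = np.where(group == 0, 1.0 + 2.0 * x, 4.0 + 5.0 * x)

# Columns: intercept, x, indicator, product term.
design = np.column_stack([np.ones_like(x), x, group, x * group])
b0, b1, b2, b3 = np.linalg.lstsq(design, y, rcond=None)[0]

print("reference line: intercept", round(float(b0), 3), "slope", round(float(b1), 3))
print("comparison line: intercept", round(float(b0 + b2), 3),
      "slope", round(float(b1 + b3), 3))
```

The product-term coefficient `b3` recovers the difference in slopes (5 minus 2), and `b2` recovers the difference in intercepts (4 minus 1).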

Example: Gender, Salary, and Years Model

Parameter Estimates

Variable	DF	Est	SE	t	Pr > |t|
Intercept	1	34.19	1.22	28.04	<0.0001
Gender	1	3.35	1.46	2.29	0.0263
Years	1	1.44	0.13	10.83	<0.0001

The Gender coefficient (3.35) is the adjustment to the intercept for the comparison category (males) relative to the reference category (females).

Gender (Indicator Variable)

0 = Female (reference group)
1 = Male (comparison group)

Model
\[ \mu(\text{Salary} \mid \text{Years}, \text{Gender}) = \beta_0 + \beta_1 \cdot \text{Years} + \beta_2 \cdot \text{Gender} \]

Grouping the constant terms:
\[ \mu(\text{Salary} \mid \text{Years}, \text{Gender}) = \left( \beta_0 + \beta_2 \cdot \text{Gender} \right) + \beta_1 \cdot \text{Years} \]

Female intercept: $\beta_0 + \beta_2 \cdot 0 = \beta_0$
Male intercept: $\beta_0 + \beta_2 \cdot 1 = \beta_0 + \beta_2$
Slope for Years: $\beta_1$ (same for both groups)

Predicted Salary Equations

Parallel lines with different intercepts:

Female: $\text{Salary} = 34.19 + 1.44 \cdot \text{Years}$
Male: $\text{Salary} = (34.19 + 3.35) + 1.44 \cdot \text{Years} = 37.54 + 1.44 \cdot \text{Years}$

**Parallel lines regression model for salary by gender, controlling for years of experience.** The vertical distance between the two lines $(\Delta = \beta_2 = 3.35)$ represents the estimated average difference in salary between males and females for the same number of years of experience.

Interpretation: After accounting for years of experience, the estimated mean salary for males is $3,350 higher than for females (salary is recorded in thousands of dollars).
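The parallel lines predictions can be reproduced directly from the estimates in the table. The function name `predicted_salary` is an illustrative choice, and salary is taken to be in thousands of dollars, as the interpretation above implies.

```python
# Estimates from the parameter table (salary in thousands of dollars).
B0 = 34.19        # intercept (females, 0 years)
B_YEARS = 1.44    # common slope for years of experience
B_GENDER = 3.35   # intercept shift for males (Gender = 1)


def predicted_salary(years, gender):
    """Predicted mean salary under the parallel lines model; gender is 0 or 1."""
    return B0 + B_YEARS * years + B_GENDER * gender


for years in (0, 5, 10):
    female = predicted_salary(years, 0)
    male = predicted_salary(years, 1)
    print(f"{years:>2} years: female {female:.2f}, male {male:.2f}, "
          f"gap {male - female:.2f}")
```

The gap printed on each row is the same because the two lines share a slope.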

Example: Gender × Years Interaction Model

Parameter Estimates

Variable	DF	Est	SE	t	Pr > |t|
Intercept	1	35.12	1.67	20.97	<0.0001
Years	1	1.24	0.28	4.38	<0.0001
Gender	1	1.94	2.29	0.85	0.4021
Gender × Years	1	0.26	0.08	3.19	0.0025

Gender coding (indicator variable)

0 = Female (reference group)
1 = Male (comparison group)

Model
\[ \mu(\text{Salary} \mid \text{Years}, \text{Gender}) = \beta_0 + \beta_1 \cdot \text{Years} + \beta_2 \cdot \text{Gender} + \beta_3 \cdot (\text{Gender} \times \text{Years}) \]

Group‑specific equations

Female ($\text{Gender}=0$):
\[ \mu(\text{Salary} \mid \text{Years}, 0) = \beta_0 + \beta_1 \cdot \text{Years} = 35.12 + 1.24 \cdot \text{Years} \]
Male ($\text{Gender}=1$):
\[ \begin{aligned} \mu(\text{Salary} \mid \text{Years}, 1) &= (\beta_0 + \beta_2) + (\beta_1 + \beta_3)\cdot \text{Years} \\ &= (35.12 + 1.94) + (1.24 + 0.26)\cdot \text{Years} = 37.06 + 1.50 \cdot \text{Years} \end{aligned} \]

Interpretation

Intercept difference at 0 years (male vs. female): $\beta_2 = 1.94$.
Slope difference (male vs. female): $\beta_3 = 0.26$.
Vertical difference at a fixed number of years $x$:
$\Delta(x) = \beta_2 + \beta_3 x$.
For example, at 10 years, $\Delta(10) = 1.94 + 0.26(10) = 4.54$.
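The growing gap $\Delta(x)$ can be tabulated from the interaction model estimates. This is a sketch using the table values; the function names are illustrative.

```python
# Estimates from the interaction model table (salary in the table's units).
B0 = 35.12             # intercept (females, 0 years)
B_YEARS = 1.24         # slope for females
B_GENDER = 1.94        # intercept difference (male minus female) at 0 years
B_INTERACTION = 0.26   # slope difference (male minus female)


def predicted_salary(years, gender):
    """Predicted mean salary under the interaction model; gender is 0 or 1."""
    return (B0 + B_YEARS * years + B_GENDER * gender
            + B_INTERACTION * gender * years)


def gap(years):
    """Male minus female predicted salary at a given number of years."""
    return B_GENDER + B_INTERACTION * years


for years in (0, 5, 10, 20):
    print(f"{years:>2} years: gap = {gap(years):.2f}")
```

Unlike the parallel lines model, the gap here depends on years of experience, so a single number cannot summarize the gender difference.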

**Interaction model for salary by gender, controlling for years of experience.** Lines are not parallel because the slope differs by gender. The intercept difference at 0 years is $\beta_2=1.94$; the slope difference is $\beta_3=0.26$. The vertical gap at a fixed number of years $x$ equals $\Delta(x)=\beta_2+\beta_3x$ (e.g., 4.54 at 10 years).