Regression Diagnostics and Model Refinement

Objectives

  • Calculate and interpret residuals, standardized residuals, and studentized residuals.
  • Use residuals to check regression assumptions.
  • Apply necessary remedies when assumptions are violated.
  • Understand the robustness of assumptions (i.e., what can you get away with?)

Robustness of Regression Assumptions

Linearity

  • Parameter estimates will be misleading if a straight-line model is inadequate or fits only part of the data.
  • Predictions will be biased, and confidence intervals (CIs) will not appropriately reflect uncertainty.
  • Remedy: Consider adding polynomial terms (quadratic or cubic) to improve model fit.

Normality

  • Transformations that correct for normality often address constant variance as well.
  • Effects of non-normality:
    • Coefficient estimates and standard errors: Robust, except with many outliers and small sample sizes.
    • Confidence intervals: Affected primarily by outliers (long tails).
    • Prediction intervals: Sensitive to non-normality due to reliance on normal distributions.

Constant Variance (Homoscedasticity)

  • For every value of \(x\), the spread of \(y\) should be the same.
  • Least squares estimates remain unbiased with slight violations.
  • Large violations can cause standard errors to underestimate or overestimate uncertainty, leading to misleading confidence intervals and hypothesis tests.
  • Remedy: Large violations should be corrected using a transformation.

Independence

  • Parameter estimates: Not affected by violations.
  • Standard errors: Affected significantly. Violations can lead to underestimated standard errors, which inflate t-statistics and make it easier to incorrectly reject the null hypothesis.
  • Remedy: Serial and cluster effects require different models.

How Much Deviation Is Acceptable?

  • Small violations of assumptions generally do not invalidate regression results.
  • However, large deviations can lead to inaccurate estimates, especially for standard errors, confidence intervals, and p-values.
  • Transformations can often correct violations and improve interpretability.
  • Only severe departures from linearity or normality (e.g., due to outliers) typically require alternative methods.

Influential and Outlying Observations

Key Concepts

  • Influential observations: These are points that, if added or removed, substantially change the regression line (e.g., the slope or intercept).
  • Leverage: A measure of how far an observation’s \(x\) value is from the mean of all \(x\) values (\(\bar{x}\)).
    • Points farther from \(\bar{x}\) have higher leverage.
    • Mathematically, leverage increases with the squared distance from \(x_i\) to \(\bar{x}\), relative to the total sum of squares in \(x\). This is closely related to how many standard deviations \(x_i\) is from the mean.
    • Leverage is based on the \(x\) values alone; it does not depend on \(y\).

Impact of Outliers

  • Low leverage, low influence: Minimal effect on estimates.
  • High leverage, low influence: Far from most data but consistent with the trend; minimal effect on the correlation, standard errors, and regression estimates.
  • High leverage, high influence: Can distort results by pulling the regression line toward the outlier, resulting in different parameter estimates.

Detecting Influential Observations

Leverage Statistic, \(h_{ii}\)

  • Measures how far the \(x\) value for an observation is from the mean \(\bar{x}\), in relation to the total spread of \(x\) values. Larger values of \(h_{ii}\) indicate higher leverage.
  • An observation is considered to have high leverage if \(h_{ii} > \dfrac{2p}{n}\), where \(p\) is the number of parameters in the model (including the intercept).
  • Formula: \[ h_{ii} = \frac{1}{n} + \frac{(X_i - \bar{X})^2}{\sum_{j=1}^{n} (X_j - \bar{X})^2} \] or \[ h_{ii} = \frac{1}{n - 1} \left[ \frac{(X_i - \bar{X})}{s_x} \right]^2 + \frac{1}{n} \]
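To make the cutoff concrete, here is a small NumPy sketch (the \(x\) values are hypothetical) that computes \(h_{ii}\) for simple linear regression from the formula above and flags points exceeding \(2p/n\):

```python
import numpy as np

# Hypothetical x values; the last one sits far from the mean.
x = np.array([1.0, 2.0, 3.0, 4.0, 10.0])
n = len(x)
p = 2  # parameters in simple linear regression: intercept + slope

# Leverage for simple linear regression (matches the formula in the text).
h = 1.0 / n + (x - x.mean()) ** 2 / np.sum((x - x.mean()) ** 2)

threshold = 2 * p / n
high_leverage = h > threshold
```

A useful sanity check: the leverages always sum to \(p\), so the \(2p/n\) rule flags points with more than twice the average leverage.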

Types of Residuals

  • Standardized: \(e_i / \text{RMSE}\)
  • Studentized: \(e_i / \sqrt{\text{MSE} \cdot (1 - h_{ii})}\)
  • Studentized-deleted (R-student):
    • Remove an observation, recalculate the regression, and compute the studentized residual.
    • The standard deviation is calculated without the point in question.
    • Large residual values indicate potentially influential points.
    • Formula: \[ \text{RSTUDENT}_i = \frac{e_i}{s_{(i)} \sqrt{1 - h_{ii}}} \] where \(s_{(i)}\) is the residual standard deviation computed with observation \(i\) removed.
  • Why different types? All three aim to stabilize variance so that residuals are comparable, but they differ in how they estimate the error term:
    • Standardized: Uses a single overall RMSE.
    • Studentized: Adjusts for leverage via \((1 - h_{ii})\).
    • R-student: Recalculates without the observation for more accurate influence detection.
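A NumPy sketch comparing the three residual types on hypothetical data (the notes work in R/SAS; this is only an illustration of the formulas above):

```python
import numpy as np

# Hypothetical data; the last observation is an outlier in y.
x = np.array([1.0, 2.0, 3.0, 4.0, 5.0, 6.0])
y = np.array([2.1, 3.9, 6.2, 8.0, 9.7, 15.0])
n, p = len(x), 2

X = np.column_stack([np.ones(n), x])
beta, *_ = np.linalg.lstsq(X, y, rcond=None)
e = y - X @ beta                      # ordinary residuals
H = X @ np.linalg.inv(X.T @ X) @ X.T  # hat matrix
h = np.diag(H)                        # leverages h_ii

mse = np.sum(e ** 2) / (n - p)        # MSE; RMSE is its square root
standardized = e / np.sqrt(mse)
studentized = e / np.sqrt(mse * (1 - h))

# R-student: re-estimate the error variance without observation i
# (closed form, so no refit is needed).
s2_del = ((n - p) * mse - e ** 2 / (1 - h)) / (n - p - 1)
r_student = e / np.sqrt(s2_del * (1 - h))
```

Because the deleted variance excludes the outlier's own contribution, the R-student residual for the outlying point stands out much more sharply than the other two versions.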

Leave-One-Out Statistics

These measures assess the impact of each observation by considering the model fit when that observation is omitted.

Let \(\hat{Y}_{i(i)}\) be the predicted value of \(Y_i\) when the \(i\)-th observation is left out of the regression model.

PRESS (Predicted Residual Sum of Squares)

\[ \text{PRESS}_p = \sum_{i=1}^n \left(Y_i - \hat{Y}_{i(i)}\right)^2 = \sum e_{i(i)}^2 \]

  • Smaller PRESS values indicate better-fitting models.
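PRESS can be computed without actually refitting the model \(n\) times, using the deleted-residual identity \(e_{i(i)} = e_i/(1 - h_{ii})\) given later in the residual summary table. A sketch with hypothetical data:

```python
import numpy as np

# Hypothetical data for a simple linear regression.
x = np.array([1.0, 2.0, 3.0, 4.0, 5.0])
y = np.array([1.2, 1.9, 3.2, 3.8, 5.1])
n = len(x)

X = np.column_stack([np.ones(n), x])
beta, *_ = np.linalg.lstsq(X, y, rcond=None)
e = y - X @ beta
h = np.diag(X @ np.linalg.inv(X.T @ X) @ X.T)

# PRESS via the shortcut: deleted residual e_i / (1 - h_ii).
press = np.sum((e / (1 - h)) ** 2)
```

The shortcut agrees exactly with the brute-force version that refits the regression once per left-out observation.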

Cook’s Distance (\(D_i\))

  • Combines information on residual size and leverage, and is equivalent to comparing predictions from the full model to those from a leave-one-out model: \[ D_i = \sum_{j=1}^n \frac{\left(\hat{Y}_j - \hat{Y}_{j(i)}\right)^2}{p \cdot \text{MSE}} = \frac{1}{p} (\text{studres}_i)^2 \left[\frac{h_{ii}}{1 - h_{ii}}\right] \]
  • Here, \(p\) is the number of parameters in the model (including the intercept).
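A sketch computing \(D_i\) from the leverage/residual form (hypothetical data; studres is the studentized residual defined above):

```python
import numpy as np

# Hypothetical data with one high-leverage point (x = 9).
x = np.array([1.0, 2.0, 3.0, 4.0, 9.0])
y = np.array([1.1, 2.0, 2.9, 4.2, 3.0])
n, p = len(x), 2

X = np.column_stack([np.ones(n), x])
XtX_inv = np.linalg.inv(X.T @ X)
beta = XtX_inv @ X.T @ y
e = y - X @ beta
h = np.diag(X @ XtX_inv @ X.T)
mse = np.sum(e ** 2) / (n - p)

r = e / np.sqrt(mse * (1 - h))           # studentized residuals
cooks_d = r ** 2 / p * h / (1 - h)       # Cook's distance, leverage/residual form
```

The two forms of \(D_i\) are algebraically identical, so this version matches the leave-one-out prediction-difference definition without any refitting.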

Durbin-Watson Test for Independence

\[ d = \frac{\sum_{i=2}^n (e_i - e_{i-1})^2}{\sum_{i=1}^n e_i^2} \]

  • Values near 0 indicate positive autocorrelation; values near 4 indicate negative autocorrelation. The statistic is symmetric about 2, and values near 2 suggest independent errors.
  • Available in R and SAS.
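A direct implementation of the statistic is short; the residual series below are made up to show the two extremes:

```python
import numpy as np

def durbin_watson(e):
    """Durbin-Watson statistic from the formula above."""
    e = np.asarray(e, dtype=float)
    return np.sum(np.diff(e) ** 2) / np.sum(e ** 2)

d_pos = durbin_watson([0.9, 1.1, 1.0, 0.8, 1.2])    # same-sign run: d near 0
d_neg = durbin_watson([1.0, -1.0, 1.0, -1.0, 1.0])  # alternating signs: d near 4
```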

Residual Types Summary

| Name | Expression | Use |
|------|------------|-----|
| Residual | \(e_i = y_i - \hat{y}_i\) | Residual plots |
| Standardized residual | \(r_i = \frac{e_i}{s \sqrt{1 - h_{ii}}}\) | Identify outliers |
| Studentized residual | \(t_i = \frac{e_i}{s_{(i)} \sqrt{1 - h_{ii}}}\) | Test outlying \(Y\) values |
| Deleted residual | \(e_{i(i)} = y_i - \hat{y}_{i(i)} = \frac{e_i}{1 - h_{ii}}\) | Calculate PRESS |

For studentized residuals, at \(\alpha = 0.05\) we expect about 5% to be greater than 2 or less than –2.

Residual Diagnostics Panel

Residual diagnostics panel. A collection of plots used to assess regression assumptions and identify potential problems. Several plots use standardized, studentized, or studentized-deleted (R-student) residuals to improve comparability across observations. These include: residuals vs. predicted values (checking for nonlinearity or heteroscedasticity), R-studentized residuals vs. predicted values (highlighting outliers), and R-studentized residuals vs. leverage (identifying high-leverage outliers). Other diagnostics include the normal QQ plot and histogram of residuals (assessing normality), predicted values vs. observed values (evaluating fit), Cook’s distance by observation (identifying influential points by combining leverage and residual size), and QQ plots of fitted means and residuals. Together, these diagnostics help evaluate model adequacy and guide possible remedies.

Graphical Assessment of Residuals

| Pattern | Potential Issue | Solution |
|---------|-----------------|----------|
| Linear means, constant SD | Model fits well | No action needed |
| Curved means, equal SD | Nonlinearity | Transform \(X\) |
| Curved means, increasing SD | Nonlinearity + heteroscedasticity | Transform \(Y\) |
| Skewed residuals | Non-normality | Can still model the mean, but CIs/PIs may be unreliable; consider transformations |
| Linear means, increasing SD | Heteroscedasticity | Use weighted regression |

Remedies for Violations

Nonlinearity

  • Add more complexity to the model.
  • Apply a transformation to \(X\).
  • Add another variable, which may also help nonconstant variance.

Nonconstant Variance

  • Transform \(Y\).
  • Use weighted least squares to down-weight observations with larger variance so they don’t influence the regression model as much as observations closer to the line.
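A minimal weighted-least-squares sketch, assuming (purely for illustration) that the error variance grows with \(x\), so weights \(w_i = 1/x_i\) down-weight the noisier observations:

```python
import numpy as np

# Hypothetical data; assume Var(Y_i) is proportional to x_i.
x = np.array([1.0, 2.0, 3.0, 4.0, 5.0])
y = np.array([1.1, 2.3, 2.8, 4.5, 4.6])
w = 1.0 / x  # inverse-variance weights (assumed variance structure)

X = np.column_stack([np.ones(len(x)), x])
W = np.diag(w)

# Weighted normal equations: (X'WX) beta = X'Wy
beta_wls = np.linalg.solve(X.T @ W @ X, X.T @ W @ y)
```

Equivalently, WLS is just OLS after scaling each row by \(\sqrt{w_i}\), which is a convenient way to check the result.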

Correlated Errors

  • Often detected via residual plots, where you may see small values follow small values, and large follow large.
  • Use time series or spatial models for serial effects within data.

Outliers

  • Use robust regression.
  • Check for data entry or other errors; delete observations only in that situation.

Non-normality

  • Often resolved by the remedies above, so address it last.
  • Consider transforming the data.

Residual Patterns and Suggested Remedies

Residual patterns to watch for. Each panel shows a scatter with bin-wise means and standard deviations (dark line and bars). Patterns suggest remedies: (a) compare the distribution of \(Y\) across \(X\); (b) curvature may call for transforming \(X\); (c) symmetric curvature suggests adding a quadratic term; (d) fan-shaped spread indicates transforming \(Y\); (e) right-skewed residuals should be reported; (f) variance that changes with \(X\) motivates weighted regression.

Nonconstant Variance and Transformations

When the spread of residuals changes with fitted values, transforming the response variable \(Y\) can often stabilize the variance.

Transformation Recommendations

  1. Log transformation
  2. Square root transformation
  3. Other transformations (e.g., reciprocal)

Examples of variance-stabilizing transformations. When the spread of residuals changes with fitted values, transforming the response variable \(Y\) can help stabilize variance. The panels illustrate three common recommendations: (a) a log transformation for decreasing variance, (b) a square root transformation for increasing variance, and (c) a reciprocal transformation for variance proportional to the mean. These transformations change the interpretation of model coefficients, and results may need to be back-transformed for reporting.

Interpretation Considerations

  • Adjust interpretation to the audience, transformation type, and variables transformed.
  • Back-transform results when necessary.

Log Transformations: Types and Interpretations

Log transformations can be highly effective and may yield more interpretable regression results.
The three most common forms are log-linear, linear-log, and log-log models.

Tip: Useful Properties for Log Transformations
  1. If \(X \sim N(\mu, \sigma)\), then \(\text{mean}(X) = \text{median}(X)\)
  2. \(\log(\text{median}(X)) = \text{median}(\log(X))\)
  3. \(a^{\log_a y} = y\) for \(a > 0\)
  4. \(\log A - \log B = \log\left(\frac{A}{B}\right)\)
  5. \(a \log b = \log b^a\)

Log-Linear Transformation (Log on Response)

\[ \log(\hat{Y}_i) = \beta_0 + \beta_1 X_i \]

  • Conditional on \(X\), assume \(\log(Y)\) is normally distributed.
  • Median response: \[ \text{Median}(Y \mid X) = e^{\beta_0 + \beta_1 X} \]
  • The mean of the logged response is linearly related to \(X\).
  • Multiplicative interpretation:
    A one-unit increase in \(X\) changes the median of \(Y\) by a factor of \(\exp(\beta_1)\): \[ \frac{\text{Median}(Y \mid X+1)}{\text{Median}(Y \mid X)} = \exp(\beta_1) \]

Derivation: From log response to median interpretation

  1. Start from the conditional mean: \[ \mu\{\log(Y) \mid X\} = \beta_0 + \beta_1 X \]
  2. If \(\log(Y) \mid X\) is normal, then mean equals median: \[ \text{Median}\{\log(Y) \mid X\} = \beta_0 + \beta_1 X \]
  3. Use \(\log(\text{Median}(Y)) = \text{Median}(\log Y)\): \[ \log(\text{Median}(Y) \mid X) = \beta_0 + \beta_1 X \]
  4. Exponentiate both sides: \[ \text{Median}(Y) \mid X = e^{\beta_0 + \beta_1 X} \]
  5. For \(X+1\): \[ \text{Median}(Y) \mid X+1 = e^{\beta_0 + \beta_1(X+1)} \]
  6. Take the ratio: \[ \frac{\text{Median}(Y) \mid X+1}{\text{Median}(Y) \mid X} = \frac{e^{\beta_0 + \beta_1 (X+1)}}{e^{\beta_0 + \beta_1 X}} = \frac{e^{\beta_0} e^{\beta_1 (X+1)}}{e^{\beta_0} e^{\beta_1 X}} = \frac{e^{\beta_1 X} e^{\beta_1}}{e^{\beta_1 X}} = e^{\beta_1} \]

Interpretation: A one unit increase in \(X\) is associated with an \(e^{\beta_1}\) multiplicative change in the median of \(Y\).
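A numeric check of this derivation, assuming an exact (noise-free) log-linear relationship so the fitted coefficients recover \(\beta_0\) and \(\beta_1\):

```python
import numpy as np

# Exact log-linear relationship (hypothetical coefficients, no noise).
beta0, beta1 = 0.5, 0.2
x = np.array([1.0, 2.0, 3.0, 4.0, 5.0])
y = np.exp(beta0 + beta1 * x)

X = np.column_stack([np.ones(len(x)), x])
b, *_ = np.linalg.lstsq(X, np.log(y), rcond=None)  # regress log(Y) on X

median_at = lambda x0: np.exp(b[0] + b[1] * x0)    # fitted median of Y given X
ratio = median_at(3.0 + 1.0) / median_at(3.0)      # should equal exp(beta1)
```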

Linear-Log Transformation (Log on Explanatory Variable)

\[ \hat{Y}_i = \beta_0 + \beta_1 \log (X_i) \]

  • Interpretation depends on the log base:
    • Doubling \(X\) changes \(Y\) by \(\beta_1 \log(2)\) units on average.
    • A tenfold increase changes \(Y\) by \(\beta_1 \log(10)\) units.

Derivation: Mean change after doubling \(X\)

  1. Start from: \[ \mu\{Y \mid \log(X)\} = \beta_0 + \beta_1 \log(X) \]
  2. For \(X\) replaced by \(2X\): \[ \mu\{Y \mid \log(2X)\} = \beta_0 + \beta_1 \log(2X) \]
  3. Difference in means: \[ \mu\{Y \mid \log(2X)\} - \mu\{Y \mid \log(X)\} = \big[ \beta_0 + \beta_1 \log(2X) \big] - \big[ \beta_0 + \beta_1 \log(X) \big] \]
  4. Simplify: \[ = \beta_1 \left[\log(2X) - \log(X)\right] = \beta_1 \log\left( \frac{2X}{X} \right) = \beta_1 \log(2) \]

Interpretation:
A doubling of \(X\) is associated with a \(\beta_1 \log(2)\) unit change in the mean of \(Y\).
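The same check for the linear-log case, again with an exact (noise-free) relationship and hypothetical coefficients:

```python
import numpy as np

# Exact linear-log relationship: Y = beta0 + beta1 * log(X).
beta0, beta1 = 2.0, 3.0
x = np.array([1.0, 2.0, 4.0, 8.0])
y = beta0 + beta1 * np.log(x)

X = np.column_stack([np.ones(len(x)), np.log(x)])
b, *_ = np.linalg.lstsq(X, y, rcond=None)  # regress Y on log(X)

mean_at = lambda x0: b[0] + b[1] * np.log(x0)
diff = mean_at(10.0) - mean_at(5.0)  # doubling X: should equal beta1 * log(2)
```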

Log-Log Transformation (Both Variables Logged)

\[ \log (\hat{Y}_i) = \beta_0 + \beta_1 \log (X_i) \]

  • Models a multiplicative effect:
    • Doubling \(X\) changes the median of \(Y\) by a factor of \(2^{\beta_1}\).
    • A tenfold increase changes the median of \(Y\) by \(10^{\beta_1}\).

Derivation: Effect of doubling \(X\)

  1. Start from: \[ \mu \{\log(Y) \mid \log(X)\} = \beta_0 + \beta_1 \log(X) \]
  2. If \(\log(Y)\mid X\) is normal, then mean equals median: \[ \text{Median}\{\log(Y) \mid \log(X)\} = \beta_0 + \beta_1 \log(X) \]
  3. Use \(\log(\text{Median}(Y)) = \text{Median}(\log Y)\): \[ \log(\text{Median}(Y) \mid \log(X)) = \beta_0 + \beta_1 \log(X) \]
  4. Exponentiate: \[ e^{\log\big(\text{Median}(Y) \mid \log(X)\big)} = e^{\beta_0 + \beta_1 \log(X)} \]
  5. Simplify: \[ \text{Median}(Y) \mid \log(X) = e^{\beta_0 + \beta_1 \log(X)} \]
  6. For \(X\) replaced by \(2X\): \[ \text{Median}(Y) \mid \log(2X) = e^{\beta_0 + \beta_1 \log(2X)} \]
  7. Ratio: \[ \frac{\text{Median}(Y) \mid \log(2X)}{\text{Median}(Y) \mid \log(X)} = \frac{e^{\beta_0 + \beta_1 \log(2X)}}{e^{\beta_0 + \beta_1 \log(X)}} = \frac{e^{\beta_0}e^{\beta_1 \log(2X)}}{e^{\beta_0}e^{\beta_1 \log(X)}} = \frac{e^{\beta_1 \log(2X)}}{e^{\beta_1 \log(X)}} = e^{\beta_1 \log(2X) - \beta_1 \log(X)} \]
  8. Simplify: \[ = e^{\beta_1[\log(2X) - \log(X)]} = e^{\beta_1 \log\left(\frac{2X}{X}\right)} = e^{\beta_1 \log(2)} = 2^{\beta_1} \]

Interpretation:
A doubling of \(X\) is associated with a \(2^{\beta_1}\) multiplicative change in the median of \(Y\).
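And a numeric check of the log-log derivation, assuming an exact power law \(Y = e^{\beta_0} X^{\beta_1}\) with hypothetical coefficients:

```python
import numpy as np

# Exact power-law relationship (no noise), so the fit recovers the coefficients.
beta0, beta1 = 1.0, 0.7
x = np.array([1.0, 2.0, 4.0, 8.0, 16.0])
y = np.exp(beta0) * x ** beta1

X = np.column_stack([np.ones(len(x)), np.log(x)])
b, *_ = np.linalg.lstsq(X, np.log(y), rcond=None)  # regress log(Y) on log(X)

median_at = lambda x0: np.exp(b[0] + b[1] * np.log(x0))
ratio = median_at(6.0) / median_at(3.0)  # doubling X: should equal 2 ** beta1
```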

Summary Table: Interpretation by Model Type

| Model Type | Equation | Interpretation |
|------------|----------|----------------|
| Log-linear | \(\log Y = \beta_0 + \beta_1 X\) | 1 unit increase in \(X\) \(\rightarrow\) \(\times\, e^{\beta_1}\) change in median \(Y\) |
| Linear-log | \(Y = \beta_0 + \beta_1 \log X\) | Doubling \(X\) \(\rightarrow\) \(\beta_1 \log(2)\) change in mean \(Y\) |
| Log-log | \(\log Y = \beta_0 + \beta_1 \log X\) | Doubling \(X\) \(\rightarrow\) \(\times\, 2^{\beta_1}\) change in median \(Y\) |

Note: Logging \(Y\) generally shifts interpretation from the mean to the median.

Formal Test for Lack of Fit

  • Use: Requires replicated \(X\) values with different \(Y\) observations.
  • Assumptions:
    • Normality of \(Y|X\)
    • Independence of \((X, Y)\) pairs
    • Constant variance of \(Y\) across all values of \(X\)

Procedure

  1. Fit a linear regression model, obtain \(SS_{\text{res}_{LR}}\).
  2. Fit a separate means model (ANOVA), obtain \(SS_{\text{res}_{SM}}\).
  3. Null hypothesis: Linear model fits.
  4. Alternative: Linear model is inadequate, i.e., there is systematic variation the straight-line model cannot explain.
  5. Compute: \[ F = \frac{(SS_{\text{res}_{LR}} - SS_{\text{res}_{SM}}) / (\text{df}_{LR} - \text{df}_{SM})}{\text{MSE}_{SM}} \]
  • Guidance: Even a good-fitting model may not be the best model.
    • Principle of parsimony: Use the simplest model that explains the most variation.

Strategy

  • ANOVA compares the equal means model to the linear model.
  • If the lack-of-fit test fails to reject, the linear model is adequate: it fits comparably to the separate means model.
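The procedure can also be sketched outside SAS; a minimal NumPy version with hypothetical replicated data (three \(y\) values at each of four \(x\) levels):

```python
import numpy as np

# Hypothetical replicated data: three observations per x level.
x = np.repeat([1.0, 2.0, 3.0, 4.0], 3)
y = np.array([1.0, 1.2, 0.9, 2.1, 2.0, 2.3, 2.8, 3.1, 3.0, 4.2, 3.9, 4.1])
n = len(x)

# Reduced model: simple linear regression.
X = np.column_stack([np.ones(n), x])
beta, *_ = np.linalg.lstsq(X, y, rcond=None)
ss_res_lr = np.sum((y - X @ beta) ** 2)
df_lr = n - 2

# Full model: a separate mean for each distinct x level (pure error).
levels = np.unique(x)
ss_res_sm = sum(np.sum((y[x == lv] - y[x == lv].mean()) ** 2) for lv in levels)
df_sm = n - len(levels)

# Extra-sum-of-squares F statistic, matching the formula in the text.
F = ((ss_res_lr - ss_res_sm) / (df_lr - df_sm)) / (ss_res_sm / df_sm)
```

The p-value is then the upper tail of an \(F\) distribution with \((\text{df}_{LR} - \text{df}_{SM},\ \text{df}_{SM})\) degrees of freedom.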

Example in SAS

The following SAS code fits both a simple linear regression model and a separate means model (one mean per iron content level). The ANOVA tables from each model are then used to compute the extra-sum-of-squares (lack-of-fit) test manually.

/* linear regression model */
proc glm data = IronCor;
model Corrosion = IronContent / solution;
run;

/* separate means model (7 groups) */
proc glm data = IronCor;
class IronContent;
model Corrosion = IronContent;
run;

data critval;
criticalValue = finv(0.95, 5, 6);
run;
proc print data = critval;
run;

data pval;
pValue = 1-probf(9.28, 5, 6);
run;
proc print data = pval;
run;

Boxplot showing variation in corrosion measurements at each iron content level, used in assessing model fit.

Fit plot with 95% confidence and prediction intervals for the linear regression model of corrosion versus iron content.

ANOVA output from SAS for the linear regression model (left) and separate means model (right) used in the lack-of-fit test.

Extra-sum-of-squares table showing the manual calculation of the lack-of-fit test from SAS output.

| Source | DF | SS | MS | F | Pr > F |
|--------|----|-----|-----|------|--------|
| Model | 5 | 91.07 | 18.21 | 9.28 | 0.0009 |
| Error / Full (SMM) | 6 | 11.78 | 1.96 | | |
| Total / Reduced (LRM) | 11 | 102.85 | | | |

  • \(H_0\): Linear regression model fits well (no lack of fit).
  • \(H_a\): Separate means model fits better (lack of fit).

Conclusion: The lack-of-fit test compares the linear regression model (reduced) to the separate means model (full). There is strong evidence that the linear regression model lacks fit relative to the separate means model (p-value = 0.0009).
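The arithmetic in the extra-sum-of-squares table can be reproduced directly from the tabled SS and DF values:

```python
# Values taken from the extra-sum-of-squares table above.
ss_res_lr, df_lr = 102.85, 11  # reduced model (linear regression)
ss_res_sm, df_sm = 11.78, 6    # full model (separate means)

lack_of_fit_ss = ss_res_lr - ss_res_sm          # 91.07, the "Model" row
F = (lack_of_fit_ss / (df_lr - df_sm)) / (ss_res_sm / df_sm)  # about 9.28
```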