Data Screening, Assumptions, and Transformations

Objectives

Review key elements of experimental design that support valid inference.
Understand conditions and assumptions required for t-tests.
Identify and handle non-normality, unequal variance, and outliers.
Apply and interpret log transformations in skewed data.
Distinguish between robustness and resistance in statistical tools.
Use appropriate visual tools to assess assumptions and variance.

Experimental Design

Randomization: Reduces potential bias by ensuring treatment groups are comparable.
Placebo: Controls for confounding variables by ensuring that only the treatment effect—the variable being tested—differs between groups.
Blinding: Minimizes bias in outcomes by preventing subjects or researchers from knowing the treatment assignments.

Conditions for Null Hypothesis Significance Testing (NHST)

Random sampling
Independent observations
Representative of the population: The sample reflects the population of interest.
Quantitative data
Nearly normal distribution
Equal standard deviations for two-sample tests
- Also called homoscedasticity, meaning the groups have the same shape and spread.

Paired t-Test

Used when the assumption of independence is violated due to pairing.
Compares the difference between paired observations using a one-sample t-test.
Common in before-and-after studies or matched-subject designs.

Tools for Checking Normality

Boxplot
- Visualizes the five-number summary.
- Highlights symmetry, skewness, and tail behavior.
- Useful for comparing groups side by side.
Dotplot
- Displays individual data points.
- Easy to construct and interpret for small to moderate sample sizes.
Histogram
- Shows the distribution’s shape and symmetry.
- Useful for identifying skewness or multiple modes.
Normal Quantile (QQ) Plot
- Plots theoretical normal values (X-axis) against observed data (Y-axis).
- Points that align closely along a straight line suggest normality.
- Sensitive to departures from normality, especially in the tails.

Robustness

A procedure is robust if its results remain valid despite minor violations of assumptions.
Example: A 95% confidence interval should still capture the true parameter 95% of the time, even if the data are not perfectly normal.

Moderate Robustness in t-Tools

Sample size effects:
- Larger samples tolerate greater departures from normality.
- Exception: Heavy-tailed distributions (i.e., many large outliers) can still distort results.
Two-sample t-tests:
- Problems arise when the groups have different shapes or skewness, or when standard deviations are unequal.
- The worst-case scenario occurs when the group with the smaller sample size also has the larger standard deviation.
  - In this case, the sample may fail to accurately represent the population’s variability.

Independence

Definition: Observations are independent if knowing one value provides no information about another.
- Independence must be built into the experimental design.
Cluster effects:
- Arise when data are grouped in natural subunits (e.g., littermates, classrooms).
- Violates independence unless each cluster is treated as a single observational unit or analyzed appropriately.
Serial effects:
- Occur when measurements are taken over time (e.g., time series data).
- Nearby values may be correlated, violating independence.

Outliers

Definition: Observations that fall far from the central tendency of the data (e.g., the group average).
Effects:
- Can create long-tailed distributions.
- t-statistics are sensitive to outliers and can be distorted by extreme values.

Dealing with Outliers

Do not remove outliers unless they result from data entry or measurement error.
Run analyses with and without outliers and compare the results.
Report both versions to provide transparency.

Resistance

A procedure is resistant if its results remain stable when small parts of the data change.
- Example: The median is resistant to outliers, while the mean is not.
t-tools are based on means and are not resistant to outliers or long-tailed distributions.

Log Transformation (Natural Log)

When to use:
- The ratio of the largest to smallest value is greater than 10.
- The data are skewed.
- The group with the larger mean also has the larger variance.
Effects:
- Reduces skew, improves normality, and corrects non-constant variance.
- Requires back-transformation to interpret results (e.g., medians, confidence intervals) on the original scale.

Other Transformations

Square root
Inverse
Reciprocal
> Note: These transformations are harder to back-transform than log transformations and may complicate interpretation.

Log Transformations: Propositions from Log Properties

These propositions come from basic properties of logarithms and allow us to translate log-scale comparisons back to the original measurement scale.

Normal distribution: In a normal distribution, the mean equals the median.
Monotonicity: The log function is monotonically increasing, so \(\log(\text{median}(x)) = \text{median}(\log(x))\).
Log differences: \(\log(a) - \log(b) = \log\left(\frac{a}{b}\right)\)
Exponentiation: \(e^{\log(x)} = x\)

t-Test Interpretation (Log-Transformed Data)

We apply the previous propositions to interpret results from a t-test on log-transformed data. Suppose: \[ \text{mean}(\log(x)) - \text{mean}(\log(y)) = \gamma \]

Because the distribution is approximately normal, we assume the mean and median are roughly equal: \[ \text{median}(\log(x)) - \text{median}(\log(y)) = \gamma \]

The logarithmic function is monotonically increasing: \[ \log(\text{median}(x)) - \log(\text{median}(y)) = \gamma \]

Rewriting using the log difference identity: \[ \log\left(\frac{\text{median}(x)}{\text{median}(y)}\right) = \gamma \]

Exponentiating both sides: \[ e^{\log\left(\frac{\text{median}(x)}{\text{median}(y)}\right)} = e^\gamma \]

Therefore: \[ \frac{\text{median}(x)}{\text{median}(y)} = e^\gamma \]

Interpretation

The median of group \(x\) is estimated to be \(e^\gamma\) times the median of group \(y\).

Inequality of Variance

Visual evidence should be the primary tool for detecting unequal variances (e.g., side-by-side boxplots or spread differences in histograms).
F-test: A formal test for equality of variances, but it is sensitive to violations of normality.
- \(H_0\): Population variances are equal.
- \(H_a\): Population variances are not equal.
Use hypothesis tests for equality of variance with caution, and only when assumptions are satisfied.

General Rules of Thumb

When sample sizes are equal and large, t-tools are generally robust.
When standard deviations differ:
- If sample sizes are equal: t-tools are still valid with large samples.
- If sample sizes are unequal: t-tools are not valid and alternative methods should be considered.

Welch’s t-Test

Designed to handle situations where population standard deviations are unequal.
Adjusts the degrees of freedom using the Satterthwaite approximation.
Still assumes normality in the underlying populations.

Non-Normal Distributions

Long-tailed distributions:
- Can produce wider confidence intervals than expected (e.g., greater than the nominal \((1 - \alpha)\%\) level—for example, more than 95% when using a 95% CI), leading to fewer rejections of the null hypothesis.
- These wider intervals increase the chance of capturing the true mean \(\mu\), pushing the actual coverage above \((1 - \alpha)\%\) and resulting in rejection rates below \(\alpha\).
Unequal sample sizes and variances
- When both sample sizes and standard deviations differ substantially, confidence intervals may become too narrow or too wide.
- This can cause actual coverage rates to deviate from the nominal 95% level—either above or below—depending on the direction of the imbalance.

See Displays 3.4 and 3.5 in The Statistical Sleuth (Ramsey and Schafer 2012) for simulation-based evidence.

References

Ramsey, F., and D. Schafer. 2012. The Statistical Sleuth: A Course in Methods of Data Analysis. Cengage Learning. https://books.google.com/books?id=jfoKAAAAQBAJ.