Data Screening, Assumptions, and Transformations
Objectives
- Review key elements of experimental design that support valid inference.
- Understand conditions and assumptions required for t-tests.
- Identify and handle non-normality, unequal variance, and outliers.
- Apply and interpret log transformations in skewed data.
- Distinguish between robustness and resistance in statistical tools.
- Use appropriate visual tools to assess assumptions and variance.
Experimental Design
- Randomization: Reduces potential bias by assigning treatments by chance, so that treatment groups are comparable on average.
- Placebo: Gives the control group an inert treatment so that the groups differ only in the active treatment being tested, which separates the treatment effect from the effect of simply being treated.
- Blinding: Minimizes bias in outcomes by preventing subjects or researchers from knowing the treatment assignments.
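
Random assignment can be sketched in a few lines. The example below is only an illustration: the subject IDs, the `randomize` helper, and the fixed seed are all hypothetical and not part of the original notes.

```python
import random

def randomize(subjects, seed=None):
    """Randomly split subjects into two groups of (nearly) equal size."""
    rng = random.Random(seed)
    shuffled = list(subjects)  # copy so the caller's list is left untouched
    rng.shuffle(shuffled)
    half = len(shuffled) // 2
    return shuffled[:half], shuffled[half:]

# Hypothetical subject IDs; the seed is fixed only to make the example repeatable.
subjects = [f"S{i:02d}" for i in range(1, 13)]
treatment, control = randomize(subjects, seed=42)
print("treatment:", sorted(treatment))
print("control:  ", sorted(control))
```

Because chance alone decides the split, no characteristic of a subject can systematically influence which group it lands in.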
Conditions for Null Hypothesis Significance Testing (NHST)
- Random sampling
- Independent observations
- Representative of the population: The sample reflects the population of interest.
- Quantitative data
- Nearly normal distribution
- Equal standard deviations for two-sample tests
  - Also called homoscedasticity (equal variance), meaning the groups have the same spread.
Paired t-Test
- Used when the assumption of independence is violated due to pairing.
- Compares the difference between paired observations using a one-sample t-test.
- Common in before-and-after studies or matched-subject designs.
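
As a minimal sketch of the idea that a paired t-test is a one-sample t-test on the differences, the example below computes the t statistic by hand. The `before` and `after` values and the `paired_t` helper are hypothetical.

```python
import math
import statistics

# Hypothetical before/after measurements on the same six subjects.
before = [12.1, 11.4, 13.0, 12.7, 11.9, 12.5]
after = [11.6, 11.1, 12.2, 12.4, 11.2, 12.0]

def paired_t(x, y):
    """Return (t statistic, degrees of freedom) for a paired t-test of x vs. y."""
    if len(x) != len(y):
        raise ValueError("paired samples must have the same length")
    diffs = [a - b for a, b in zip(x, y)]
    n = len(diffs)
    mean_diff = statistics.mean(diffs)
    se = statistics.stdev(diffs) / math.sqrt(n)  # standard error of the mean difference
    return mean_diff / se, n - 1

t_stat, df = paired_t(before, after)
print(f"t = {t_stat:.2f} on {df} degrees of freedom")
```

The statistic is compared against a t distribution with n - 1 degrees of freedom, where n is the number of pairs rather than the number of measurements.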
Tools for Checking Normality
- Boxplot
  - Visualizes the five-number summary.
  - Highlights symmetry, skewness, and tail behavior.
  - Useful for comparing groups side by side.
- Dotplot
  - Displays individual data points.
  - Easy to construct and interpret for small to moderate sample sizes.
- Histogram
  - Shows the distribution’s shape and symmetry.
  - Useful for identifying skewness or multiple modes.
- Normal Quantile (QQ) Plot
  - Plots theoretical normal values (X-axis) against observed data (Y-axis).
  - Points that align closely along a straight line suggest normality.
  - Sensitive to departures from normality, especially in the tails.
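
To make the construction of a normal quantile plot concrete, the sketch below computes the plotted coordinates without drawing anything. The `normal_qq_points` helper and the sample are hypothetical, and the plotting positions `(i - 0.5) / n` are one common convention among several.

```python
import statistics

def normal_qq_points(data):
    """Return (theoretical, observed) pairs for a normal quantile plot."""
    observed = sorted(data)
    n = len(observed)
    std_normal = statistics.NormalDist()  # mean 0, standard deviation 1
    # Theoretical quantiles at the plotting positions (i - 0.5) / n.
    theoretical = [std_normal.inv_cdf((i - 0.5) / n) for i in range(1, n + 1)]
    return list(zip(theoretical, observed))

# Hypothetical right-skewed sample.
sample = [1.2, 1.5, 1.7, 2.0, 2.1, 2.6, 3.4, 4.8, 7.9, 15.3]
points = normal_qq_points(sample)
for z, x in points:
    print(f"{z:6.2f}  {x:6.1f}")
```

In this sample the largest observed values grow much faster than their theoretical counterparts, which is the upward bend in the right tail that a right-skewed distribution produces on the plot.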
Robustness
- A procedure is robust if its results remain valid despite minor violations of assumptions.
- Example: A 95% confidence interval should still capture the true parameter 95% of the time, even if the data are not perfectly normal.
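
Coverage can be checked by simulation. The sketch below is an illustration under stated assumptions, not a reproduction of any published study: the sample size of 10, the three distributions, the number of repetitions, and the seed are all arbitrary choices, and the critical value is taken from a standard t table.

```python
import math
import random
import statistics

N = 10          # sample size per simulated study
T_CRIT = 2.262  # 97.5th percentile of the t distribution with N - 1 = 9 df (table value)

def covers(sample, true_mean):
    """Return True if the nominal 95% t-interval contains true_mean."""
    center = statistics.mean(sample)
    half_width = T_CRIT * statistics.stdev(sample) / math.sqrt(len(sample))
    return abs(center - true_mean) <= half_width

def coverage(draw, true_mean, reps=5000, seed=1):
    """Estimate the fraction of simulated intervals that capture true_mean."""
    rng = random.Random(seed)
    hits = 0
    for _ in range(reps):
        sample = [draw(rng) for _ in range(N)]
        hits += covers(sample, true_mean)
    return hits / reps

normal_cov = coverage(lambda rng: rng.gauss(0.0, 1.0), true_mean=0.0)
uniform_cov = coverage(lambda rng: rng.random(), true_mean=0.5)
skewed_cov = coverage(lambda rng: rng.expovariate(1.0), true_mean=1.0)
print(f"normal:      {normal_cov:.3f}")
print(f"uniform:     {uniform_cov:.3f}")
print(f"exponential: {skewed_cov:.3f}")
```

The normal and uniform cases should land close to the nominal 95%, while the strongly skewed exponential case typically falls noticeably below it at this sample size, showing where robustness starts to break down.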
Moderate Robustness in t-Tools
- Sample size effects:
  - Larger samples tolerate greater departures from normality.
  - Exception: Heavy-tailed distributions (i.e., many large outliers) can still distort results.
- Two-sample t-tests:
  - Problems arise when the groups have different shapes or skewness, or when standard deviations are unequal.
  - The worst-case scenario occurs when the group with the smaller sample size also has the larger standard deviation.
    - In this case, the pooled standard deviation is dominated by the larger, less variable group, so the standard error of the difference is underestimated.
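
The worst-case scenario can be illustrated by comparing the pooled standard error with one that does not assume equal standard deviations. The summary statistics and helper names below are hypothetical.

```python
import math

def pooled_se(n1, s1, n2, s2):
    """Standard error of the mean difference using the pooled SD."""
    sp2 = ((n1 - 1) * s1**2 + (n2 - 1) * s2**2) / (n1 + n2 - 2)
    return math.sqrt(sp2 * (1 / n1 + 1 / n2))

def unpooled_se(n1, s1, n2, s2):
    """Standard error of the mean difference without assuming equal SDs."""
    return math.sqrt(s1**2 / n1 + s2**2 / n2)

# Hypothetical case 1: the small group (n = 5) is the more variable one.
worst_pooled = pooled_se(5, 10.0, 50, 2.0)
worst_unpooled = unpooled_se(5, 10.0, 50, 2.0)

# Hypothetical case 2: the large group (n = 50) is the more variable one.
other_pooled = pooled_se(5, 2.0, 50, 10.0)
other_unpooled = unpooled_se(5, 2.0, 50, 10.0)

print(f"small group more variable: pooled {worst_pooled:.2f}, unpooled {worst_unpooled:.2f}")
print(f"large group more variable: pooled {other_pooled:.2f}, unpooled {other_unpooled:.2f}")
```

In the first case the pooled standard error is far too small, which inflates t statistics and narrows intervals; in the second it is too large, which makes the pooled procedure conservative.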
Independence
- Definition: Observations are independent if knowing one value provides no information about another.
- Independence must be built into the experimental design.
- Cluster effects:
  - Arise when data are grouped in natural subunits (e.g., littermates, classrooms).
  - Violate independence unless each cluster is treated as a single observational unit or analyzed appropriately.
- Serial effects:
  - Occur when measurements are taken over time (e.g., time series data).
  - Nearby values may be correlated, violating independence.
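
One simple screen for serial effects is the lag-1 autocorrelation, the correlation between each measurement and the one that follows it. The sketch below uses two hypothetical series and a hypothetical helper name.

```python
import statistics

def lag1_autocorrelation(series):
    """Sample lag-1 autocorrelation of an evenly spaced series."""
    mean = statistics.mean(series)
    deviations = [x - mean for x in series]
    numerator = sum(a * b for a, b in zip(deviations, deviations[1:]))
    denominator = sum(d * d for d in deviations)
    return numerator / denominator

# Hypothetical measurements taken in time order.
drifting = [1.0, 1.4, 1.9, 2.3, 2.2, 2.8, 3.1, 3.7, 3.9, 4.4]
alternating = [1.0, 3.0, 1.0, 3.0, 1.0, 3.0, 1.0, 3.0]

r_drifting = lag1_autocorrelation(drifting)
r_alternating = lag1_autocorrelation(alternating)
print(f"drifting:    {r_drifting:.2f}")
print(f"alternating: {r_alternating:.2f}")
```

Values far from zero in either direction suggest that neighboring observations carry information about each other, so the usual t-tool standard errors should not be trusted without further modeling.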
Outliers
- Definition: Observations that fall far from the central tendency of the data (e.g., the group average).
- Effects:
  - Can create long-tailed distributions.
  - t-statistics are sensitive to outliers and can be distorted by extreme values.
Dealing with Outliers
- Do not remove outliers unless they result from data entry or measurement error.
- Run analyses with and without outliers and compare the results.
- Report both versions to provide transparency.
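
The with-and-without comparison can be sketched as follows. The data, the suspect value, and the `pooled_t` helper are hypothetical; the point is only to show both analyses side by side.

```python
import math
import statistics

def pooled_t(x, y):
    """Two-sample pooled-variance t statistic for mean(x) - mean(y)."""
    n1, n2 = len(x), len(y)
    sp2 = ((n1 - 1) * statistics.variance(x)
           + (n2 - 1) * statistics.variance(y)) / (n1 + n2 - 2)
    se = math.sqrt(sp2 * (1 / n1 + 1 / n2))
    return (statistics.mean(x) - statistics.mean(y)) / se

control = [4.1, 4.4, 3.9, 4.6, 4.2, 4.0]
treated = [5.0, 5.3, 4.8, 5.1, 5.4, 12.9]  # the last value is the suspect observation

t_with = pooled_t(treated, control)
t_without = pooled_t(treated[:-1], control)  # same analysis, suspect value set aside
print(f"with outlier:    t = {t_with:.2f}")
print(f"without outlier: t = {t_without:.2f}")
```

Here the single extreme value inflates the standard deviation enough to mask a difference that is clear without it. When the two analyses disagree this sharply, both should be reported.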
Resistance
- A procedure is resistant if its results remain stable when small parts of the data change.
- Example: The median is resistant to outliers, while the mean is not.
- t-tools are based on means and are not resistant to outliers or long-tailed distributions.
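
A small numeric illustration of resistance, using hypothetical values: changing one observation drastically moves the mean but leaves the median where it was.

```python
import statistics

values = [10, 12, 11, 13, 12, 11, 12]
contaminated = values[:-1] + [120]  # replace the last value with an extreme one

mean_before, mean_after = statistics.mean(values), statistics.mean(contaminated)
median_before, median_after = statistics.median(values), statistics.median(contaminated)

print(f"mean:   {mean_before:.2f} -> {mean_after:.2f}")
print(f"median: {median_before} -> {median_after}")
```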
Log Transformation (Natural Log)
- When to use:
  - The ratio of the largest to smallest value is greater than 10.
  - The data are skewed.
  - The group with the larger mean also has the larger variance.
- Effects:
  - Can reduce skew, improve normality, and stabilize non-constant variance.
  - Requires back-transformation to interpret results (e.g., medians, confidence intervals) on the original scale.
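
The screening criteria above can be checked numerically. In the sketch below the two groups are hypothetical (the second is simply the first multiplied by 10), and all values are positive, which the log requires.

```python
import math
import statistics

# Hypothetical positive measurements; group_b is group_a scaled by 10.
group_a = [2.0, 3.0, 4.5, 6.0, 9.0]
group_b = [20.0, 30.0, 45.0, 60.0, 90.0]

everything = group_a + group_b
range_ratio = max(everything) / min(everything)  # rule of thumb: consider logs if > 10

log_a = [math.log(v) for v in group_a]
log_b = [math.log(v) for v in group_b]

print(f"largest / smallest = {range_ratio:.0f}")
print(f"raw SDs: {statistics.stdev(group_a):.2f} vs {statistics.stdev(group_b):.2f}")
print(f"log SDs: {statistics.stdev(log_a):.3f} vs {statistics.stdev(log_b):.3f}")
```

On the raw scale the group with the larger mean has ten times the spread; on the log scale the two spreads match, because a multiplicative difference between groups becomes an additive shift.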
Other Transformations
- Square root
- Reciprocal (inverse)
> Note: Results from these transformations are harder to interpret after back-transformation than results from a log transformation.
Log Transformations: Propositions from Log Properties
These propositions come from basic properties of logarithms and allow us to translate log-scale comparisons back to the original measurement scale.
- Normal distribution: In a normal distribution, the mean equals the median.
- Monotonicity: The log function is monotonically increasing, so \(\log(\text{median}(x)) = \text{median}(\log(x))\).
- Log differences: \(\log(a) - \log(b) = \log\left(\frac{a}{b}\right)\)
- Exponentiation: \(e^{\log(x)} = x\)
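
These propositions can be checked numerically. The sketch below uses hypothetical data and also shows two caveats worth keeping in mind: the median identity is exact when the median is an observed value (odd sample size) but only approximate when the median averages two middle values, and no such identity holds for the mean.

```python
import math
import statistics

odd = [3.0, 8.0, 1.5, 20.0, 5.0]   # odd length: the median is an observed value
even = [1.0, 2.0, 8.0, 100.0]      # even length: the median averages two values

def log_of_median(x):
    return math.log(statistics.median(x))

def median_of_logs(x):
    return statistics.median([math.log(v) for v in x])

print(log_of_median(odd), median_of_logs(odd))    # identical
print(log_of_median(even), median_of_logs(even))  # differ for this even-length sample

# The mean does not commute with the log.
log_of_mean = math.log(statistics.mean(odd))
mean_of_logs = statistics.mean([math.log(v) for v in odd])
print(log_of_mean, mean_of_logs)

# Log differences and exponentiation.
print(math.log(8.0) - math.log(2.0), math.log(8.0 / 2.0))
print(math.exp(math.log(5.0)))
```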
t-Test Interpretation (Log-Transformed Data)
We apply the previous propositions to interpret results from a t-test on log-transformed data. Suppose: \[ \text{mean}(\log(x)) - \text{mean}(\log(y)) = \gamma \]
Because the log-transformed data are approximately normal, we assume the mean and median on the log scale are roughly equal: \[ \text{median}(\log(x)) - \text{median}(\log(y)) = \gamma \]
The logarithmic function is monotonically increasing: \[ \log(\text{median}(x)) - \log(\text{median}(y)) = \gamma \]
Rewriting using the log difference identity: \[ \log\left(\frac{\text{median}(x)}{\text{median}(y)}\right) = \gamma \]
Exponentiating both sides: \[ e^{\log\left(\frac{\text{median}(x)}{\text{median}(y)}\right)} = e^\gamma \]
Therefore: \[ \frac{\text{median}(x)}{\text{median}(y)} = e^\gamma \]
Interpretation
The median of group \(x\) is estimated to be \(e^\gamma\) times the median of group \(y\).
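
A worked example of this interpretation, offered as a sketch under stated assumptions: the data are hypothetical, the analysis is a pooled two-sample comparison on the log scale, and the critical value is taken from a standard t table for 10 degrees of freedom.

```python
import math
import statistics

T_CRIT = 2.228  # 97.5th percentile of t with 6 + 6 - 2 = 10 df (table value)

# Hypothetical positive measurements for two groups of six.
x = [18.0, 25.0, 40.0, 32.0, 55.0, 21.0]
y = [9.0, 14.0, 11.0, 20.0, 8.0, 16.0]
log_x = [math.log(v) for v in x]
log_y = [math.log(v) for v in y]

n1, n2 = len(log_x), len(log_y)
gamma = statistics.mean(log_x) - statistics.mean(log_y)  # difference on the log scale
sp2 = ((n1 - 1) * statistics.variance(log_x)
       + (n2 - 1) * statistics.variance(log_y)) / (n1 + n2 - 2)
se = math.sqrt(sp2 * (1 / n1 + 1 / n2))

# Back-transform the estimate and both interval endpoints.
ratio = math.exp(gamma)
ci = (math.exp(gamma - T_CRIT * se), math.exp(gamma + T_CRIT * se))
print(f"estimated ratio of medians: {ratio:.2f}")
print(f"95% CI for the ratio: ({ci[0]:.2f}, {ci[1]:.2f})")
```

The back-transformed interval is a statement about the ratio of medians, so the value 1 (no multiplicative difference) plays the role that 0 plays on the log scale.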
Inequality of Variance
- Visual evidence should be the primary tool for detecting unequal variances (e.g., side-by-side boxplots or spread differences in histograms).
- F-test: A formal test for equality of variances, but it is sensitive to violations of normality.
  - \(H_0\): Population variances are equal.
  - \(H_a\): Population variances are not equal.
- Use hypothesis tests for equality of variance with caution, and only when assumptions are satisfied.
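
For reference, the F statistic is simply the ratio of the two sample variances. The sketch below computes the statistic and its degrees of freedom for hypothetical data; it stops short of a p-value, which would need an F distribution that the Python standard library does not provide.

```python
import statistics

def variance_ratio(x, y):
    """Return (F statistic, numerator df, denominator df) for var(x) / var(y)."""
    f_stat = statistics.variance(x) / statistics.variance(y)
    return f_stat, len(x) - 1, len(y) - 1

# Hypothetical samples.
group_1 = [4.1, 4.4, 3.9, 4.6, 4.2, 4.0]
group_2 = [3.0, 5.5, 4.1, 6.2, 2.4, 4.9, 3.8]

f_stat, df_num, df_den = variance_ratio(group_2, group_1)
print(f"F = {f_stat:.1f} on {df_num} and {df_den} degrees of freedom")
```

Under \(H_0\) the ratio should be near 1. Because the reference distribution depends heavily on normality, a large ratio is best read alongside the plots rather than on its own.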
General Rules of Thumb
- When sample sizes are equal and large, t-tools are generally robust.
- When standard deviations differ:
  - If sample sizes are equal: t-tools are still valid with large samples.
  - If sample sizes are unequal: pooled-variance t-tools are unreliable, and alternatives such as Welch’s t-test should be considered.
Welch’s t-Test
- Designed to handle situations where population standard deviations are unequal.
- Adjusts the degrees of freedom using the Satterthwaite approximation.
- Still assumes normality in the underlying populations.
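
The Satterthwaite adjustment can be written out directly. The sketch below computes Welch's t statistic and approximate degrees of freedom for hypothetical data; the `welch_t` helper is an illustration, not a library function.

```python
import math
import statistics

def welch_t(x, y):
    """Return (t statistic, Satterthwaite degrees of freedom) for mean(x) - mean(y)."""
    n1, n2 = len(x), len(y)
    v1 = statistics.variance(x) / n1  # squared standard error of mean(x)
    v2 = statistics.variance(y) / n2  # squared standard error of mean(y)
    t_stat = (statistics.mean(x) - statistics.mean(y)) / math.sqrt(v1 + v2)
    df = (v1 + v2) ** 2 / (v1**2 / (n1 - 1) + v2**2 / (n2 - 1))
    return t_stat, df

# Hypothetical data: a small, highly variable group and a larger, tight group.
small_group = [12.0, 20.0, 7.0, 25.0]
large_group = [10.1, 9.8, 10.4, 10.0, 9.9, 10.2, 10.3, 9.7, 10.0, 10.1]

t_stat, df = welch_t(small_group, large_group)
print(f"t = {t_stat:.2f}, approximate df = {df:.2f}")
```

The approximate degrees of freedom always fall between the smaller group's n - 1 and the pooled n1 + n2 - 2. When one group dominates the uncertainty, as here, they sit close to that group's own n - 1.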
Non-Normal Distributions
- Long-tailed distributions:
  - Can produce confidence intervals that are wider than necessary, so actual coverage exceeds the nominal \(100(1 - \alpha)\%\) level (for example, more than 95% for a nominal 95% CI).
  - Because these intervals capture the true mean \(\mu\) more often than intended, the null hypothesis is rejected at a rate below \(\alpha\).
- Unequal sample sizes and variances:
  - When both sample sizes and standard deviations differ substantially, confidence intervals may become too narrow or too wide.
  - Actual coverage can then fall above or below the nominal 95% level, depending on the direction of the imbalance: intervals tend to be too narrow when the smaller sample has the larger standard deviation, and too wide when the larger sample does.
See Displays 3.4 and 3.5 in The Statistical Sleuth (Ramsey and Schafer 2012) for simulation-based evidence.