Communicating with Clients

Objectives

This chapter covers key aspects of communicating statistical analyses effectively, including:

Consulting and interacting with clients
Applying models in practice
Effective communication strategies
Handling missing data
Interpreting Mean Squared Error (MSE)

Consulting

Statistical Toolset

A consultant’s statistical toolbox includes:

Group comparisons
Nonparametric methods
Regression analysis
Feature selection (penalized regression, stepwise selection)
Bootstrap and bagging

Key Questions Before Starting

What should I consider before beginning the analysis?
What should the overall plan be?
How do I communicate my findings effectively?

Challenges for New Analysts

Not fully understanding client needs
Overlooking what the dataset offers

Effective Communication with Clients

Good opening lines when talking to clients with limited statistical knowledge:

“Tell me about your study/project.”
Avoid asking, “How can I help?” or “What do you need?”, as these may bias your understanding.

Extracting Key Information

Study Goals

Is the purpose to explain, predict, or both?
Is the response variable numerical, categorical, count, or something else?
Are there specific questions the client wants to answer?
If predicting, does the client want to quantify how predictors contribute to response changes?

Population and Data Collection

What is the target population, and how was data collected?
Can conclusions apply to a slightly smaller population?
Are some variables hard to obtain/measure or unreliable?
Are measurements repeatable?
Is each observation independent?

Practical Considerations

What are meaningful differences, slopes, or accuracy metrics?
What actions will be taken based on the results?

Why This Information Matters

Response variable type determines the appropriate method (e.g., regression vs. classification).
Understanding the goal (explanation vs. prediction) helps select appropriate statistical tools.
- e.g., selecting a tool that supports hypothesis testing if required
Some people use “predict” loosely, meaning correlation or association tests instead.
Adjusting population definition can impact:
- Handling missing data
- Identifying population restrictions (e.g., specific patient subsets like lupus patients prior to treatment)

Data Reliability and Independence

Is the data an accurate reflection of reality?
- Consider biases in reporting (e.g., self-reported data like race).
- Instrument reliability and potential degradation over time.
- If you measure again, will you get the same value?
  - Helps identify important explanatory variables and confounding variables.
Independence assumption is influenced by data collection methods:
- Students from the same district may not be independent.
- Time series data may have autocorrelation.
- Spatially close observations may be correlated.

Practical vs. Statistical Significance

Ensures results are interpreted correctly and not overemphasized

Considerations for Predictive Modeling

Model readiness for deployment
Need for additional/better predictors
Redefining response variables
Collecting more data for refinement
Identifying whether hypothesis testing is needed for decision-making
Computation needs for model deployment (beyond the scope of this course)

Big Data Considerations

Google Flu Trends (GFT) Case Study

Background: CDC reports flu-related doctor visits with a 1-2 week lag. GFT aimed to predict these visits with a 1-day lag.
Issue: GFT overestimated doctor visits by almost 2x.

Big Data Hubris

Assumes big data fixes all statistical issues, including sampling biases (i.e., sampling that isn’t random).
Many big data sources are not from highly accurate scientific instruments.
“Garbage in, garbage out” principle applies.

Understanding Model Predictions

Flu season coincides with winter.
GFT detected both flu and winter, leading to misleading predictors.
- Important predictors were basketball related (i.e., winter sport related).

Temporal Dependencies in Prediction Models

Errors in prediction models are often not independent over time.
The correlation within prediction errors can be used to improve model accuracy.

Data Snooping and Overfitting

The more predictors included, the higher the chance of falsely finding statistical significance (Type I error).
When the number of predictors is much higher than the number of observations, overfitting is much more likely.

Algorithm Dynamics and Data Stability

Changing the algorithm that produces your dataset can fundamentally alter its properties.
Is the measurement of a variable stable and comparable across cases and over time?

Database Logistics

Just because data is available doesn’t mean all of it should be used.
Consider:
- Sampling appropriately to represent the target population.
- Measurement error risks (i.e., consider the error risk of variables).
- The necessity of specific variables - which ones are practically important?

Regression Models in Practice

General Workflow

The general workflow for regression is a highly iterative process.

Data Processing: Cleaning, curation, dropping variables
Exploratory Data Analysis (EDA):
- Summary statistics
- Visualization
- Light modeling for insight and assumption checking
Model Building:
- Feature selection
- Tuning (cross-validation)
- Manual iteration
- Assumption checking and fixing
Communicating Results:
- Hypothesis testing and interpretation
- Report prediction performance metrics (validation set results)
- Model comparison (provide a table, e.g., non-parametric vs. parametric)
- Deployment strategy

Advice: Avoid excessive iteration to prevent overfitting. Ensure models use the same finalized, processed dataset.

Be mindful that the way missing data is handled can influence which population your model represents. For example, if missing values are not missing at random and are removed, the final dataset may not be representative of the full population.

Communicating Results

Challenges

Iterative nature of analysis makes summarization difficult
Finding the right level of technical detail for the audience
Time and page constraints

Structuring Reports and Presentations

Try to summarize your approach and workflow by mimicking a theoretical linear workflow.
Additional details can be placed in an appendix at the back of a written report or at the end of a presentation.
A well-structured report typically includes:
1. Problem Statement and Presentation Overview
  - Clearly define the objective and goals.
  - Provide a brief outline of the report structure, especially if there are:
    - Multiple objectives
    - Multiple approaches
    - Appendices for more detailed information
2. Data Description and Processing Summary
  - Define key variables and their roles in the analysis.
  - When using coded variables, provide a table mapping codes to actual names.
  - Consider including data types (numeric, categorical, etc.), as different analysts may interpret variables differently.
  - Summarize how data processing was conducted.
3. Exploratory Data Analysis (EDA)
  - Highlight important relationships between variables.
  - Use summary statistics and visualizations to articulate key insights.
4. Results and Interpretation
  - Provide a clear explanation of the final model’s findings.
  - Interpret coefficients and/or evaluate predictive performance.
  - Explain whether the error magnitude is meaningful in the real-world application.
5. Conclusions and Work-in-Progress (WIP)
  - Offer brief global conclusions and next steps.
  - If there are multiple problem statements, repeat steps 3 & 4 as needed.

Data Processing Considerations

Ensure that the final processed dataset aligns with EDA and model inputs.
Summarize missing data handling without excessive technical detail.

Presenting EDA Findings

Summary statistics

Should always be provided. Include mean, standard deviation, 5-number summary (median, min/max), and category counts or proportions.
Provide reader/audience a way to sanity check basic things.
Can provide intuition for what parts of the model is doing. For example, why are you running KNN on predictors that are z-scored (or otherwise scaled)?

Graphics and Tables

Don’t assume that figures speak for themselves—explicitly reference them. (Additional figures can be included in an appendix.)
If a graph is too cluttered, either simplify it or guide the audience’s focus.

Showing Trends

EDA visuals used for model-building may not be the best for presentation.
Focus on how the response variable relates to predictors.
Highlight trends that justify interaction terms or transformations.
Avoid overemphasizing multicollinearity unless it directly affects interpretation.
Consider parsing out numerical and categorical variables with either scatterplots or boxplots.
Build the audience’s intuition of the final model, highlight important relationships and unimportant ones when a variable was part of the question of interest.

Regression Tables

Consider whether the audience needs to see p-values and t-statistics.
- Would confidence intervals and significance markers (*, **, ***, etc.) be clearer?
- Can results be reformatted to show only essential information?
- Avoid overwhelming the audience with statistics and technical details that aren’t being used in interpretation.

Interpreting Mean Squared Error (MSE)

MSE measures prediction error, balancing bias and variance.
A lower MSE does not always mean a better model:
- Practical significance matters:
  - A model predicting hospital stay within ~1 day may be useful, even if another model has a slightly lower MSE but lacks interpretability.
  - A model achieving a lower MSE but only predicting the mean hospital stay may not be useful.
MSE should be considered alongside generalization ability, interpretability, and real-world usefulness.

Handling Missing Data

Types of Missing Data

Missing Completely at Random (MCAR)

The missing observations form a random subset of all observations.
The data removed is consistent with the data kept, meaning:
- Similar summary statistics, correlation behavior, and distributions.
Listwise deletion is acceptable, and imputation is also an option.

Missing at Random (MAR)

The probability of missingness depends on observed covariates but not on the missing values themselves.
- If you stratify the data by a known variable, the missing data within each subgroup is MCAR.
- Example: If income values are missing more often for older individuals, but within each age group the missingness is random, then the data is MAR.
Listwise deletion is generally not recommended, but may be acceptable if the covariate influencing missingness is not important in the model.
Imputation is preferred if the covariate affects the response variable or other predictors.

Missing Not at Random (MNAR)

A systematic reason exists for why certain values are missing.
- The probability of missingness depends on the missing values themselves, meaning the reason for missingness is unknown or systematic.
- Example: If people with higher incomes are less likely to report their salary, then missingness depends on the unobserved income values themselves
Catch-all category for cases that do not satisfy MCAR or MAR conditions.
Listwise deletion is not okay, and imputation may not be possible without strong assumptions.

Deletion Considerations

Most statistical software deletes rows with missing data by default, but this:
- Reduces sample size.
- Changes population representation, potentially introducing bias.
When is deletion acceptable?
- MCAR: Deletion is valid because missingness is random.
- MAR: Deletion may be justifiable if the missingness is due to a covariate that is not relevant to the analysis.
  - However, if the covariate affects the response or other predictors, deletion is not recommended, and imputation should be considered.
- MNAR: Deletion should not be used, as missingness is related to the values themselves and may distort the results.
- With MCAR or MAR, consider:
  - Summary statistics and comparisons of deleted vs. retained data.
  - Whether deletion skews the population of interest.

Imputation Methods

There are multiple ways to replace missing values with reasonable estimates.

Simple Imputation

Replace missing values with mean, median, or mode (for categorical values).
Limitations:
- Only reasonable when missing data is truly random (MCAR).
- Can distort distributions if a large proportion of data is missing.

Regression Imputation

Fit a regression model on complete data.
Predict missing values using the model.
Repeat for each variable with missing values.

The variable with missing values becomes the response variable in this prediction problem.
Can be extended to classification models for categorical variables.
Imputation should exclude the true response variable.

Practical Imputation Strategies

Stratify data when imputing (e.g., by country or demographic groups).
Use visualizations to assess missing data patterns.