Applied ML Case Studies
Seven case studies in regression, classification, ensembles, and deep learning
Kristin Henderson
Summer 2025
Overview
These seven applied case studies in statistical machine learning were completed in a graduate-level course near the end of the SMU MS in Data Science program. Each tackles a different kind of problem with the methods that suit it best:
- Regularized regression for prediction and feature interpretation on a materials dataset
- Logistic regression with multiple imputation on messy clinical data
- A Naive Bayes spam filter
- Ensemble methods for imbalanced bankruptcy prediction
- SVMs and SGD for multi-class network traffic classification
- Deep neural networks for high-energy physics event classification on a 7M-event dataset
- Cost-sensitive classification on anonymized data, where false positives and false negatives carry different financial costs
Every case study includes a Jupyter notebook with the full analysis and a LaTeX-typeset report summarizing the methodology, results, and interpretation.
View the full repository on GitHub
Case studies at a glance
| # | Problem | Methods | Dataset | Links |
|---|---|---|---|---|
| 1 | Predict critical temperature of superconductors from material properties | Linear regression, LASSO, Ridge, ElasticNet; cross-validation; residual diagnostics | Superconductor materials (UCI, 21K rows, 82 features) | Notebook Report |
| 2 | Predict hospital readmission risk among diabetic patients (within 30 days, after 30 days, or none) | Logistic regression, multiple imputation, multiclass classification, precision-recall curves | Diabetes hospital readmissions (UCI, 101K encounters, 49 features) | Notebook Report |
| 3 | Classify spam email and group documents by topic | Naive Bayes, bag-of-words, TF-IDF, K-Means clustering | SpamAssassin emails (~9K messages) | Notebook Report |
| 4 | Predict corporate bankruptcy from financial indicators with severe class imbalance | XGBoost, Random Forest, class weighting, stratified cross-validation, ROC/AUC evaluation, hyperparameter tuning | Polish company bankruptcy (UCI, 43K records, 64 features) | Notebook Report |
| 5 | Multi-class classification of firewall actions from network traffic features | SVMs (linear and RBF kernels), SGD-based logistic regression, class weighting, feature scaling | Internet firewall log data (UCI, 65K records, 11 features) | Notebook Report |
| 6 | Distinguish particle-physics signal events from background | Feedforward neural networks in PyTorch and PyTorch Lightning, dropout, learning-rate scheduling | HEPMASS (UCI, ~7M events, 28 features) | Notebook Report |
| 7 | Binary classification with asymmetric misclassification costs | XGBoost, neural network, out-of-fold cross-validation, threshold tuning to minimize total cost | Anonymized binary classification data (160K records, 50 features) | Notebook Report |
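The regularized-regression comparison in case study 1 can be sketched as a cross-validated bake-off between penalty types. This is an illustrative stand-in, not the actual superconductor analysis: the synthetic data, alpha values, and scoring choice below are assumptions for the sketch.

```python
# Sketch: comparing regularized linear models with 5-fold cross-validation.
# Synthetic data stands in for the superconductor features.
from sklearn.datasets import make_regression
from sklearn.linear_model import ElasticNet, Lasso, LinearRegression, Ridge
from sklearn.model_selection import cross_val_score
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

X, y = make_regression(n_samples=500, n_features=80, n_informative=20,
                       noise=10.0, random_state=0)

models = {
    "OLS": LinearRegression(),
    "Ridge": Ridge(alpha=1.0),
    "LASSO": Lasso(alpha=0.1, max_iter=10_000),
    "ElasticNet": ElasticNet(alpha=0.1, l1_ratio=0.5, max_iter=10_000),
}

for name, model in models.items():
    # Standardize inside the pipeline so the penalty treats features equally
    # and no scaling information leaks across CV folds.
    pipe = make_pipeline(StandardScaler(), model)
    scores = cross_val_score(pipe, X, y, cv=5, scoring="r2")
    print(f"{name:10s} mean CV R^2 = {scores.mean():.3f}")
```

Wrapping the scaler and model in one pipeline matters here: penalized coefficients are only comparable across features when the features share a scale.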
What this collection demonstrates
- Method selection. Each problem is matched to a model family that fits its structure: regularization for feature interpretation, imputation for missingness, ensembles for imbalance, neural nets for scale, threshold tuning for cost asymmetry.
- Full analysis cycle. Data cleaning, feature engineering, cross-validation, hyperparameter tuning, diagnostics, and interpretation, all the way to a written report.
- Tool breadth. scikit-learn, XGBoost, PyTorch, PyTorch Lightning, statsmodels, and the standard Python data stack.
- Communication. Every project includes a LaTeX-typeset PDF report aimed at a technical reader.
- Concrete outcomes. A neural network with Swish activation reached 79% accuracy on the HEPMASS independent test set; threshold tuning on a cost-sensitive task cut total misclassification cost by 50% vs. baseline.
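The threshold-tuning idea behind the cost-sensitive result can be sketched in a few lines: with out-of-fold predicted probabilities in hand, sweep candidate cutoffs and keep the one that minimizes total cost when a false negative costs more than a false positive. The simulated probabilities and the 1:5 cost ratio below are made-up illustrations, not the course's actual data or costs.

```python
# Sketch: choosing a decision threshold to minimize total
# misclassification cost under asymmetric error costs.
import numpy as np

rng = np.random.default_rng(0)
y_true = rng.integers(0, 2, size=1000)
# Simulated predicted probabilities: informative but imperfect.
y_prob = np.clip(y_true * 0.6 + rng.normal(0.2, 0.25, size=1000), 0.0, 1.0)

COST_FP, COST_FN = 1.0, 5.0  # assumed: false negatives cost 5x more

def total_cost(threshold):
    """Total dollar cost of classifying at the given cutoff."""
    y_pred = (y_prob >= threshold).astype(int)
    fp = np.sum((y_pred == 1) & (y_true == 0))
    fn = np.sum((y_pred == 0) & (y_true == 1))
    return COST_FP * fp + COST_FN * fn

thresholds = np.linspace(0.01, 0.99, 99)
costs = [total_cost(t) for t in thresholds]
best = thresholds[int(np.argmin(costs))]
print(f"cost at 0.5: {total_cost(0.5):.0f}; "
      f"cost at tuned threshold {best:.2f}: {min(costs):.0f}")
```

Because false negatives are costlier, the optimal cutoff lands below the default 0.5, trading extra false positives for fewer misses.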
Skills
Python · scikit-learn · XGBoost · PyTorch · Machine learning · Statistical modeling