Applied ML Case Studies
Seven case studies in regression, classification, ensembles, and deep learning
Kristin Henderson
Summer 2025
Overview
These seven applied case studies in statistical machine learning were completed in a graduate-level course near the end of the SMU MS in Data Science program. Each tackles a different kind of problem with the methods that suit it best:
- Regularized regression for prediction and feature interpretation on a materials dataset
- Logistic regression with multiple imputation on messy clinical data
- A Naive Bayes spam filter
- Ensemble methods for imbalanced bankruptcy prediction
- SVMs and SGD for multi-class network traffic classification
- Deep neural networks for high-energy physics event classification on a 7M-event dataset
- Cost-sensitive classification on anonymized data, where false positives and false negatives carry different financial costs
Every case study includes a Jupyter notebook with the full analysis and a LaTeX-typeset report summarizing the methodology, results, and interpretation.
View the full repository on GitHub
Case studies at a glance
| # | Problem | Methods | Dataset | Links |
|---|---|---|---|---|
| 1 | Predict critical temperature of superconductors from material properties | Linear regression, LASSO, Ridge, ElasticNet; cross-validation; residual diagnostics | Superconductor materials (UCI, 21K rows, 82 features) | Notebook Report |
| 2 | Predict hospital readmission risk among diabetic patients (within 30 days, after 30 days, or none) | Logistic regression, multiple imputation, multiclass classification, precision-recall curves | Diabetes hospital readmissions (UCI, 101K encounters, 49 features) | Notebook Report |
| 3 | Classify spam email and group documents by topic | Naive Bayes, bag-of-words, TF-IDF, K-Means clustering | SpamAssassin emails (~9K messages) | Notebook Report |
| 4 | Predict corporate bankruptcy from financial indicators with severe class imbalance | XGBoost, Random Forest, class weighting, stratified cross-validation, ROC/AUC evaluation, hyperparameter tuning | Polish company bankruptcy (UCI, 43K records, 64 features) | Notebook Report |
| 5 | Multi-class classification of firewall actions from network traffic features | SVMs (linear and RBF kernels), SGD-based logistic regression, class weighting, feature scaling | Internet firewall log data (UCI, 65K records, 11 features) | Notebook Report |
| 6 | Distinguish particle-physics signal events from background | Feedforward neural networks in PyTorch and PyTorch Lightning, dropout, learning-rate scheduling | HEPMASS (UCI, ~7M events, 28 features) | Notebook Report |
| 7 | Binary classification with asymmetric misclassification costs | XGBoost, neural network, out-of-fold cross-validation, threshold tuning to minimize total cost | Anonymized binary classification data (160K records, 50 features) | Notebook Report |
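The regularized-regression comparison in case study 1 can be sketched as a cross-validated bake-off between penalty types. This is an illustrative stand-in, not the actual superconductor analysis: the synthetic data, alpha values, and scoring choice below are assumptions for the sketch.

```python
# Sketch: comparing regularized linear models with 5-fold cross-validation.
# Synthetic data stands in for the superconductor features.
from sklearn.datasets import make_regression
from sklearn.linear_model import ElasticNet, Lasso, LinearRegression, Ridge
from sklearn.model_selection import cross_val_score
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

X, y = make_regression(n_samples=500, n_features=80, n_informative=20,
                       noise=10.0, random_state=0)

models = {
    "OLS": LinearRegression(),
    "Ridge": Ridge(alpha=1.0),
    "LASSO": Lasso(alpha=0.1, max_iter=10_000),
    "ElasticNet": ElasticNet(alpha=0.1, l1_ratio=0.5, max_iter=10_000),
}

for name, model in models.items():
    # Standardize inside the pipeline so the penalty treats features equally
    # and no scaling information leaks across CV folds.
    pipe = make_pipeline(StandardScaler(), model)
    scores = cross_val_score(pipe, X, y, cv=5, scoring="r2")
    print(f"{name:10s} mean CV R^2 = {scores.mean():.3f}")
```

Wrapping the scaler and model in one pipeline matters here: penalized coefficients are only comparable across features when the features share a scale.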
What this collection demonstrates
- Method selection. Each problem is matched to a model family that fits its structure: regularization for feature interpretation, imputation for missingness, ensembles for imbalance, neural nets for scale, threshold tuning for cost asymmetry.
- Full analysis cycle. Data cleaning, feature engineering, cross-validation, hyperparameter tuning, diagnostics, and interpretation, all the way to a written report.
- Tool breadth. scikit-learn, XGBoost, PyTorch, PyTorch Lightning, statsmodels, and the standard Python data stack.
- Communication. Every project includes a LaTeX-typeset PDF report aimed at a technical reader.
- Concrete outcomes. A neural network with Swish activation reached 79% accuracy on the HEPMASS independent test set; threshold tuning on a cost-sensitive task cut total misclassification cost by 50% vs. baseline.
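The threshold-tuning idea behind the cost-sensitive result can be sketched in a few lines: with out-of-fold predicted probabilities in hand, sweep candidate cutoffs and keep the one that minimizes total cost when a false negative costs more than a false positive. The simulated probabilities and the 1:5 cost ratio below are made-up illustrations, not the course's actual data or costs.

```python
# Sketch: choosing a decision threshold to minimize total
# misclassification cost under asymmetric error costs.
import numpy as np

rng = np.random.default_rng(0)
y_true = rng.integers(0, 2, size=1000)
# Simulated predicted probabilities: informative but imperfect.
y_prob = np.clip(y_true * 0.6 + rng.normal(0.2, 0.25, size=1000), 0.0, 1.0)

COST_FP, COST_FN = 1.0, 5.0  # assumed: false negatives cost 5x more

def total_cost(threshold):
    """Total dollar cost of classifying at the given cutoff."""
    y_pred = (y_prob >= threshold).astype(int)
    fp = np.sum((y_pred == 1) & (y_true == 0))
    fn = np.sum((y_pred == 0) & (y_true == 1))
    return COST_FP * fp + COST_FN * fn

thresholds = np.linspace(0.01, 0.99, 99)
costs = [total_cost(t) for t in thresholds]
best = thresholds[int(np.argmin(costs))]
print(f"cost at 0.5: {total_cost(0.5):.0f}; "
      f"cost at tuned threshold {best:.2f}: {min(costs):.0f}")
```

Because false negatives are costlier, the optimal cutoff lands below the default 0.5, trading extra false positives for fewer misses.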
Skills
Python · scikit-learn · XGBoost · PyTorch · Machine learning · Statistical modeling