Applied ML Case Studies

Seven case studies in regression, classification, ensembles, and deep learning

Overview

These seven applied case studies in statistical machine learning were completed in a graduate-level course near the end of the SMU MS in Data Science program. Each tackles a different kind of problem with the methods suited to it: regularized regression for prediction and feature interpretation on a materials dataset; logistic regression with multiple imputation on messy clinical data; a Naive Bayes spam filter; ensemble methods for imbalanced bankruptcy prediction; SVMs and SGD for multi-class network traffic; deep neural networks for high-energy physics event classification on a 7M-event dataset; and cost-sensitive classification on anonymized data, where false positives and false negatives carry different financial costs.

Every case study includes a Jupyter notebook with the full analysis and a LaTeX-typeset report summarizing the methodology, results, and interpretation.

View the full repository on GitHub

Case studies at a glance

1. Problem: Predict the critical temperature of superconductors from material properties
   Methods: Linear regression, LASSO, Ridge, ElasticNet; cross-validation; residual diagnostics
   Dataset: Superconductor materials (UCI, 21K rows, 82 features)
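
   A minimal sketch of the regularized-regression workflow, assuming the UCI
   CSV with a critical_temp target column (the file path is hypothetical):

      import pandas as pd
      from sklearn.model_selection import train_test_split
      from sklearn.pipeline import make_pipeline
      from sklearn.preprocessing import StandardScaler
      from sklearn.linear_model import LassoCV, RidgeCV, ElasticNetCV

      df = pd.read_csv("superconductor_train.csv")  # hypothetical path
      X, y = df.drop(columns="critical_temp"), df["critical_temp"]
      X_tr, X_te, y_tr, y_te = train_test_split(X, y, test_size=0.2,
                                                random_state=0)

      # Scale first so the penalties treat all 82 features comparably;
      # each estimator picks its regularization strength by internal CV.
      for name, est in [("lasso", LassoCV(cv=5)),
                        ("ridge", RidgeCV()),
                        ("enet", ElasticNetCV(cv=5, l1_ratio=[0.2, 0.5, 0.8]))]:
          model = make_pipeline(StandardScaler(), est)
          model.fit(X_tr, y_tr)
          print(name, round(model.score(X_te, y_te), 3))  # held-out R^2
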
2. Problem: Predict hospital readmission risk among diabetic patients (within 30 days, after 30 days, or none)
   Methods: Logistic regression, multiple imputation, multiclass classification, precision-recall curves
   Dataset: Diabetes hospital readmissions (UCI, 101K encounters, 49 features)
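
   A sketch of the modeling step; sklearn's IterativeImputer stands in here
   for the report's multiple-imputation procedure, and the file path, column
   names, and numeric-only subset are illustrative assumptions:

      import pandas as pd
      from sklearn.experimental import enable_iterative_imputer  # noqa: F401
      from sklearn.impute import IterativeImputer
      from sklearn.pipeline import make_pipeline
      from sklearn.preprocessing import StandardScaler
      from sklearn.linear_model import LogisticRegression
      from sklearn.model_selection import cross_val_score

      df = pd.read_csv("diabetic_data.csv")   # hypothetical path
      y = df["readmitted"]                    # three classes: "<30", ">30", "NO"
      X = df.select_dtypes("number")          # numeric features only, for brevity

      # Round-robin regression imputation, then a multinomial logistic model.
      clf = make_pipeline(IterativeImputer(random_state=0),
                          StandardScaler(),
                          LogisticRegression(max_iter=1000))
      print(cross_val_score(clf, X, y, cv=5, scoring="f1_macro").mean())
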
3. Problem: Classify spam email and group documents by topic
   Methods: Naive Bayes, bag-of-words, TF-IDF, K-Means clustering
   Dataset: SpamAssassin emails (~9K messages)
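
   A minimal sketch of the spam pipeline on toy messages (parsing the real
   corpus is omitted):

      from sklearn.feature_extraction.text import TfidfVectorizer
      from sklearn.naive_bayes import MultinomialNB
      from sklearn.pipeline import make_pipeline
      from sklearn.cluster import KMeans

      texts = ["win free cash now", "claim your prize today",
               "meeting notes attached", "lunch at noon on friday"]
      labels = [1, 1, 0, 0]                   # 1 = spam, 0 = ham

      # TF-IDF features feed both the classifier and the topic clustering.
      spam_clf = make_pipeline(TfidfVectorizer(stop_words="english"),
                               MultinomialNB())
      spam_clf.fit(texts, labels)
      print(spam_clf.predict(["free cash prize"]))

      tfidf = TfidfVectorizer(stop_words="english").fit_transform(texts)
      print(KMeans(n_clusters=2, n_init=10, random_state=0).fit_predict(tfidf))
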
4. Problem: Predict corporate bankruptcy from financial indicators with severe class imbalance
   Methods: XGBoost, Random Forest, class weighting, stratified cross-validation, ROC/AUC evaluation, hyperparameter tuning
   Dataset: Polish company bankruptcy (UCI, 43K records, 64 features)
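
   A sketch of the class-weighting idea on synthetic stand-in data;
   scale_pos_weight is set from the observed class ratio:

      from sklearn.datasets import make_classification
      from sklearn.model_selection import StratifiedKFold, cross_val_score
      from xgboost import XGBClassifier

      # Stand-in for the 64-feature bankruptcy data (~4% positive class).
      X, y = make_classification(n_samples=5000, n_features=64,
                                 weights=[0.96], random_state=0)
      ratio = (y == 0).sum() / (y == 1).sum()

      clf = XGBClassifier(n_estimators=300, max_depth=4, learning_rate=0.1,
                          scale_pos_weight=ratio, eval_metric="auc")
      cv = StratifiedKFold(n_splits=5, shuffle=True, random_state=0)
      print(cross_val_score(clf, X, y, cv=cv, scoring="roc_auc").mean())
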
5. Problem: Multi-class classification of firewall actions from network traffic features
   Methods: SVMs (linear and RBF kernels), SGD-based logistic regression, class weighting, feature scaling
   Dataset: Internet firewall log data (UCI, 65K records, 11 features)
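
   A sketch comparing the two model families on synthetic stand-in data;
   scaling before the SVM is the step that matters most:

      from sklearn.datasets import make_classification
      from sklearn.model_selection import cross_val_score
      from sklearn.pipeline import make_pipeline
      from sklearn.preprocessing import StandardScaler
      from sklearn.svm import SVC
      from sklearn.linear_model import SGDClassifier

      # Stand-in for the 11-feature, multi-class firewall log data.
      X, y = make_classification(n_samples=2000, n_features=11,
                                 n_informative=8, n_classes=4, random_state=0)

      for name, est in [("rbf-svm", SVC(kernel="rbf", class_weight="balanced")),
                        ("sgd-logreg", SGDClassifier(loss="log_loss",
                                                     class_weight="balanced"))]:
          pipe = make_pipeline(StandardScaler(), est)
          print(name, cross_val_score(pipe, X, y, cv=5).mean())
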
6. Problem: Distinguish particle-physics signal events from background
   Methods: Feedforward neural networks in PyTorch and PyTorch Lightning, dropout, learning-rate scheduling
   Dataset: HEPMASS (UCI, ~7M events, 28 features)
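
   A plain-PyTorch sketch of the architecture (the project itself uses
   PyTorch Lightning); random tensors stand in for the 7M-event file, and
   the layer sizes are illustrative:

      import torch
      from torch import nn

      net = nn.Sequential(
          nn.Linear(28, 256), nn.SiLU(), nn.Dropout(0.2),  # SiLU == Swish
          nn.Linear(256, 256), nn.SiLU(), nn.Dropout(0.2),
          nn.Linear(256, 1),
      )
      opt = torch.optim.Adam(net.parameters(), lr=1e-3)
      sched = torch.optim.lr_scheduler.StepLR(opt, step_size=5, gamma=0.5)
      loss_fn = nn.BCEWithLogitsLoss()

      X = torch.randn(1024, 28)                    # stand-in features
      y = torch.randint(0, 2, (1024, 1)).float()   # 1 = signal, 0 = background
      for epoch in range(10):
          opt.zero_grad()
          loss = loss_fn(net(X), y)
          loss.backward()
          opt.step()
          sched.step()                             # halve the LR every 5 epochs
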
7. Problem: Binary classification with asymmetric misclassification costs
   Methods: XGBoost, neural network, out-of-fold cross-validation, threshold tuning to minimize total cost
   Dataset: Anonymized binary classification data (160K records, 50 features)
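
   A sketch of the cost-based threshold search over out-of-fold
   probabilities; the data and the per-error costs are hypothetical:

      import numpy as np
      from sklearn.datasets import make_classification
      from sklearn.model_selection import cross_val_predict
      from xgboost import XGBClassifier

      X, y = make_classification(n_samples=4000, n_features=50,
                                 weights=[0.9], random_state=0)
      # Out-of-fold probabilities: every prediction comes from a model
      # that never saw that row, so the threshold choice is honest.
      proba = cross_val_predict(XGBClassifier(), X, y, cv=5,
                                method="predict_proba")[:, 1]

      COST_FP, COST_FN = 1.0, 5.0                  # hypothetical asymmetric costs

      def total_cost(t):
          pred = proba >= t
          fp = np.sum(pred & (y == 0))
          fn = np.sum(~pred & (y == 1))
          return COST_FP * fp + COST_FN * fn

      best = min(np.linspace(0.01, 0.99, 99), key=total_cost)
      print(f"best threshold {best:.2f}, total cost {total_cost(best):.0f}")
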

What this collection demonstrates

  • Method selection. Each problem is matched to a model family that fits its structure: regularization for feature interpretation, imputation for missingness, ensembles for imbalance, neural nets for scale, threshold tuning for cost asymmetry.
  • Full analysis cycle. Each project runs from data cleaning and feature engineering through cross-validation, hyperparameter tuning, and diagnostics to interpretation and a written report.
  • Tool breadth. scikit-learn, XGBoost, PyTorch, PyTorch Lightning, statsmodels, and the standard Python data stack.
  • Communication. Every project includes a LaTeX-typeset PDF report aimed at a technical reader.
  • Concrete outcomes. A neural network with Swish activation reached 79% accuracy on the HEPMASS independent test set; threshold tuning on a cost-sensitive task cut total misclassification cost by 50% vs. baseline.


Skills

Python · scikit-learn · XGBoost · PyTorch · Machine learning · Statistical modeling