Employee Attrition | Kristin Henderson

Overview

A two-part analysis of employee attrition and compensation, completed for SMU’s MSDS Doing Data Science course. The dataset is a teaching case study framed around Frito-Lay, with 870 employees and 36 variables on demographics, role, tenure, satisfaction, and pay. About 16% of employees left, with attrition concentrated in job level 1, where more than a quarter left.

The first objective is to identify what drives attrition. EDA, t-tests on numerical variables, and chi-square tests on categorical variables narrowed the 36 features to a smaller candidate set. I then compared Naive Bayes and k-nearest neighbors (k-NN) classifiers, tuning over feature combinations, the smoothing parameter (Naive Bayes), the number of neighbors (k-NN), and the decision threshold. The top three drivers of attrition were job level, monthly income, and overtime. I chose Naive Bayes as the final model and used it to label the held-out competition set.
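Tuning the decision threshold matters here because only about 16% of employees left, so a default cutoff of 0.5 tends to miss leavers. The sketch below illustrates the idea in plain Python rather than the R used in the analysis. The function names, the toy probabilities, and the selection rule (maximize sensitivity plus specificity) are assumptions made for illustration, not the project's actual code or criterion.

```python
def sensitivity_specificity(y_true, p_leave, threshold):
    """Return (sensitivity, specificity) when 'left' is predicted for p_leave >= threshold."""
    tp = fn = tn = fp = 0
    for actual, p in zip(y_true, p_leave):
        predicted = p >= threshold
        if actual and predicted:
            tp += 1
        elif actual:
            fn += 1
        elif predicted:
            fp += 1
        else:
            tn += 1
    sens = tp / (tp + fn) if (tp + fn) else 0.0
    spec = tn / (tn + fp) if (tn + fp) else 0.0
    return sens, spec


def best_threshold(y_true, p_leave, candidates):
    """Pick the candidate threshold that maximizes sensitivity + specificity (assumed rule)."""
    return max(candidates, key=lambda t: sum(sensitivity_specificity(y_true, p_leave, t)))


# Made-up example: 1 = left, 0 = stayed, with hypothetical predicted probabilities of leaving.
y_true = [1, 0, 0, 1, 0, 0, 0, 1, 0, 0]
p_leave = [0.62, 0.10, 0.35, 0.41, 0.22, 0.48, 0.05, 0.30, 0.15, 0.55]

chosen = best_threshold(y_true, p_leave, [0.2, 0.3, 0.4, 0.5, 0.6])
print(chosen, sensitivity_specificity(y_true, p_leave, chosen))
```

Lowering the threshold trades specificity for sensitivity, so the selection rule should reflect which kind of error is more costly in practice.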

For the second objective, I fit a linear regression for monthly income, comparing forward, backward, and stepwise selection over the full feature set plus two-way interaction terms. Candidate models were evaluated with cross-validated PRESS and held-out RMSE. The final ten-term model predicts monthly income within about $1,000 on held-out data.
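For readers unfamiliar with PRESS, the sketch below shows the classic leave-one-out version for a one-predictor regression: each observation is predicted by a model fit without it, and the squared errors are summed. It is written in plain Python rather than the R used in the analysis, the data are made up, and the function names are hypothetical. The project's cross-validated PRESS may have used a different fold scheme, so treat this as an illustration of the statistic, not a reproduction of the method.

```python
def fit_line(xs, ys):
    """Ordinary least squares for y = b0 + b1 * x; returns (b0, b1)."""
    n = len(xs)
    x_bar = sum(xs) / n
    y_bar = sum(ys) / n
    sxx = sum((x - x_bar) ** 2 for x in xs)
    b1 = sum((x - x_bar) * (y - y_bar) for x, y in zip(xs, ys)) / sxx
    return y_bar - b1 * x_bar, b1


def press_refit(xs, ys):
    """PRESS by brute force: refit without each point, then predict that point."""
    total = 0.0
    for i in range(len(xs)):
        b0, b1 = fit_line(xs[:i] + xs[i + 1:], ys[:i] + ys[i + 1:])
        total += (ys[i] - (b0 + b1 * xs[i])) ** 2
    return total


def press_leverage(xs, ys):
    """PRESS from a single fit, using the identity e_i / (1 - h_ii)."""
    n = len(xs)
    x_bar = sum(xs) / n
    sxx = sum((x - x_bar) ** 2 for x in xs)
    b0, b1 = fit_line(xs, ys)
    total = 0.0
    for x, y in zip(xs, ys):
        h = 1 / n + (x - x_bar) ** 2 / sxx  # leverage of this observation
        total += ((y - (b0 + b1 * x)) / (1 - h)) ** 2
    return total


# Made-up data: job level (x) against monthly income in dollars (y).
job_level = [1, 1, 2, 2, 3, 4, 5]
income = [2800, 3100, 5400, 5100, 9800, 15500, 19000]

print(press_refit(job_level, income), press_leverage(job_level, income))
```

The two functions should agree up to floating-point error. A lower PRESS indicates a model that predicts unseen observations better, which is why it is a useful criterion when comparing selection procedures.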

View the full analysis notebook

View on GitHub

Interactive companion

First load may take 10 to 30 seconds while the app wakes (shinyapps.io free tier).

Explore monthly income by job role and attrition group. Built in R Shiny. Open the app in a new tab for a full-size view.

Skills

R · R Shiny · EDA · Naive Bayes · k-NN · Linear regression · Variable selection · Hypothesis testing · Cross-validation

Kristin Henderson

Spring 2024
