This project aims to understand the risk of bone fractures for women with osteoporosis.
The first objective is to perform an exploratory data analysis and produce a logistic regression model to identify and interpret factors important in determining the probability of suffering a fracture in the first year of study enrollment.
The second objective is to produce three additional models that prioritize prediction performance, including complex logistic regression, discriminant analysis, and non-parametric models.
The models will be compared using appropriate metrics, such as sensitivity, specificity, prevalence, PPV, NPV, AUROC and the logloss error.
This loads the necessary libraries.
# install.packages("aplore3")
library(aplore3) #data set
library(tidyverse) #included ggplot2,dplyr,tidyr,stringr,forcats
library(gt) #tables
library(DataExplorer) #plot_missing()
library(naniar) #vis_miss()
library(Hmisc)
library(RColorBrewer)
library(gridExtra)
library(GGally)
library(corrplot)
library(pheatmap)
library(caret)
library(car) #vif()
library(Metrics)
library(pROC)
library(ResourceSelection) # For hoslem.test
library(reshape2)
library(sjPlot) #effects plots
library(sjmisc) #effects plots
library(dplyr) #loading last so select() works
This imports and quickly views the data.
# Import the data
bone = glow_bonemed
# View the data
str(bone)
'data.frame': 500 obs. of 18 variables:
$ sub_id : int 1 2 3 4 5 6 7 8 9 10 ...
$ site_id : int 1 4 6 6 1 5 5 1 1 4 ...
$ phy_id : int 14 284 305 309 37 299 302 36 8 282 ...
$ priorfrac : Factor w/ 2 levels "No","Yes": 1 1 2 1 1 2 1 2 2 1 ...
$ age : int 62 65 88 82 61 67 84 82 86 58 ...
$ weight : num 70.3 87.1 50.8 62.1 68 68 50.8 40.8 62.6 63.5 ...
$ height : int 158 160 157 160 152 161 150 153 156 166 ...
$ bmi : num 28.2 34 20.6 24.3 29.4 ...
$ premeno : Factor w/ 2 levels "No","Yes": 1 1 1 1 1 1 1 1 1 1 ...
$ momfrac : Factor w/ 2 levels "No","Yes": 1 1 2 1 1 1 1 1 1 1 ...
$ armassist : Factor w/ 2 levels "No","Yes": 1 1 2 1 1 1 1 1 1 1 ...
$ smoke : Factor w/ 2 levels "No","Yes": 1 1 1 1 1 2 1 1 1 1 ...
$ raterisk : Factor w/ 3 levels "Less","Same",..: 2 2 1 1 2 2 1 2 2 1 ...
$ fracscore : int 1 2 11 5 1 4 6 7 7 0 ...
$ fracture : Factor w/ 2 levels "No","Yes": 1 1 1 1 1 1 1 1 1 1 ...
$ bonemed : Factor w/ 2 levels "No","Yes": 1 1 1 1 1 1 1 2 1 1 ...
$ bonemed_fu: Factor w/ 2 levels "No","Yes": 1 1 1 1 1 1 1 2 1 1 ...
$ bonetreat : Factor w/ 2 levels "No","Yes": 1 1 1 1 1 1 1 2 1 1 ...
# ?glow_bonemed
The data set consists of 18 variables collected on each of 500
patients. The binary categorical response variable
fracture, indicates if a patient had a fracture within the
first year of enrollment. Each patient has a unique ID and is assigned
to a study site and physician. The other 14 variables relate to the
medical history or treatment of the patient. Aside from the patient ID,
five of the variables are numeric and ten are categorical. The remaining
two, site and physician are integer coded categorical variables. A more
detailed data description is in the table below.
| Bone Health Data Description | |||
| Variable ID | Variable Name | Data Type | Description |
|---|---|---|---|
| sub_id | ID | numeric identifier | Unique Identification Code |
| site_id | Study Site | categorical integer | Study Site (1 - 6) |
| phy_id | Physician ID | categorical integer | Physician ID Code (128 unique codes) |
| priorfrac | Prior Fracture | categorical | History of Prior Fracture (1: No, 2: Yes) |
| age | Age | numeric | Age at Enrollment (years) |
| weight | Weight | numeric | Weight at Enrollment (kg) |
| height | Height | numeric | Height at Enrollment (cm) |
| bmi | BMI | numeric | Body Mass Index (kg/m^2) |
| premeno | Early Menopause | categorical | Menopause before Age 45 (1: No, 2: Yes) |
| momfrac | Mom Fracture | categorical | Mother had Hip Fracture (1: No, 2: Yes) |
| armassist | Arms Assist | categorical | Arms are needed to stand from a chair (1: No, 2: Yes) |
| smoke | Smoker | categorical | Former or current smoker (1: No, 2: Yes) |
| raterisk | Self Reported Risk | categorical | Self-reported risk of fracture (1: Less than others, 2: Same as others, 3: Greater than others) |
| fracscore | Fracture Risk | numeric | Fracture Risk Score (Composite Risk Score) |
| fracture | Fracture First Year | categorical | Any fracture in first year (1: No, 2: Yes) |
| bonemed | Meds at Enrollment | categorical | Bone medications at enrollment (1: No, 2: Yes) |
| bonemed_fu | Meds at Follow-Up | categorical | Bone medications at follow-up (1: No, 2: Yes) |
| bonetreat | Meds at Both | categorical | Bone medications both at enrollment and follow-up (1: No, 2: Yes) |
This looks for missing data.
# Check for missing data
# If any, what kind of missing are they: mcar, mar, nmar?
missing = bone
# Check for empty strings and if any convert to NA
sum(missing == "")
[1] 0
# Check for NAs in each column
colSums(is.na(missing))
sub_id site_id phy_id priorfrac age weight height
0 0 0 0 0 0 0
bmi premeno momfrac armassist smoke raterisk fracscore
0 0 0 0 0 0 0
fracture bonemed bonemed_fu bonetreat
0 0 0 0
# Plot the percent missing for PPT
miss_plot = plot_missing(missing)
# Save the plot
# ggsave("missing_data.png", miss_plot)
The data set is complete. There are no missing values.
This outputs the summary statistics, including the five number plus mean summaries of the numerical variables and count summaries of the categorical variable levels. It also notes the prevalence of the positive response class.
# Get summary statistics
summary_stats = summary(bone)
print(summary_stats)
sub_id site_id phy_id priorfrac age
Min. : 1.0 Min. :1.000 Min. : 1.00 No :374 Min. :55.00
1st Qu.:125.8 1st Qu.:2.000 1st Qu.: 57.75 Yes:126 1st Qu.:61.00
Median :250.5 Median :3.000 Median :182.50 Median :67.00
Mean :250.5 Mean :3.436 Mean :178.55 Mean :68.56
3rd Qu.:375.2 3rd Qu.:5.000 3rd Qu.:298.00 3rd Qu.:76.00
Max. :500.0 Max. :6.000 Max. :325.00 Max. :90.00
weight height bmi premeno momfrac armassist
Min. : 39.90 Min. :134.0 Min. :14.88 No :403 No :435 No :312
1st Qu.: 59.90 1st Qu.:157.0 1st Qu.:23.27 Yes: 97 Yes: 65 Yes:188
Median : 68.00 Median :161.5 Median :26.42
Mean : 71.82 Mean :161.4 Mean :27.55
3rd Qu.: 81.30 3rd Qu.:165.0 3rd Qu.:30.79
Max. :127.00 Max. :199.0 Max. :49.08
smoke raterisk fracscore fracture bonemed bonemed_fu
No :465 Less :167 Min. : 0.000 No :375 No :371 No :361
Yes: 35 Same :186 1st Qu.: 2.000 Yes:125 Yes:129 Yes:139
Greater:147 Median : 3.000
Mean : 3.698
3rd Qu.: 5.000
Max. :11.000
bonetreat
No :382
Yes:118
prop_frac = (sum(bone$fracture == 'Yes') / length(bone$fracture)) * 100
print(prop_frac)
[1] 25
The prevalence of the positive response class is 25%.
Here is a table of the summary statistics.
| Summary Statistics for Numeric Variables | ||||||
| Variable | Min | 1st Quartile | Median | Mean | 3rd Quartile | Max |
|---|---|---|---|---|---|---|
| sub_id | 1.00 | 125.80 | 250.50 | 250.50 | 375.20 | 500.00 |
| site_id | 1.00 | 2.00 | 3.00 | 3.44 | 5.00 | 6.00 |
| phy_id | 1.00 | 57.75 | 182.50 | 178.55 | 298.00 | 325.00 |
| age | 55.00 | 61.00 | 67.00 | 68.56 | 76.00 | 90.00 |
| weight | 39.90 | 59.90 | 68.00 | 71.82 | 81.30 | 127.00 |
| height | 134.00 | 157.00 | 161.50 | 161.40 | 165.00 | 199.00 |
| bmi | 14.88 | 23.27 | 26.42 | 27.55 | 30.79 | 49.08 |
| fracscore | 0.00 | 2.00 | 3.00 | 3.70 | 5.00 | 11.00 |
| Summary Statistics for Categorical Variables | |||||
| Variable |
Binary Variables
|
Multicategory Variable
|
|||
|---|---|---|---|---|---|
| No | Yes | Less | Same | Greater | |
| priorfrac | 374 | 126 | |||
| premeno | 403 | 97 | |||
| momfrac | 435 | 65 | |||
| armassist | 312 | 188 | |||
| smoke | 465 | 35 | |||
| fracture | 375 | 125 | |||
| bonemed | 371 | 129 | |||
| bonemed_fu | 361 | 139 | |||
| bonetreat | 382 | 118 | |||
| raterisk | 167 | 186 | 147 | ||
This makes separate dataframes for the numerical variables,
bone_num and for the categorical variables,
bone_cat.
# Filter numerical variables
numVars = sapply(bone, is.numeric) # outputs a logical
numVars["sub_id"] = FALSE # set patient ID column to FALSE to exclude it
bone_num = bone[, numVars]
# Filter categorical variables (factors)
bone_cat = bone %>%
select(where(is.factor))
This converts the response levels to numerical values and creates both box plots of the distribution for each level and LOESS plots of the relationship between the predictor and the response.
# Convert the binary response to No = 0, Yes = 1.
bone$fracture.num = ifelse(bone$fracture == "Yes", 1, 0)
# Sanity check
table(bone$fracture.num, bone$fracture)
No Yes
0 375 0
1 0 125
# Plots of the response versus each predictor variable
# NUMERICAL VARIABLES
# Function to create LOESS plots for numerical variables and response
make_loess = function(numvar, response, data) {
data %>%
ggplot(aes(x = .data[[numvar]], y = .data[[response]])) +
geom_point(position = position_jitter(height = 0.05), alpha = 0.6) +
geom_smooth(formula = "y~x", method = "loess", span = 0.8) +
ylim(-.2,1.2) +
labs(title = paste(numvar, "vs fracture"),
y = "proportion with fracture") +
theme_bw()
}
# Function to create box plots for numerical variables by response category
make_boxplot <- function(numvar, response, data) {
data %>%
ggplot(aes(x = .data[[numvar]], y = .data[[response]], fill = .data[[response]])) +
geom_boxplot(alpha = 0.7) +
labs(title = paste(numvar, "vs", response)) +
scale_fill_manual(values = c("blue3", "red3")) +
theme_bw() +
theme(legend.position = 'none')
}
# Create LOESS plots and boxplots for numerical variables vs "fracture"
for (numvar in names(bone_num)) {
loess_plt <- make_loess(numvar, "fracture.num", bone)
boxplt = make_boxplot(numvar, "fracture", bone)
print(loess_plt)
print(boxplt)
# ggsave(filename = paste0("figures/loessPlt_", numvar, ".png"), loess_plt)
# ggsave(filename = paste0("figures/boxPlt_", numvar, ".png"), boxplt)
}
This takes a quick look at the potential height outliers/high leverage points and replots the LOESS and boxplot without those values to re-examine the trend.
# Find indices of rows where height < 140 or height > 190
ht_outlier_idx <- which(bone$height < 140 | bone$height > 190)
# View the rows that match the condition
bone[ht_outlier_idx, ]
sub_id site_id phy_id priorfrac age weight height bmi premeno momfrac
281 281 4 284 No 57 70.3 199 17.75208 No No
430 430 2 76 No 77 74.8 134 41.65739 No No
armassist smoke raterisk fracscore fracture bonemed bonemed_fu bonetreat
281 No No Greater 0 No No No No
430 Yes No Same 6 Yes Yes Yes Yes
fracture.num
281 0
430 1
bone_noHtOut = bone[-ht_outlier_idx, ]
# Create LOESS plots and boxplots
loess_plt = make_loess("height", "fracture.num", bone_noHtOut)
boxplt = make_boxplot("height", "fracture", bone_noHtOut)
print(loess_plt)
print(boxplt)
# ggsave(filename = paste0("figures/loessPlt_heightNoOut.png"), loess_plt)
# ggsave(filename = paste0("figures/boxPlt_heightNoOut.png"), boxplt)
This creates bar plots for each categorical variable with the proportion of fractures for level.
# CATEGORICAL VARIABLES
# Function to get proportions of each response class for categorical variable levels
summarize_props <- function(catvar, response, data) {
data %>%
group_by(.data[[catvar]], .data[[response]]) %>%
summarise(cnt = n(), .groups = "drop") %>%
group_by(.data[[catvar]]) %>% # Group again to calculate proportions within each level of `catvar`
mutate(perc = round(cnt / sum(cnt), 4)) %>%
ungroup() %>% # Ensure no lingering groupings
arrange(.data[[response]], .data[[catvar]])
}
# Function to create bar plots displaying proportion of positive response class for each categorical variable level
make_barplot <- function(catvar, response, prop_data) {
# Filter rows with "Yes" response for standard variables
if (catvar == "raterisk") {
plot_data <- prop_data %>% filter(.data[[response]] == "Yes")
} else {
plot_data <- prop_data %>% filter(.data[[response]] == "Yes")
}
# Create bar plots
plot_data %>%
ggplot(aes(x = reorder(.data[[catvar]], -perc), y = perc, fill = .data[[catvar]])) +
geom_bar(stat = "identity", show.legend = TRUE) +
ylab(paste("Proportion with", capitalize(response))) +
xlab(catvar) +
scale_fill_brewer(palette = "Dark2") +
theme_bw() +
theme(legend.position = "none")
}
# Call the functions to calculate proportions and make bar plots for categorical variables
for (catvar in names(bone_cat)[names(bone_cat) != "fracture"]) {
# Get proportion summaries
prop_summary <- summarize_props(catvar, "fracture", bone)
# Create bar plot
barplt <- make_barplot(catvar, "fracture", prop_summary)
# Display the plot
print(barplt)
# Save the plot
# ggsave(filename = paste0("figures/barPlt_", catvar, ".png"), barplt)
}
# For debugging:
# prop.summary<-bone %>%
# group_by(priorfrac,fracture) %>%
# summarise(cnt=n()) %>%
# mutate(perc=round(cnt/sum(cnt),4))%>%
# arrange(desc(perc))
# prop.summary
Based on graphical evidence, the variables that I think are the most likely to influence the probability of fracture are age, fracscore, and priorfrac. Variables which may be useful are armassist, momfrac, raterisk, and height. bonemed, bonemed_fu, and bonetreat may be useful but confounding. The variables which seem unimportant are site_id, phy_id, weight, bmi, premeno, and smoke.
This runs a PCA to examine the influential explanatory variables in each of the PCs and for use in a scatterplot matrix. Also, it is just for practice.
str(bone_num)
'data.frame': 500 obs. of 7 variables:
$ site_id : int 1 4 6 6 1 5 5 1 1 4 ...
$ phy_id : int 14 284 305 309 37 299 302 36 8 282 ...
$ age : int 62 65 88 82 61 67 84 82 86 58 ...
$ weight : num 70.3 87.1 50.8 62.1 68 68 50.8 40.8 62.6 63.5 ...
$ height : int 158 160 157 160 152 161 150 153 156 166 ...
$ bmi : num 28.2 34 20.6 24.3 29.4 ...
$ fracscore: int 1 2 11 5 1 4 6 7 7 0 ...
# Run PCA on the numeric variables
pc.result<-prcomp(bone_num[, -(1:2)],scale.=TRUE) # remove site_id and phy_id
#Eigen Vectors
pc.result$rotation
PC1 PC2 PC3 PC4 PC5
age 0.4947219 0.46742140 -0.15246583 0.71654567 -0.009160237
weight -0.5273035 0.46578775 -0.08840991 0.03240244 -0.704362523
height -0.2345770 -0.08196149 -0.93823245 0.01885633 0.240042129
bmi -0.4741030 0.51615173 0.24533677 0.05137380 0.667820563
fracscore 0.4442985 0.53984137 -0.16872342 -0.69463484 0.013601399
# Append the diagnosis to the new dataset of PCs for each observation
pc.scores = pc.result$x
pc.scores = data.frame(fracture=bone$fracture, pc.scores)
head(pc.scores)
fracture PC1 PC2 PC3 PC4 PC5
1 No -0.7166719 -0.8721532 0.82346719 0.2202093 -0.001884542
2 No -1.4516177 0.4570325 0.56013019 0.2704474 0.011489794
3 No 3.7564133 1.4509041 -0.35122087 -0.5973179 -0.020109479
4 No 1.5951482 0.4377183 -0.19760955 0.6571483 -0.009795756
5 No -0.5773744 -0.8020883 1.79077290 0.1290996 0.013233417
6 No 0.2086149 -0.2335424 0.02618618 -0.2285344 0.005830377
# Plot PC1 and PC2 color coded by diagnosis
ggplot(data = pc.scores, aes(x = PC1, y = PC2, color = fracture)) +
geom_point() +
scale_color_manual(values = c("blue","red"))
# Plot the first few PCs in a scatterplot matrix
ggpairs(
pc.scores,
aes(color = fracture, alpha = 0.5),
columns = c(2:6)) +
scale_color_manual(values = c("blue","red")) +
scale_fill_manual(values = c("blue", "red"))
I can’t take much away from this PCA. However, what it seems to suggest is that, unsurprisingly, the categorical predictors are very important in separating the response classes. Also, though I haven’t looked very deeply into the possible outliers in the height variable, height seems unimportant in likelihood of fracture. PC3 is heavily influenced by height, and both PC3 and PC5 have virtually no separation in fracture class distributions. Across the other PCs, height has small loadings and seems unimportant.
Here, I’d like to make some plots to get a sense of potential multicollinearity and/or interactions. First, I plot each numerical and categorical variable combination.
# Look at co-variation between numerical and categorical variables
# Function to create LOESS plots with color split by categorical predictor
make_loess_split <- function(numvar, catvar, response, data) {
data %>%
ggplot(aes(x = .data[[numvar]], y = .data[[response]], colour = .data[[catvar]])) +
geom_point(position = position_jitter(height = 0.05), alpha = 0.6) + # Add jitter to points
geom_smooth(method = "loess", formula = "y ~ x", linewidth = 1, span = 0.9) + # Add LOESS curve
ylim(-0.3, 1.3) + # Set consistent y-limits
labs(title = paste(numvar, "vs", response, "by", catvar),
x = numvar,
y = response,
colour = catvar) +
scale_color_brewer(palette = "Dark2") +
theme_bw()
}
# Loop through numerical and categorical predictors
for (numvar in names(bone_num)) {
for (catvar in names(bone_cat)[names(bone_cat) != "fracture"]) {
# Create LOESS plot
loess_plt <- make_loess_split(numvar, catvar, "fracture.num", bone)
# Display the plot
print(loess_plt)
# Save the plot
# ggsave(filename = paste0("figures/loessPltCombo_", numvar, "_by_", catvar, ".png"), loess_plt)
}
}
Based on graphical evidence, the variables that I think are safe to drop from the analysis are site_id and phy_id. There are quite a number of potential interactions between numerical and categorical variables to explore: age/fracscore:priorfrac, weight/bmi with other categorical (meds/treatments, raterisk, priorfrac), height:smoke.
Additionally, there are multicollinearity concerns between age/fracscore/premeno and fracscore/armassist.
Two variables are essential interaction terms of other combinations of variable. bonetreat is a combination of bonemed and bonemed_fu. bmi is weight (\(kg\)) divided by height squared (\(m^2\)).
Next, I plot the proportion of fractures in each pairwise combination of categorical predictors.
make_barchart <- function(var1, var2) {
bone_cat %>%
group_by(.data[[var1]], .data[[var2]], fracture) %>% # Group by predictors and response
summarise(n = n(), .groups = "drop") %>% # Count occurrences for each group
group_by(.data[[var1]], .data[[var2]]) %>% # Group by predictors only
mutate(total = sum(n), # Calculate the total count for each group
prop = n / total, # Calculate the proportion
label = paste0(n, "/", total)) %>% # Create the label as "count/total"
filter(fracture == "Yes") %>% # Keep only proportions for "Yes" fractures
ggplot(aes(x = .data[[var2]], y = prop, fill = .data[[var1]])) + # x-axis: var2, fill by var1
geom_bar(stat = "identity", position = "dodge") + # Use dodge position for comparison
geom_text(aes(label = label), # Add the "count/total" label
position = position_dodge(width = 0.9),
vjust = -0.5, size = 3) + # Adjust vertical position and size
facet_wrap(~fracture, scales = "free", labeller = labeller(fracture = c("Yes" = "Yes Fracture"))) +
labs(title = paste("Proportion of Fractures (Yes) by", var1, "and", var2),
x = var2,
y = "Proportion with Fractures (Yes)",
fill = var1) +
theme_bw() +
theme(axis.text.x = element_text(angle = 45, hjust = 1)) +
scale_fill_brewer(palette = "Dark2")
}
catpredictorNames <- names(bone_cat %>% select(-fracture)) # Exclude the response variable
# Loop through all pairs of categorical predictors
for (i in 1:(length(catpredictorNames) - 1)) {
var1 <- catpredictorNames[i]
for (j in (i + 1):length(catpredictorNames)) {
var2 <- catpredictorNames[j]
# Generate the bar plot
barplt <- make_barchart(var1, var2)
# Display the plot
print(barplt)
# Save the plot
filename <- paste0("figures/barpltCombo_", var1, "_vs_", var2, ".png")
# ggsave(filename, plot = barplt, device = "png")
}
}
For potential interactions, I looked at the patterns between the bars compared across category levels.
There are some variables that may confound one another. Based on the visual evidence, membership in one variables group tells you something about the relationship with the other variable.
Here, I create pairwise scatterplots of the numerical predictors split by response group.
# Function to create scatter plots for numerical variables with grouping by fracture
make_scatterplot <- function(var1, var2, group_var) {
bone %>%
ggplot(aes(x = .data[[var1]], y = .data[[var2]], color = .data[[group_var]])) +
geom_point(alpha = 0.5, position = position_jitter(width = 0.2, height = 0.2)) + # Jitter for visibility
geom_smooth(formula = "y~x", method = "loess", se = TRUE) + # Add LOESS smoothing line
labs(
title = paste(var1, "vs", var2, "by", group_var),
x = var1,
y = var2,
color = group_var
) +
scale_color_manual(values = c("blue3", "red3")) +
theme_bw() +
theme(legend.position = "bottom") # Adjust legend position for clarity
}
# Get the names of numerical variables
num_var_names <- names(bone_num)
# Loop through each pair of numerical variables
for (i in 1:(length(num_var_names) - 1)) {
var1 <- num_var_names[i]
for (j in (i + 1):length(num_var_names)) {
var2 <- num_var_names[j]
# Create scatter plot
scatter <- make_scatterplot(var1, var2, group_var = "fracture")
# Display the plot
print(scatter)
# Save the plot
filename <- paste0("figures/scatterCombo_", var1, "_vs_", var2, ".png")
# ggsave(filename, plot = scatter, device = "png")
}
}
From the paired scatterplots, there is no strong evidence of interactions between numerical variables. The trend between the fracture groups look consistent. However, there are some multicollinearity concerns between age and fracscore and weight and bmi.
Lastly, I confirm multicollinearity suspicions among numerical predictors by creating a correlation and scatterplot matrix.
# This makes a correlation matrix of the numerical variables to confirm suspicions of multicollinearity issues indicated by the scatterplot matrices.
# Calculate correlation matrix for numeric variables
cor_matrix = cor(bone_num)
# Save the correlation plot
# png("figures/correlation_matrix.png", width = 10, height = 8, units = "in", res = 300)
# Visualize the correlation matrix
corrplot(cor_matrix, method = "color", type = "upper",
tl.col = "black", tl.srt = 45,
addCoef.col = "black", number.cex = 0.7)
# dev.off() # Close the device to finalize the file for saving
# Combine numerical predictors with the fracture variable
bone_num_with_fracture <- cbind(bone_num, fracture = bone$fracture)
bone_num_with_fracture$fracture <- factor(bone_num_with_fracture$fracture, levels = c("Yes", "No"))
# Create scatterplot matrix
scatterplot_matrix <- ggpairs(
bone_num_with_fracture,
aes(color = fracture),
lower = list(
continuous = wrap("smooth_loess", se = FALSE, alpha = 0.4) # Scatterplot with LOESS lines
),
diag = list(
continuous = wrap("densityDiag", alpha = 0.6) # Density plots on the diagonal
)) +
scale_color_brewer(palette = 'Set1') +
scale_fill_brewer(palette = 'Set1')
scatterplot_matrix
# Optionally save the plot
# ggsave(filename = "figures/scatterplot_matrix.png", plot = scatterplot_matrix, width = 10, height = 8, dpi = 300)
Multicollinearity concerns:
There is high correlation between bmi and weight and between age and
fracscore. Consider the VIFs or consider only including one variable
from each pair in the model.
Possible interactions:
The pairwise scatterplot of numerical variable do not suggest there is
different behavior between the response classes. The relationships look
very similar.
Summarizing EDA:
For the rest of the model fitting and analysis, I will split the data into a training and validation set.
bone_subset = bone %>% select(-sub_id, -fracture.num) # remove columns prior to feature selection
#Train/Validation Split (team decision to use 70/30 split and seed `1234`)
set.seed(1234) #consider trying different seeds/splits to get a feel for how stable the models are
trainIndex<-createDataPartition(bone$fracture,p=.7,list=F) #p: proportion of data in train; original models built with .8
training<-bone_subset[trainIndex,]
validate<-bone_subset[-trainIndex,]
Here, I use penalized regression for feature selection and stepwise feature selection. I’d like to get a sense of the important predictors. I chose to remove the metadata variables site_id and phy_id and the interaction variables bmi and bonetreat. I commented out the code for stepwise selection, so the output would be less overwhelming.
# logLoss error metric
fitControl<-trainControl(method="repeatedcv",number=10,repeats=1,classProbs=TRUE, summaryFunction=mnLogLoss)
# GLMNET
set.seed(1234)
glmnet.fit<-train(fracture ~ . -site_id -phy_id -bmi -bonetreat,
data=training,
method="glmnet",
trControl=fitControl,
metric="logLoss"
)
coef(glmnet.fit$finalModel,glmnet.fit$finalModel$lambdaOpt)
14 x 1 sparse Matrix of class "dgCMatrix"
s=0.02283267
(Intercept) 0.74937631
priorfracYes 0.44011362
age .
weight .
height -0.01645137
premenoYes .
momfracYes 0.51513183
armassistYes .
smokeYes .
rateriskSame .
rateriskGreater .
fracscore 0.13853960
bonemedYes .
bonemed_fuYes 0.17244372
# includes priorfrac, height, momfrac, fracscore, bonemed_fu
# LASSO
set.seed(1234)
lasso.fit<-train(fracture ~ . -site_id -phy_id -bmi -bonetreat,
data=training,
method="glmnet",
trControl=fitControl,
metric="logLoss",
tuneGrid = expand.grid(alpha = 1, # Lasso penalty
lambda = seq(0.01, 10, length = 200)) # Range of lambda values to test
)
coef(lasso.fit$finalModel,lasso.fit$finalModel$lambdaOpt)
14 x 1 sparse Matrix of class "dgCMatrix"
s=0.06020101
(Intercept) -1.5253090
priorfracYes 0.1442626
age .
weight .
height .
premenoYes .
momfracYes .
armassistYes .
smokeYes .
rateriskSame .
rateriskGreater .
fracscore 0.1009059
bonemedYes .
bonemed_fuYes .
# includes only priorfrac and fracscore
#STEPWISE
# set.seed(1234)
# #note CV and error metric are not really used here, but logLoss is reported for the final model.
# Step.fit<-train(fracture ~ . -site_id -phy_id -bmi -bonetreat,
# data=training,
# method="glmStepAIC",
# trControl=fitControl,
# metric="logLoss")
# coef(Step.fit$finalModel)
# # priorfrac, age, height, momfrac, armassist
# AUROC Curve metric code: twoClassSummary and "ROC" options
fitControl2<-trainControl(method="repeatedcv",number=10,repeats=1,classProbs=TRUE, summaryFunction=twoClassSummary)
# LASSO
set.seed(1234)
lassoroc.fit<-train(fracture ~ . -site_id -phy_id -bmi -bonetreat,
data=training,
method="glmnet",
trControl=fitControl2,
metric="ROC",
tuneGrid = expand.grid(alpha = 1, # Lasso penalty
lambda = seq(0.01, 1, length = 200)) # Range of lambda values to test
)
coef(lassoroc.fit$finalModel,lassoroc.fit$finalModel$lambdaOpt)
14 x 1 sparse Matrix of class "dgCMatrix"
s=0.07964824
(Intercept) -1.371835977
priorfracYes 0.002748291
age .
weight .
height .
premenoYes .
momfracYes .
armassistYes .
smokeYes .
rateriskSame .
rateriskGreater .
fracscore 0.072403610
bonemedYes .
bonemed_fuYes .
# priorfrac, fracscore
# STEPWISE
# set.seed(1234)
# Steproc.fit<-train(fracture ~ . -site_id -phy_id -bmi -bonetreat,
# data=training,
# method="glmStepAIC",
# trControl=fitControl2,
# metric="ROC")
# coef(Step.fit$finalModel)
# # priorfrac, age, height, momfrac, armassist
Originally, I ran the feature selection code above with all variables except the patient id. All variables in the methods above, 3 using a logloss error metric and 2 using AUROC produced these models:
GLMNET which includes all variables except phy_id, weight,
fracscore
LASSO (logloss) removes phy_id, weight, smokeYes, bonemedYes,
bonetreatYes
LASSO (ROC) includes priorfracYes, height, momfracYes, fracscore,
bonemed_fuYes
Stepwise (logloss) includes priorfracYes, age, weight, bmi,
momfracYes, bonemedYes, bonemed_fuYes, bonetreatYes
Stepwise (ROC) includes priorfracYes, age, weight, bmi, momfracYes, bonemedYes, bonemed_fuYes, bonetreatYes
Frequency of variables in the feature selection models:
After additional thought, I removed the site and physician id variables and bmi and bonetreat and reran the feature selection.
GLMNET includes priorfrac, height, momfrac, fracscore,
bonemed_fu
LASSO (logloss & ROC) includes only priorfrac and
fracscore
Stepwise (logloss & ROC) includes priorfrac, age, height, momfrac, armassist
Frequency of variables in the feature selection models:
Here, I explore heatmap clustering mostly just for practice. With so few numerical predictors, I don’t expect it to be very useful for this analysis.
# Just playing around with making a heatmap of the numerical variables
str(bone_num_with_fracture)
'data.frame': 500 obs. of 8 variables:
$ site_id : int 1 4 6 6 1 5 5 1 1 4 ...
$ phy_id : int 14 284 305 309 37 299 302 36 8 282 ...
$ age : int 62 65 88 82 61 67 84 82 86 58 ...
$ weight : num 70.3 87.1 50.8 62.1 68 68 50.8 40.8 62.6 63.5 ...
$ height : int 158 160 157 160 152 161 150 153 156 166 ...
$ bmi : num 28.2 34 20.6 24.3 29.4 ...
$ fracscore: int 1 2 11 5 1 4 6 7 7 0 ...
$ fracture : Factor w/ 2 levels "Yes","No": 2 2 2 2 2 2 2 2 2 2 ...
set.seed(1234)
trainIndex<-createDataPartition(bone_num_with_fracture$fracture,p=.8,list=F) #p: proportion of data in train
dat.train<-bone_num_with_fracture[trainIndex,]
dat.val<-bone_num_with_fracture[-trainIndex,]
#Seperating predictors from response
dat.train.x <- dat.train[,1:7]
dat.train.y <- dat.train$fracture
#Using smaller set of predictors to show clustering results changing
pheatmap(dat.train.x[,3:7],fontsize_row=3,annotation_row=data.frame(fracture=dat.train.y),scale="column",legend=T,color=colorRampPalette(c("blue3","white", "red3"), space = "rgb")(100),annotation_colors=list(fracture=c("No"="white","Yes"="darkgreen")))
Here, I fit a bunch of models based on the EDA, feature selection, and intuition. I compare the number of predictors, logloss error from both the k-fold cross-validation and validation set, AUROC, sensitivity, specificity, PPV, NPV, and prevalence based on a threshold calculated to achieve the highest sum of sensitivity and specificity.
# List of model formulas
models <- list(
# these are the models I worked on with my first feature selection
# fit5 = fracture ~ priorfrac + momfrac + bonemed_fu,
# fit54a = fracture ~ priorfrac + momfrac + bonemed_fu + age,
# fit54b = fracture ~ priorfrac + momfrac + bonemed_fu + bmi,
# fit54 = fracture ~ priorfrac + momfrac + bonemed_fu + age + bmi,
fit543a = fracture ~ priorfrac + momfrac + bonemed_fu + age + bmi + height,
# fit543b = fracture ~ priorfrac + momfrac + bonemed_fu + age + bmi + bonemed,
# fit543c = fracture ~ priorfrac + momfrac + bonemed_fu + age + bmi + bonetreat,
# fit543ca = fracture ~ priorfrac + momfrac + bonemed_fu + age + bmi + height + bonetreat,
fit5432c1 = fracture ~ priorfrac + momfrac + bonemed_fu + age + bmi + height + bonetreat + armassist,
# fit5432e2 = fracture ~ priorfrac + momfrac + bonemed_fu + age + bmi + bonetreat + fracscore,
# fit5432ce2 = fracture ~ priorfrac + momfrac + bonemed_fu + age + bmi + bonetreat + armassist + fracscore, # high VIF
# fit5432be2 = fracture ~ priorfrac + momfrac + bonemed_fu + age + bmi + bonetreat + premeno + fracscore,
fit5432de2 = fracture ~ priorfrac + momfrac + bonemed_fu + age + bmi + bonetreat + raterisk + fracscore,
fit5432bde2 = fracture ~ priorfrac + momfrac + bonemed_fu + age + bmi + bonetreat + premeno + raterisk + fracscore,
# fit5432bcde2 = fracture ~ priorfrac + momfrac + bonemed_fu + age + bmi + bonetreat + premeno + armassist + raterisk + fracscore,
# fit54321abcde2 = fracture ~ priorfrac + momfrac + bonemed_fu + age + bmi + bonetreat + premeno + armassist + raterisk + fracscore + smoke, # high VIF
# fitglm = fracture ~ priorfrac + momfrac + bonemed_fu + age + bmi + height + bonemed + bonetreat + premeno + armassist + raterisk + smoke,
# fitstep = fracture ~ priorfrac + age + weight + bmi + momfrac + bonemed + bonemed_fu + bonetreat,
fitlassoroc = fracture ~ priorfrac + height + momfrac + fracscore + bonemed_fu,
fitlasso = fracture ~ priorfrac + momfrac + bonemed_fu + age + bmi + height + premeno + armassist + raterisk + smoke,
# these I decided to try after thinking more about the variables that should be excluded
# fit_all = fracture ~ . -site_id -phy_id -bmi -bonetreat, # high VIF
fit_noage = fracture ~ . -site_id -phy_id -bmi -bonetreat -age, # reduce multicollinearity
fit_noscore = fracture ~ . -site_id -phy_id -bmi -bonetreat -fracscore, # reduce multicollinearity
fit_fsnoage = fracture ~ priorfrac + height + momfrac + fracscore + bonemed_fu + armassist, # 2nd feature sel model no age,
fit_fsnoscore = fracture ~ priorfrac + height + momfrac + bonemed_fu + age + armassist # 2nd feature sel model no age
)
# Set seed for reproducibility
set.seed(1234)
# Create a list to store CV results for each model
cv_results <- lapply(names(models), function(model_name) {
model_formula <- models[[model_name]]
train(
model_formula,
data = training,
method = "glm",
family = "binomial",
trControl = fitControl,
metric = "logLoss"
)
})
# Predict probabilities and calculate metrics for validation set
results <- lapply(seq_along(models), function(i) {
model_name <- names(models)[i]
model_formula <- models[[model_name]]
# Count the number of predictors in the model
num_predictors <- length(all.vars(model_formula)) - 1 # Exclude the response variable
# Fit the model on the training data
model <- glm(model_formula, data = training, family = "binomial")
# Predict probabilities on the validation set
pred_probs <- predict(model, newdata = validate, type = "response")
# Calculate AUC
roc_obj <- pROC::roc(response = validate$fracture, predictor = pred_probs, levels = c("No", "Yes"))
auc_value <- as.numeric(pROC::auc(roc_obj))
# Calculate LogLoss for validation
logloss <- Metrics::logLoss(actual = ifelse(validate$fracture == "Yes", 1, 0), predicted = pred_probs)
# Extract CV LogLoss
cv_logloss <- min(cv_results[[i]]$results$logLoss)
# Calculate VIF
vifs <- vif(model)
max_vif <- max(vifs, na.rm = TRUE)
# Perform Hosmer-Lemeshow test with dynamically adjusted groups
n_groups <- min(10, floor(nrow(training) / 5)) # Ensure at least 5 observations per group
hoslem_test <- tryCatch(
hoslem.test(model$y, fitted(model), g = n_groups),
error = function(e) NULL
)
hoslem_pval <- if (!is.null(hoslem_test)) hoslem_test$p.value else NA
# Get the best thresholds using Youden's method
youden_coords <- tryCatch(
coords(roc_obj, "best", ret = c("threshold", "sensitivity", "specificity"), best.method = "youden", transpose = TRUE),
error = function(e) NULL
)
# Initialize thresholds and metrics
best_threshold <- NA
sensitivity <- NA
specificity <- NA
# Process Youden's Index result
if (!is.null(youden_coords)) {
if (is.atomic(youden_coords)) {
# Single result
best_threshold <- youden_coords["threshold"]
sensitivity <- youden_coords["sensitivity"]
specificity <- youden_coords["specificity"]
} else if (nrow(youden_coords) > 1) {
# Tied thresholds: Select the most balanced sensitivity/specificity
youden_coords$balance <- abs(youden_coords$sensitivity - youden_coords$specificity)
best_coords <- youden_coords[which.min(youden_coords$balance), ]
best_threshold <- best_coords["threshold"]
sensitivity <- best_coords["sensitivity"]
specificity <- best_coords["specificity"]
}
}
# Fallback to closest.topleft if Youden's fails
if (is.na(best_threshold)) {
closest_coords <- tryCatch(
coords(roc_obj, "best", ret = c("threshold", "sensitivity", "specificity"), best.method = "closest.topleft", transpose = TRUE),
error = function(e) NULL
)
if (!is.null(closest_coords)) {
best_threshold <- closest_coords["threshold"]
sensitivity <- closest_coords["sensitivity"]
specificity <- closest_coords["specificity"]
}
}
# Calculate additional metrics: PPV, NPV, Prevalence
ppv <- npv <- prevalence <- NA
if (!is.na(best_threshold)) {
predicted_class <- ifelse(pred_probs >= best_threshold, "Yes", "No")
confusion <- table(Predicted = predicted_class, Actual = validate$fracture)
ppv <- ifelse("Yes" %in% rownames(confusion), confusion["Yes", "Yes"] / sum(confusion["Yes", ]), NA)
npv <- ifelse("No" %in% rownames(confusion), confusion["No", "No"] / sum(confusion["No", ]), NA)
}
prevalence <- mean(validate$fracture == "Yes")
# Return metrics
list(
Model = model_name,
AUROC = auc_value,
LogLoss_Validation = logloss,
LogLoss_CV = cv_logloss,
MaxVIF = max_vif,
HoslemPValue = hoslem_pval, # Add Hosmer-Lemeshow p-value
Sensitivity = sensitivity,
Specificity = specificity,
PPV = ppv,
NPV = npv,
BestThreshold = best_threshold,
NumPredictors = num_predictors,
Prevalence = prevalence
)
})
Warning in coords.roc(roc_obj, "best", ret = c("threshold", "sensitivity", :
'transpose=TRUE' is deprecated. Only 'transpose=FALSE' will be allowed in a
future version.
Warning in coords.roc(roc_obj, "best", ret = c("threshold", "sensitivity", :
'transpose=TRUE' is deprecated. Only 'transpose=FALSE' will be allowed in a
future version.
Warning in coords.roc(roc_obj, "best", ret = c("threshold", "sensitivity", :
'transpose=TRUE' is deprecated. Only 'transpose=FALSE' will be allowed in a
future version.
Warning in coords.roc(roc_obj, "best", ret = c("threshold", "sensitivity", :
'transpose=TRUE' is deprecated. Only 'transpose=FALSE' will be allowed in a
future version.
Warning in coords.roc(roc_obj, "best", ret = c("threshold", "sensitivity", :
'transpose=TRUE' is deprecated. Only 'transpose=FALSE' will be allowed in a
future version.
Warning in coords.roc(roc_obj, "best", ret = c("threshold", "sensitivity", :
'transpose=TRUE' is deprecated. Only 'transpose=FALSE' will be allowed in a
future version.
Warning in coords.roc(roc_obj, "best", ret = c("threshold", "sensitivity", :
'transpose=TRUE' is deprecated. Only 'transpose=FALSE' will be allowed in a
future version.
Warning in coords.roc(roc_obj, "best", ret = c("threshold", "sensitivity", :
'transpose=TRUE' is deprecated. Only 'transpose=FALSE' will be allowed in a
future version.
Warning in coords.roc(roc_obj, "best", ret = c("threshold", "sensitivity", :
'transpose=TRUE' is deprecated. Only 'transpose=FALSE' will be allowed in a
future version.
Warning in coords.roc(roc_obj, "best", ret = c("threshold", "sensitivity", :
'transpose=TRUE' is deprecated. Only 'transpose=FALSE' will be allowed in a
future version.
# Convert the results into a data frame for easier reading
results_df <- do.call(rbind, lapply(results, as.data.frame))
# Sort the results
results_df_sorted <- results_df[order(-results_df$AUROC, results_df$LogLoss_Validation), ]
# Print the sorted data frame
print(results_df_sorted)
Model AUROC LogLoss_Validation LogLoss_CV MaxVIF
sensitivity6 fit_noage 0.7434846 0.5111676 0.5449074 3.253960
sensitivity5 fitlasso 0.7393822 0.5108552 0.5421378 2.000000
sensitivity8 fit_fsnoage 0.7388996 0.5150674 0.5255116 2.025420
sensitivity7 fit_noscore 0.7362452 0.5124483 0.5424825 3.239427
sensitivity9 fit_fsnoscore 0.7322635 0.5163471 0.5309244 1.153747
sensitivity4 fitlassoroc 0.7322635 0.5180295 0.5234153 1.298430
sensitivity2 fit5432de2 0.7258687 0.5145238 0.5310958 6.205696
sensitivity fit543a 0.7229730 0.5212556 0.5275913 1.149200
sensitivity3 fit5432bde2 0.7188707 0.5161157 0.5377821 6.128962
sensitivity1 fit5432c1 0.7171815 0.5187222 0.5208754 3.767293
HoslemPValue Sensitivity Specificity PPV NPV
sensitivity6 0.9027294 0.8378378 0.5892857 0.4025974 0.9166667
sensitivity5 0.5321824 0.8648649 0.6071429 0.4210526 0.9315068
sensitivity8 0.6696229 0.8378378 0.6339286 0.4305556 0.9220779
sensitivity7 0.7138617 0.8378378 0.5892857 0.4025974 0.9166667
sensitivity9 0.7410397 0.8378378 0.6071429 0.4133333 0.9189189
sensitivity4 0.2657679 0.7837838 0.6696429 0.4393939 0.9036145
sensitivity2 0.8514311 0.8648649 0.5892857 0.4102564 0.9295775
sensitivity 0.2202417 0.7837838 0.6785714 0.4461538 0.9047619
sensitivity3 0.8454629 0.8918919 0.5446429 0.3928571 0.9384615
sensitivity1 0.3174713 0.8108108 0.6428571 0.4285714 0.9113924
BestThreshold NumPredictors Prevalence
sensitivity6 0.1903572 6 0.2483221
sensitivity5 0.2048897 10 0.2483221
sensitivity8 0.1992032 6 0.2483221
sensitivity7 0.1918706 6 0.2483221
sensitivity9 0.1962048 6 0.2483221
sensitivity4 0.2338080 5 0.2483221
sensitivity2 0.1941760 8 0.2483221
sensitivity 0.2358142 6 0.2483221
sensitivity3 0.1926877 9 0.2483221
sensitivity1 0.1984304 8 0.2483221
# Specify the models you want to plot
models_to_plot <- c("fit_noage", "fit_fsnoage", "fit_noscore", "fit_fsnoscore", "fit5432bde2", "fit5432c1", "fitlasso", "fitlassoroc")
# Filter results for the specified models
filtered_results <- results_df_sorted[results_df_sorted$Model %in% models_to_plot, ]
# Create a plot with different colors for each model
ggplot(filtered_results, aes(x = Sensitivity, y = PPV, color = Model, label = Model)) +
geom_point(size = 3) +
geom_text(vjust = -0.5, hjust = 0.5, size = 3) +
labs(
title = "Sensitivity vs. PPV for Selected Models",
x = "Sensitivity",
y = "Positive Predictive Value (PPV)",
color = "Model"
) +
theme_bw() +
scale_color_brewer(palette = "Dark2")
# My first best models for getting a reasonably high sensitivity >.6, followed by PPV and the other metrics were:
# fit5432bde2, lassoroc (nearly as good, much smaller model), fitlasso (for optimizing PPV, with .6 specificity)
# I had picked one: finalObj1Model = fracture ~ priorfrac + momfrac + bonemed_fu + age + bmi + bonetreat + premeno + raterisk + fracscore #fit5432bde2
# After reconsidering variable selection good candidates look like: fit_fsnoage, fitlasso, fit_fsnoscore, fitnoage
Dr. Turner suggested that I find the threshold for each model at for a particular value for one metric, for example PPV of 0.85, so that it would be easier to compare the other metrics for each model. These models couldn’t reach a PPV of 0.85, so I choose 0.5. I repeated the model fitting above but instead of choosing a threshold based on the highest combined sensitivity and specificity, choosing it for a PPV of 0.5.
# List of model formulas
models <- list(
tue = fracture ~ priorfrac + age + weight + height + momfrac + armassist + raterisk + bonemed,
tom = fracture ~ priorfrac + fracscore + weight + height + momfrac + armassist + raterisk + bonemed_fu,
fit5432bde2 = fracture ~ priorfrac + momfrac + bonemed_fu + age + bmi + bonetreat + premeno + raterisk + fracscore,
# fitlassoroc = fracture ~ priorfrac + height + momfrac + fracscore + bonemed_fu,
# fitlasso = fracture ~ priorfrac + momfrac + bonemed_fu + age + bmi + height + premeno + armassist + raterisk + smoke,
# fitlasso_nobmi = fracture ~ priorfrac + momfrac + bonemed_fu + age + height + premeno + armassist + raterisk + smoke, #10
# these I decided to try after thinking more about the variables that should be excluded
# fit_all = fracture ~ . -site_id -phy_id -bmi -bonetreat, # high VIF
fit_noage = fracture ~ priorfrac + weight + height + premeno + momfrac + armassist + smoke + raterisk + fracscore + bonemed + bonemed_fu,
fit_noage2 = fracture ~ priorfrac + weight + height + momfrac + armassist + smoke + raterisk + fracscore + bonemed + bonemed_fu,
fit_noagea = fracture ~ priorfrac + weight + height + premeno + momfrac + armassist + smoke + raterisk + fracscore + bonemed + bonemed_fu,
fit_noage2b = fracture ~ priorfrac + weight + height + momfrac + armassist + smoke + raterisk + fracscore + bonemed,
fit_noscore2b = fracture ~ priorfrac + weight + height + momfrac + armassist + smoke + raterisk + age + bonemed,
fit_noage4a = fracture ~ priorfrac + weight + height + momfrac + armassist + raterisk + fracscore + bonemed_fu,
fit_noscore3a = fracture ~ priorfrac + weight + height + premeno + momfrac + armassist + raterisk + age + bonemed_fu
# fit_noscore = fracture ~ priorfrac + weight + height + premeno + momfrac + armassist + smoke + raterisk + age + bonemed + bonemed_fu,
# fit_noage3 = fracture ~ priorfrac + weight + height + premeno + momfrac + armassist + raterisk + fracscore + bonemed + bonemed_fu, #9
# fit_noscore3 = fracture ~ priorfrac + weight + height + premeno + momfrac + armassist + raterisk + age + bonemed + bonemed_fu,
# fit_noage4 = fracture ~ priorfrac + weight + height + momfrac + armassist + raterisk + fracscore + bonemed + bonemed_fu,
# fit_noscore4 = fracture ~ priorfrac + weight + height + momfrac + armassist + raterisk + age + bonemed + bonemed_fu,
# fit_noage2a = fracture ~ priorfrac + weight + height + momfrac + armassist + smoke + raterisk + fracscore + bonemed_fu,
# fit_noscore4a = fracture ~ priorfrac + weight + height + momfrac + armassist + raterisk + age + bonemed_fu,
# fit_noageb = fracture ~ priorfrac + weight + height + premeno + momfrac + armassist + smoke + raterisk + fracscore + bonemed,
# fit_noage3b = fracture ~ priorfrac + weight + height + premeno + momfrac + armassist + raterisk + fracscore + bonemed,
# fit_noage4b = fracture ~ priorfrac + weight + height + momfrac + armassist + raterisk + fracscore + bonemed,
# fit_fsnoage = fracture ~ priorfrac + height + momfrac + fracscore + bonemed_fu + armassist # 2nd feature sel model no age
# fit_fsnoscore = fracture ~ priorfrac + height + momfrac + age + bonemed_fu + armassist # 2nd feature sel model no fracscore
)
# Set seed for reproducibility
set.seed(1234)
# Set up trainControl for repeated CV
fitControl <- trainControl(
method = "repeatedcv",
number = 10,
repeats = 1,
classProbs = TRUE,
summaryFunction = mnLogLoss
)
# Create a list to store CV results for each model
cv_results <- lapply(names(models), function(model_name) {
model_formula <- models[[model_name]]
train(
model_formula,
data = training,
method = "glm",
family = "binomial",
trControl = fitControl,
metric = "logLoss"
)
})
results <- lapply(seq_along(models), function(i) {
model_name <- names(models)[i]
model_formula <- models[[model_name]]
# Count the number of predictors in the model
num_predictors <- length(all.vars(model_formula)) - 1 # Exclude the response variable
# Fit the model on the training data
model <- glm(model_formula, data = training, family = "binomial")
# Predict probabilities on the validation set
pred_probs <- predict(model, newdata = validate, type = "response")
# Calculate AUC
roc_obj <- pROC::roc(response = validate$fracture, predictor = pred_probs, levels = c("No", "Yes"))
auc_value <- as.numeric(pROC::auc(roc_obj))
# Calculate LogLoss for validation
logloss <- Metrics::logLoss(actual = ifelse(validate$fracture == "Yes", 1, 0), predicted = pred_probs)
# Extract CV LogLoss
cv_logloss <- min(cv_results[[i]]$results$logLoss)
# Calculate VIF
vifs <- vif(model)
max_vif <- max(vifs, na.rm = TRUE)
# Perform Hosmer-Lemeshow test with dynamic bin adjustment
n_groups <- min(10, floor(nrow(training) / 5)) # At least 5 observations per group
hoslem_test <- tryCatch(
hoslem.test(model$y, fitted(model), g = n_groups),
error = function(e) NULL
)
hoslem_pval <- if (!is.null(hoslem_test)) hoslem_test$p.value else NA
# Find the threshold for desired PPV (e.g. 0.85)
desired_ppv <- 0.50 # Adjusted for achievable PPV
all_thresholds <- seq(0, 1, by = 0.01) # Search thresholds from 0 to 1
ppv_threshold <- NA
closest_ppv <- Inf
best_threshold <- NA
for (threshold in all_thresholds) {
predicted_class <- ifelse(pred_probs >= threshold, "Yes", "No")
confusion <- table(Predicted = predicted_class, Actual = validate$fracture)
# Calculate PPV
current_ppv <- ifelse("Yes" %in% rownames(confusion), confusion["Yes", "Yes"] / sum(confusion["Yes", ]), NA)
if (!is.na(current_ppv)) {
# Check if the current threshold matches the desired PPV within a margin
if (abs(current_ppv - desired_ppv) < 0.01) {
ppv_threshold <- threshold
break
}
# Track the closest PPV if no exact match is found
if (abs(current_ppv - desired_ppv) < abs(closest_ppv - desired_ppv)) {
closest_ppv <- current_ppv
best_threshold <- threshold
}
}
}
# If no exact PPV match, use the closest threshold
if (is.na(ppv_threshold)) {
ppv_threshold <- best_threshold
}
# Calculate Sensitivity, Specificity, PPV, and NPV at the PPV threshold
sensitivity <- specificity <- ppv <- npv <- NA
if (!is.na(ppv_threshold)) {
predicted_class <- ifelse(pred_probs >= ppv_threshold, "Yes", "No")
confusion <- table(Predicted = predicted_class, Actual = validate$fracture)
sensitivity <- confusion["Yes", "Yes"] / sum(confusion[, "Yes"])
specificity <- confusion["No", "No"] / sum(confusion[, "No"])
ppv <- confusion["Yes", "Yes"] / sum(confusion["Yes", ])
npv <- confusion["No", "No"] / sum(confusion["No", ])
}
prevalence <- mean(validate$fracture == "Yes")
# Return metrics
list(
Model = model_name,
NumPredictors = num_predictors,
AUROC = auc_value,
LogLoss_Validation = logloss,
LogLoss_CV = cv_logloss,
MaxVIF = max_vif,
HoslemPValue = hoslem_pval, # Hosmer-Lemeshow p-value
Sensitivity = sensitivity,
Specificity = specificity,
PPV = ppv,
NPV = npv,
PPV_Threshold = ppv_threshold
)
})
# Convert the results into a data frame for easier reading
results_df <- do.call(rbind, lapply(results, as.data.frame))
# Sort the results
results_df_sorted <- results_df[order(-results_df$Sensitivity, -results_df$AUROC, results_df$LogLoss_Validation), ]
# Print the sorted data frame
print(results_df_sorted)
Model NumPredictors AUROC LogLoss_Validation LogLoss_CV MaxVIF
5 fit_noage2 10 0.7446911 0.5106273 0.5455846 3.216538
4 fit_noage 11 0.7434846 0.5111676 0.5501318 3.253960
6 fit_noagea 11 0.7434846 0.5111676 0.5450384 3.253960
7 fit_noage2b 9 0.7388996 0.5155075 0.5374377 2.615609
8 fit_noscore2b 9 0.7355212 0.5163885 0.5362377 2.000000
3 fit5432bde2 9 0.7188707 0.5161157 0.5339928 6.128962
2 tom 8 0.7468629 0.5106481 0.5331435 2.594040
9 fit_noage4a 8 0.7468629 0.5106481 0.5359266 2.594040
10 fit_noscore3a 9 0.7398649 0.5119081 0.5403481 2.000000
1 tue 8 0.7345560 0.5172054 0.5330209 2.000000
HoslemPValue Sensitivity Specificity PPV NPV PPV_Threshold
5 0.9142989 0.5135135 0.8303571 0.5 0.8378378 0.32
4 0.9027294 0.5135135 0.8303571 0.5 0.8378378 0.32
6 0.9027294 0.5135135 0.8303571 0.5 0.8378378 0.32
7 0.3991309 0.4324324 0.8571429 0.5 0.8205128 0.36
8 0.4142213 0.4324324 0.8571429 0.5 0.8205128 0.34
3 0.8454629 0.2972973 0.9017857 0.5 0.7952756 0.43
2 0.8162808 0.2432432 0.9196429 0.5 0.7862595 0.45
9 0.8162808 0.2432432 0.9196429 0.5 0.7862595 0.45
10 0.3103749 0.2432432 0.9196429 0.5 0.7862595 0.45
1 0.3341375 0.1891892 0.9375000 0.5 0.7777778 0.50
# I think the best model for getting a reasonably high sensitivity, PPV and the other metrics is fit_noage = fracture ~ priorfrac + weight + height + premeno + momfrac + armassist + smoke + raterisk + fracscore + bonemed + bonemed_fu.
After adjusting the model’s prediction thresholds based on a constant
PPV, this model seems to perform the best, with a comparatively high
AUROC and high sensitivity at a fixed PPV. The VIF values are low and
the Hosmer-Lemeshow test doesn’t suggest evidence against goodness of
fit. It excludes site_id, phy_id, bmi and bonetreat (they are
interactions with variables I kept), and age (it is highly correlated
with fracscore, which performed better).
fracture ~ priorfrac + weight + height + premeno + momfrac + armassist + smoke + raterisk + fracscore + bonemed + bonemed_fu
Here, I want to rerun the models which include height without the extreme value to compare their performance before and after.
#Train/Validation split without low or high heights
training2<-bone_subset %>% filter(!(bone$height < 140 | bone$height > 190))
validate2<-bone_subset %>% filter(!(bone$height < 140 | bone$height > 190))
# List of model formulas
models <- list(
tue = fracture ~ priorfrac + age + weight + height + momfrac + armassist + raterisk + bonemed,
tom = fracture ~ priorfrac + fracscore + weight + height + momfrac + armassist + raterisk + bonemed_fu,
fit543a = fracture ~ priorfrac + momfrac + bonemed_fu + age + bmi + height,
# fit5432bde2 = fracture ~ priorfrac + momfrac + bonemed_fu + age + bmi + bonetreat + premeno + raterisk + fracscore, # no height
fitlassoroc = fracture ~ priorfrac + height + momfrac + fracscore + bonemed_fu,
fitlasso = fracture ~ priorfrac + momfrac + bonemed_fu + age + bmi + height + premeno + armassist + raterisk + smoke,
fitlasso_nobmi = fracture ~ priorfrac + momfrac + bonemed_fu + age + height + premeno + armassist + raterisk + smoke,
# these I decided to try after thinking more about the variables that should be excluded
# fit_all = fracture ~ . -site_id -phy_id -bmi -bonetreat, # high VIF
fit_noage = fracture ~ priorfrac + weight + height + premeno + momfrac + armassist + smoke + raterisk + fracscore + bonemed + bonemed_fu, # -age, reduce multicollinearity
# fit_noscore = fracture ~ . -site_id -phy_id -bmi -bonetreat -fracscore, # reduce multicollinearity
fit_fsnoage = fracture ~ priorfrac + height + momfrac + fracscore + bonemed_fu + armassist, # 2nd feature sel model no age,
fit_fsnoscore = fracture ~ priorfrac + height + momfrac + bonemed_fu + age + armassist # 2nd feature sel model no age
)
# Set seed for reproducibility
set.seed(1234)
# Set up trainControl for repeated CV
fitControl <- trainControl(
method = "repeatedcv",
number = 10,
repeats = 1,
classProbs = TRUE,
summaryFunction = mnLogLoss
)
# Create a list to store CV results for each model
cv_results <- lapply(names(models), function(model_name) {
model_formula <- models[[model_name]]
train(
model_formula,
data = training2, # Updated to training2
method = "glm",
family = "binomial",
trControl = fitControl,
metric = "logLoss"
)
})
results <- lapply(seq_along(models), function(i) {
model_name <- names(models)[i]
model_formula <- models[[model_name]]
# Count the number of predictors in the model
num_predictors <- length(all.vars(model_formula)) - 1 # Exclude the response variable
# Fit the model on the training data
model <- glm(model_formula, data = training2, family = "binomial") # Updated to training2
# Predict probabilities on the validation set
pred_probs <- predict(model, newdata = validate2, type = "response") # Updated to validate2
# Calculate AUC
roc_obj <- pROC::roc(response = validate2$fracture, predictor = pred_probs, levels = c("No", "Yes")) # Updated to validate2
auc_value <- as.numeric(pROC::auc(roc_obj))
# Calculate LogLoss for validation
logloss <- Metrics::logLoss(actual = ifelse(validate2$fracture == "Yes", 1, 0), predicted = pred_probs) # Updated to validate2
# Extract CV LogLoss
cv_logloss <- min(cv_results[[i]]$results$logLoss)
# Calculate VIF
vifs <- vif(model)
max_vif <- max(vifs, na.rm = TRUE)
# Perform Hosmer-Lemeshow test with dynamic bin adjustment
n_groups <- min(10, floor(nrow(training2) / 5)) # Updated to training2
hoslem_test <- tryCatch(
hoslem.test(model$y, fitted(model), g = n_groups),
error = function(e) NULL
)
hoslem_pval <- if (!is.null(hoslem_test)) hoslem_test$p.value else NA
# Find the threshold for desired PPV (e.g. 0.85)
desired_ppv <- 0.50 # Adjusted for achievable PPV
all_thresholds <- seq(0, 1, by = 0.01) # Search thresholds from 0 to 1
ppv_threshold <- NA
closest_ppv <- Inf
best_threshold <- NA
for (threshold in all_thresholds) {
predicted_class <- ifelse(pred_probs >= threshold, "Yes", "No")
confusion <- table(Predicted = predicted_class, Actual = validate2$fracture) # Updated to validate2
# Calculate PPV
current_ppv <- ifelse("Yes" %in% rownames(confusion), confusion["Yes", "Yes"] / sum(confusion["Yes", ]), NA)
if (!is.na(current_ppv)) {
# Check if the current threshold matches the desired PPV within a margin
if (abs(current_ppv - desired_ppv) < 0.01) {
ppv_threshold <- threshold
break
}
# Track the closest PPV if no exact match is found
if (abs(current_ppv - desired_ppv) < abs(closest_ppv - desired_ppv)) {
closest_ppv <- current_ppv
best_threshold <- threshold
}
}
}
# If no exact PPV match, use the closest threshold
if (is.na(ppv_threshold)) {
ppv_threshold <- best_threshold
}
# Calculate Sensitivity, Specificity, PPV, and NPV at the PPV threshold
sensitivity <- specificity <- ppv <- npv <- NA
if (!is.na(ppv_threshold)) {
predicted_class <- ifelse(pred_probs >= ppv_threshold, "Yes", "No")
confusion <- table(Predicted = predicted_class, Actual = validate2$fracture) # Updated to validate2
sensitivity <- confusion["Yes", "Yes"] / sum(confusion[, "Yes"])
specificity <- confusion["No", "No"] / sum(confusion[, "No"])
ppv <- confusion["Yes", "Yes"] / sum(confusion["Yes", ])
npv <- confusion["No", "No"] / sum(confusion["No", ])
}
prevalence <- mean(validate2$fracture == "Yes") # Updated to validate2
# Return metrics
list(
Model = model_name,
NumPredictors = num_predictors,
AUROC = auc_value,
LogLoss_Validation = logloss,
LogLoss_CV = cv_logloss,
MaxVIF = max_vif,
HoslemPValue = hoslem_pval, # Hosmer-Lemeshow p-value
Sensitivity = sensitivity,
Specificity = specificity,
PPV = ppv,
NPV = npv,
PPV_Threshold = ppv_threshold
)
})
# Convert the results into a data frame for easier reading
results_df <- do.call(rbind, lapply(results, as.data.frame))
# Sort the results
results_df_sorted <- results_df[order(-results_df$Sensitivity, -results_df$AUROC, results_df$LogLoss_Validation), ]
# Print the sorted data frame
print(results_df_sorted)
Model NumPredictors AUROC LogLoss_Validation LogLoss_CV MaxVIF
5 fitlasso 10 0.7270787 0.5017758 0.5267315 2.000000
2 tom 8 0.7243402 0.5036073 0.5200329 2.601416
7 fit_noage 11 0.7282646 0.5021859 0.5337317 3.384928
1 tue 8 0.7232405 0.5056559 0.5280232 2.000000
9 fit_fsnoscore 6 0.7213968 0.5068034 0.5180951 1.153743
8 fit_fsnoage 6 0.7212136 0.5076248 0.5234969 2.019138
4 fitlassoroc 5 0.7191974 0.5091290 0.5198037 1.322911
6 fitlasso_nobmi 9 0.7307012 0.5032898 0.5303461 2.000000
3 fit543a 6 0.7098823 0.5086573 0.5213068 1.190909
HoslemPValue Sensitivity Specificity PPV NPV PPV_Threshold
5 0.39409681 0.4838710 0.8342246 0.4918033 0.8297872 0.33
2 0.09267098 0.4838710 0.8342246 0.4918033 0.8297872 0.33
7 0.13340208 0.4596774 0.8422460 0.4913793 0.8246073 0.34
1 0.41711263 0.4354839 0.8502674 0.4909091 0.8195876 0.34
9 0.78307274 0.4032258 0.8609626 0.4901961 0.8131313 0.36
8 0.50459409 0.3064516 0.9010695 0.5066667 0.7966903 0.40
4 0.38981163 0.2903226 0.9064171 0.5070423 0.7939110 0.40
6 0.01578518 0.2822581 0.9090909 0.5072464 0.7925408 0.43
3 0.22735424 0.2500000 0.9197861 0.5081967 0.7871854 0.44
Removing the two extreme height observations does not seem to improve the performance of the models that include height. It seems unlikely that 2 points out of 500 would have a large impact, so I think for now sticking with the previously chosen model would be a good bet.
Here, I want to optimize the threshold of the chosen model(s) to achieve the highest PPV possible with a specificity of at least 0.80. I want to ensure that at least 80% of the patients at high risk of fractures receive treatment and minimize the number of women at low risk that are treated unnecessarily.
# # Tue's
# final_model <- glm(
# formula = fracture ~ priorfrac + age + weight + height + momfrac + armassist + raterisk + bonemed,
# data = training,
# family = "binomial"
# )
# # Tom's
# final_model <- glm(
# formula = fracture ~ priorfrac + fracscore + weight + height + momfrac + armassist + raterisk + bonemed_fu,
# data = training,
# family = "binomial"
# )
# Fit the model on the training data
final_model <- glm(
formula = fracture ~ priorfrac + weight + height + premeno + momfrac + armassist + smoke + raterisk + fracscore + bonemed + bonemed_fu,
data = training,
family = "binomial"
)
# Predict probabilities on the validation data
pred_probs <- predict(final_model, newdata = validate, type = "response")
# Define a target Sensitivity threshold
target_sensitivity <- 0.8 # Adjust to 0.75, 0.85, or desired value
# Function to calculate performance metrics for a given threshold
calculate_metrics <- function(threshold, pred_probs, actual) {
# Predicted classes based on threshold
pred_class <- ifelse(pred_probs > threshold, "Yes", "No")
# Ensure factors have the same levels
actual <- factor(actual, levels = c("No", "Yes"))
pred_class <- factor(pred_class, levels = c("No", "Yes"))
# Generate the confusion matrix
cm <- table(Predicted = pred_class, Actual = actual)
# Extract counts from the confusion matrix
tp <- cm["Yes", "Yes"] # True Positives
fp <- cm["Yes", "No"] # False Positives
tn <- cm["No", "No"] # True Negatives
fn <- cm["No", "Yes"] # False Negatives
# Safely calculate metrics
ppv <- ifelse((tp + fp) > 0, tp / (tp + fp), NA) # Positive Predictive Value (Precision)
sensitivity <- ifelse((tp + fn) > 0, tp / (tp + fn), NA) # Sensitivity (Recall)
specificity <- ifelse((tn + fp) > 0, tn / (tn + fp), NA) # Specificity
accuracy <- ifelse((tp + tn + fp + fn) > 0, (tp + tn) / (tp + tn + fp + fn), NA) # Accuracy
# Return metrics as a data frame
metrics <- data.frame(
Threshold = threshold,
Sensitivity = sensitivity,
Specificity = specificity,
PPV = ppv,
Accuracy = accuracy
)
return(metrics)
}
# Generate metrics for a refined range of thresholds
thresholds <- seq(0.1, 0.4, by = 0.01) # Narrowed range
metrics_list <- lapply(thresholds, calculate_metrics, pred_probs = pred_probs, actual = validate$fracture)
metrics_df <- do.call(rbind, metrics_list)
# Filter metrics for Sensitivity >= target_sensitivity
valid_metrics <- metrics_df[metrics_df$Sensitivity >= target_sensitivity, ]
# Ensure there are valid thresholds
if (nrow(valid_metrics) == 0) {
stop("No thresholds meet the target Sensitivity. Try lowering target_sensitivity.")
}
# Find the best threshold (maximizing PPV)
best_threshold <- valid_metrics$Threshold[which.max(valid_metrics$PPV)]
best_metrics <- valid_metrics[valid_metrics$Threshold == best_threshold, ]
# Output the best threshold and its metrics
cat("Best threshold with Sensitivity >=", target_sensitivity, ":\n")
Best threshold with Sensitivity >= 0.8 :
print(best_metrics)
Threshold Sensitivity Specificity PPV Accuracy
10 0.19 0.8378378 0.5892857 0.4025974 0.6510067
# Plot metrics to visualize trade-offs
metrics_df_long <- melt(metrics_df, id.vars = "Threshold", variable.name = "Metric", value.name = "Value")
plot_metrics <- ggplot(metrics_df_long, aes(x = Threshold, y = Value, color = Metric)) +
geom_line() +
geom_vline(xintercept = best_threshold, linetype = "dashed", color = "gray50") +
geom_text(aes(x = best_threshold, y = 0.5,
label = paste("Best Threshold =", round(best_threshold, 2))),
color = "gray50", angle = 90, hjust = -0.2, size = 2.5) +
labs(title = "Model Performance Metrics by Threshold",
x = "Threshold", y = "Metric Value") +
theme_bw() +
scale_color_brewer(palette = "Set1")
print(plot_metrics)
# Save the plots and metrics if needed
# ggsave("figures/Obj1model_metrics_plotV2.png", plot_metrics)
# write.csv(metrics_df, "figures/Obj1model_metrics_by_thresholdV2.csv", row.names = FALSE)
I found a threshold (0.19), which allowed the model to achieve the largest PPV once a minimum Sensitivity of 0.80 was reached.
There is a trade-off between over-treating or under-treating. We have to consider the cost of someone at high risk having a fracture because they were not diagnosed and treated. Additionally, we need to weigh that against the cost of treating unnecessarily, e.g. inconvenience to the patient, cost of the treatment, and treatment side effects. If we optimize sensitivity with a lower threshold, then we err on the side of over-treating. It seems best to choose a minimum proportion of those at risk that we want to be certain receive treatment and after reaching that minimum try to minimize the cost of treating unnecessarily by making our PPV as high as possible to ensure our positive predictions are correct.
Here, I will get the coefficients for the final Objective 1 additive model for interpretation and set the threshold (using a minimum Sensitivity of at least 0.8 and a moderately high PPV) to get a confusion matrix.
# Fit the model on the training data
final_model <- glm(
formula = fracture ~ priorfrac + weight + height + premeno + momfrac + armassist + smoke + raterisk + fracscore + bonemed + bonemed_fu,
data = training,
family = "binomial"
)
# # Tue's
# final_model <- glm(
# formula = fracture ~ priorfrac + age + weight + height + momfrac + armassist + raterisk + bonemed,
# data = training,
# family = "binomial"
# )
# # Tom's
# final_model <- glm(
# formula = fracture ~ priorfrac + fracscore + weight + height + momfrac + armassist + raterisk + bonemed_fu,
# data = training,
# family = "binomial"
# )
# Predict probabilities on the validation data
pred_probs <- predict(final_model, newdata = validate, type = "response")
# Get the coefficients
coef(summary(final_model))
Estimate Std. Error z value Pr(>|z|)
(Intercept) 3.726543862 3.51975849 1.0587499 0.28971371
priorfracYes 0.594130932 0.33028411 1.7988481 0.07204272
weight 0.009551798 0.01041919 0.9167505 0.35927340
height -0.042148335 0.02239631 -1.8819317 0.05984529
premenoYes 0.277711261 0.34974297 0.7940439 0.42716989
momfracYes 0.866747721 0.38674545 2.2411323 0.02501751
armassistYes -0.060558514 0.41071111 -0.1474480 0.88277845
smokeYes -0.397622978 0.55300158 -0.7190268 0.47212439
rateriskSame 0.101713150 0.33774209 0.3011563 0.76329533
rateriskGreater 0.134134342 0.37673618 0.3560432 0.72180824
fracscore 0.193679485 0.08791951 2.2029180 0.02760053
bonemedYes -0.302086003 0.50537118 -0.5977508 0.55000625
bonemed_fuYes 0.594836643 0.49523255 1.2011259 0.22970237
# Exponentiate the coefficients to get the odds ratios
exp(coef(summary(final_model))[,1])
(Intercept) priorfracYes weight height premenoYes
41.5353081 1.8114560 1.0095976 0.9587276 1.3201050
momfracYes armassistYes smokeYes rateriskSame rateriskGreater
2.3791606 0.9412387 0.6719153 1.1070659 1.1435464
fracscore bonemedYes bonemed_fuYes
1.2137072 0.7392745 1.8127348
# Get the confidence intervals
exp(confint(final_model))
2.5 % 97.5 %
(Intercept) 0.04302943 45026.592845
priorfracYes 0.94609161 3.466895
weight 0.98901171 1.030405
height 0.91686391 1.001397
premenoYes 0.65390473 2.593581
momfracYes 1.10571683 5.072012
armassistYes 0.41801097 2.101255
smokeYes 0.20491608 1.860272
rateriskSame 0.57123593 2.157652
rateriskGreater 0.54511300 2.399972
fracscore 1.02268969 1.445135
bonemedYes 0.27154307 1.995866
bonemed_fuYes 0.68076709 4.814276
# Calculate coefficients and confidence intervals via odds ratios
summary_table <- cbind(
OR = exp(coef(summary(final_model))[, 1]), # Odds Ratios
`Lower CI` = exp(confint(final_model)[, 1]), # Lower 95% CI
`Upper CI` = exp(confint(final_model)[, 2]), # Upper 95% CI
`p-value` = coef(summary(final_model))[, 4] # p-values
)
# Convert to a data frame for cleaner output
summary_df <- as.data.frame(summary_table)
summary_df <- cbind(Variable = rownames(summary_table), summary_df)
rownames(summary_df) <- NULL
# Print the summary table
Obj1Model_table = gt(summary_df)
Obj1Model_table
| Variable | OR | Lower CI | Upper CI | p-value |
|---|---|---|---|---|
| (Intercept) | 41.5353081 | 0.04302943 | 45026.592845 | 0.28971371 |
| priorfracYes | 1.8114560 | 0.94609161 | 3.466895 | 0.07204272 |
| weight | 1.0095976 | 0.98901171 | 1.030405 | 0.35927340 |
| height | 0.9587276 | 0.91686391 | 1.001397 | 0.05984529 |
| premenoYes | 1.3201050 | 0.65390473 | 2.593581 | 0.42716989 |
| momfracYes | 2.3791606 | 1.10571683 | 5.072012 | 0.02501751 |
| armassistYes | 0.9412387 | 0.41801097 | 2.101255 | 0.88277845 |
| smokeYes | 0.6719153 | 0.20491608 | 1.860272 | 0.47212439 |
| rateriskSame | 1.1070659 | 0.57123593 | 2.157652 | 0.76329533 |
| rateriskGreater | 1.1435464 | 0.54511300 | 2.399972 | 0.72180824 |
| fracscore | 1.2137072 | 1.02268969 | 1.445135 | 0.02760053 |
| bonemedYes | 0.7392745 | 0.27154307 | 1.995866 | 0.55000625 |
| bonemed_fuYes | 1.8127348 | 0.68076709 | 4.814276 | 0.22970237 |
# Save the table as a png for ppt
# gtsave(Obj1Model_table, "figures/Obj1Model_table.png")
# Categorical interpretation
# Exponentiate the priorfracYes coefficient to get the odds ratio
exp(coef(final_model)['momfracYes'])
momfracYes
2.379161
# Get the percent change in the odds of those with a history of fracture
(exp(coef(final_model)['momfracYes'])-1)*100
momfracYes
137.9161
# Get the confidence interval
exp(confint(final_model)['momfracYes',])
2.5 % 97.5 %
1.105717 5.072012
(exp(confint(final_model)['momfracYes',])-1)*100
2.5 % 97.5 %
10.57168 407.20116
# Numerical interpretation
# Exponentiate the age coefficient to get the odds ratio
exp(coef(final_model)['fracscore'])
fracscore
1.213707
# Get the percent change in the odds with a 1 year increase in age
(exp(coef(final_model)['fracscore'])-1)*100
fracscore
21.37072
# Get the confidence interval
exp(confint(final_model)['fracscore',])
2.5 % 97.5 %
1.022690 1.445135
(exp(confint(final_model)['fracscore',])-1)*100
2.5 % 97.5 %
2.268969 44.513502
# Change the threshold for 0.8+ sensitivity and highest PPV and create the confusion matrix
threshold <- 0.19
# Ensure consistent factor levels
obj1.preds <- factor(ifelse(pred_probs > threshold, "Yes", "No"), levels = c("No", "Yes"))
validate$fracture <- factor(validate$fracture, levels = c("No", "Yes"))
# Confusion matrix with explicit positive class
conf_matrix <- confusionMatrix(data = obj1.preds, reference = validate$fracture, positive = "Yes")
print(conf_matrix)
Confusion Matrix and Statistics
Reference
Prediction No Yes
No 66 6
Yes 46 31
Accuracy : 0.651
95% CI : (0.5687, 0.7272)
No Information Rate : 0.7517
P-Value [Acc > NIR] : 0.9977
Kappa : 0.3136
Mcnemar's Test P-Value : 6.362e-08
Sensitivity : 0.8378
Specificity : 0.5893
Pos Pred Value : 0.4026
Neg Pred Value : 0.9167
Prevalence : 0.2483
Detection Rate : 0.2081
Detection Prevalence : 0.5168
Balanced Accuracy : 0.7136
'Positive' Class : Yes
Here, I create an effects plot for each of the predictors in the Objective 1 model. Also for curiosity, I create the model with only the significant variables (at the \(\alpha = 0.10\) level) and compare some metrics.
# Just want to compare a model with only the significant predictors
# mymodel.red <- glm(fracture ~ priorfrac + momfrac + bonemed_fu, data=training, family="binomial")
mymodel.red <- glm(fracture ~ priorfrac + momfrac + fracscore + height, data=training, family="binomial")
summary(mymodel.red)
Call:
glm(formula = fracture ~ priorfrac + momfrac + fracscore + height,
family = "binomial", data = training)
Coefficients:
Estimate Std. Error z value Pr(>|z|)
(Intercept) 4.48004 3.35247 1.336 0.1814
priorfracYes 0.67701 0.31129 2.175 0.0296 *
momfracYes 0.92543 0.36526 2.534 0.0113 *
fracscore 0.16721 0.05956 2.807 0.0050 **
height -0.04096 0.02077 -1.972 0.0486 *
---
Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
(Dispersion parameter for binomial family taken to be 1)
Null deviance: 395.31 on 350 degrees of freedom
Residual deviance: 358.57 on 346 degrees of freedom
AIC: 368.57
Number of Fisher Scoring iterations: 4
vif(final_model)
GVIF Df GVIF^(1/(2*Df))
priorfrac 1.426703 1 1.194447
weight 1.756250 1 1.325236
height 1.202478 1 1.096576
premeno 1.115866 1 1.056346
momfrac 1.154333 1 1.074399
armassist 2.421225 1 1.556029
smoke 1.066679 1 1.032802
raterisk 1.360374 2 1.079977
fracscore 2.711928 1 1.646793
bonemed 3.154672 1 1.776140
bonemed_fu 3.253960 1 1.803873
vif(mymodel.red)
priorfrac momfrac fracscore height
1.284425 1.052896 1.282773 1.031656
hoslem.test(final_model$y,fitted(final_model))
Hosmer and Lemeshow goodness of fit (GOF) test
data: final_model$y, fitted(final_model)
X-squared = 3.454, df = 8, p-value = 0.9027
hoslem.test(mymodel.red$y,fitted(mymodel.red))
Hosmer and Lemeshow goodness of fit (GOF) test
data: mymodel.red$y, fitted(mymodel.red)
X-squared = 11.97, df = 8, p-value = 0.1526
AIC(final_model)
[1] 379.9978
AIC(mymodel.red)
[1] 368.5736
# Make effects plots for each predictor one at a time
# Obj1 model: fracture ~ priorfrac + weight + height + premeno + momfrac + armassist + smoke + raterisk + fracscore + bonemed + bonemed_fu
plot_model(final_model, type="pred",terms=c("priorfrac"))
plot_model(final_model,type="pred",terms=c("weight"))
plot_model(final_model,type="pred",terms=c("height"))
plot_model(final_model,type="pred",terms=c("premeno"))
plot_model(final_model, type="pred",terms=c("momfrac"))
plot_model(final_model, type="pred",terms=c("armassist"))
plot_model(final_model,type="pred",terms=c("smoke"))
plot_model(final_model,type="pred",terms=c("raterisk"))
plot_model(final_model,type="pred",terms=c("fracscore"))
plot_model(final_model,type="pred",terms=c("bonemed"))
plot_model(final_model,type="pred",terms=c("bonemed_fu"))
# Just in case for the two that are being interpreted
plot_model(final_model,type="pred",terms=c("fracscore", "momfrac"))
# # Function to save a single effects plot
# save_effects_plot <- function(model, term, filename) {
# # Generate the plot
# effect_plot <- plot_model(model, type = "pred", terms = term)
#
# # Save the plot as a PNG file
# ggsave(filename, plot = effect_plot, width = 8, height = 6, dpi = 300)
# }
#
# # Save effects plots for each predictor in the final model
# save_effects_plot(final_model, "priorfrac", "figuresKH/effectKH_priorfrac.png")
# save_effects_plot(final_model, "weight", "figuresKH/effectKH_weight.png")
# save_effects_plot(final_model, "height", "figuresKH/effectKH_height.png")
# save_effects_plot(final_model, "premeno", "figuresKH/effectKH_premeno.png")
# save_effects_plot(final_model, "momfrac", "figuresKH/effectKH_momfrac.png")
# save_effects_plot(final_model, "armassist", "figuresKH/effectKH_armassist.png")
# save_effects_plot(final_model, "smoke", "figuresKH/effectKH_smoke.png")
# save_effects_plot(final_model, "raterisk", "figuresKH/effectKH_raterisk.png")
# save_effects_plot(final_model, "fracscore", "figuresKH/effectKH_fracscore.png")
# save_effects_plot(final_model, "bonemed", "figuresKH/effectKH_bonemed.png")
# save_effects_plot(final_model, "bonemed_fu", "figuresKH/effectKH_bonemed_fu.png")
# save_effects_plot(final_model, c("fracscore", "momfrac"), "figuresKH/dualeffectsKH_fracscore_momfrac.png")
Objective 1 Findings
Model fitting approach:
For intuition on the most important predictors, GLMNET, LASSO and
stepwise feature selection methods were used with 10-fold
cross-validation on the training split (70/30). Both the logloss error
and the AUROC metrics were used for final model selection. The relative
frequency of each of the predictors among the feature selection models
helped inform the next steps. A manual process was used to compare
models formulated with the most frequent predictors from feature
selection, but mainly with intuition from EDA. Candidate models were
selected by adjusting each model’s prediction threshold based on a
constant PPV and comparing metrics including AUROC, sensitivity and
specificity. Only models with reasonable VIF values and insignificant
Hosmer-Lemeshow p-values were selected.
Selected variables:
fracscore, and priorfrac were included because
visual evidence from EDA suggested that those three variables seemed
most likely to influence the probability of fracture.
priorfrac was present in all three feature selection models
and fracscore in two of three. age was
included in one feature selection model and did seem important in EDA
but age and fracscore were highly correlated
and resulted in high VIFs, so age was excluded.
armassist, momfrac, raterisk,
height, bonemed, and bonemed_fu
also seemed informative based on EDA. momfrac and
height were in two and
armassist and bonemed_fu were in one of three feature
selection models. Multicollinearity was a concern among the medication
variables. I opted to exclude bonetreat because it was an
interaction term of the other two. bmi was excluded because
it is an interaction of weight and height. weight,
premeno and smoke seemed less important from
EDA and in feature selection but manually adding them to candidate
models improved their performance metrics.
The formula of the final Objective 1 model is:
fracture ~ priorfrac + weight + height + premeno + momfrac + armassist + smoke + raterisk + fracscore + bonemed + bonemed_fu.
The logistic model is:
\[
\log \Big( \frac{\pi}{1-\pi} \Big) = 3.73 + 0.59\text{priorfracYes} +
0.01\text{weight} - 0.04\text{height} + 0.28\text{premenoYes} +
0.87\text{momfracYes} - 0.06\text{armassistYes} - \\ 0.40\text{smokeYes}
+ 0.10\text{rateriskSame} + 0.13\text{rateriskGreater} +
0.19\text{fracscore} -
0.30\text{bonemedYes} + 0.59\text{bonemed_fuYes}
\]
Interpretation of coefficients:
There is sufficient evidence at the \(\alpha = 0.05\) level of significance to suggest that for women with a maternal history of hip fracture, the odds of a fracture within one year of starting the ostoeporosis study are 137.9% higher than women whose mom’s did not have a fracture, receiving the same medication treatments, of the same height weight, relative age of menopause, personal history of fracture, smoking status, fracture score, use of assistance standing up and self accessed risk (an increase of a factor of \(e^{2.3791}\), p-value = 0.0250), 95% CI: [10.6%, 407.2%].
There is also sufficient evidence at the \(\alpha = 0.05\) level of significance to suggest that for every one unit increase in fracscore, the odds of fracture increase by 21.4% for women receiving the same medication treatments, of the same height, weight, relative age of menopause, history of fracture and maternal fracture, smoking status, use of assistance standing up and self accessed risk, (a factor of \(e^{1.214}\), p-value = 0.0276), 95% CI: [2.3%, 44.5%].
Practical versus statistical significance:
When accounting for all of the variables in the model, only two were
statistically significant at the \(\alpha =
0.05\) level in explaining the relationship with fracture,
momfrac and fracscore. The significance level
of those predictors and two others height and
priorfrac increased when the other seven variables were not
included in the model. However including the others decreased the error
and increased the ability to predict on the validation set, highlighting
their practical significance in explaining the relationship with
fracture.
The next goal is to improve prediction performance metrics by producing a logistic regression model with added complexity, a discriminant analysis model, and a non-parametric model.
In developing a complex logistic regression model for Objective 2, I referred back to EDA for possible interactions, add those to the final model from Objective 1 and then performed feature selection. Again, the stepwise feature selection code is commented out to keep the output shorter.
# Objective 1 model with added interactions based on EDA
formula = fracture ~ priorfrac + weight + height + premeno + momfrac + armassist + smoke + raterisk + fracscore + bonemed + bonemed_fu + priorfrac:premeno + priorfrac:momfrac + priorfrac:bonemed + priorfrac:bonemed_fu + priorfrac:fracscore + priorfrac:weight + momfrac:armassist + momfrac:raterisk + momfrac:bonemed + momfrac:bonemed_fu + momfrac:smoke + smoke:height + smoke:raterisk + smoke:bonemed + smoke:bonemed_fu + raterisk:premeno + raterisk:bonemed + raterisk:bonemed_fu + raterisk:weight + weight:bonemed + weight:bonemed_fu
# Here, I use penalized regression for feature selection and stepwise feature selection after adding interaction terms. I'd like to get a sense of the important predictors.
# logLoss error metric
fitControl<-trainControl(method="repeatedcv",number=10,repeats=1,classProbs=TRUE, summaryFunction=mnLogLoss)
# GLMNET
set.seed(1234)
glmnet.fit<-train(formula,
data=training,
method="glmnet",
trControl=fitControl,
metric="logLoss"
)
coef(glmnet.fit$finalModel,glmnet.fit$finalModel$lambdaOpt)
40 x 1 sparse Matrix of class "dgCMatrix"
s=0.02283267
(Intercept) 1.476476766
priorfracYes .
weight .
height -0.021304276
premenoYes .
momfracYes 0.659095437
armassistYes .
smokeYes -0.110653661
rateriskSame .
rateriskGreater .
fracscore 0.134976766
bonemedYes .
bonemed_fuYes .
priorfracYes:premenoYes .
priorfracYes:momfracYes .
priorfracYes:bonemedYes .
priorfracYes:bonemed_fuYes .
priorfracYes:fracscore .
priorfracYes:weight 0.006283359
momfracYes:armassistYes .
momfracYes:rateriskSame .
momfracYes:rateriskGreater .
momfracYes:bonemedYes -0.414588020
momfracYes:bonemed_fuYes .
momfracYes:smokeYes .
height:smokeYes .
smokeYes:rateriskSame .
smokeYes:rateriskGreater .
smokeYes:bonemedYes 1.738869292
smokeYes:bonemed_fuYes .
premenoYes:rateriskSame .
premenoYes:rateriskGreater .
rateriskSame:bonemedYes .
rateriskGreater:bonemedYes .
rateriskSame:bonemed_fuYes 0.332737131
rateriskGreater:bonemed_fuYes .
weight:rateriskSame .
weight:rateriskGreater 0.001438714
weight:bonemedYes .
weight:bonemed_fuYes 0.001807696
# includes height, momfrac, smoke, fracscore, priorfrac:weight, momfrac:bonemed, smoke:bonemed, raterisk:bonemed_fu, weight:raterisk, weight_bonemed_fu
# LASSO
set.seed(1234)
lasso.fit<-train(formula,
data=training,
method="glmnet",
trControl=fitControl,
metric="logLoss",
tuneGrid = expand.grid(alpha = 1, # Lasso penalty
lambda = seq(0.01, 10, length = 200)) # Range of lambda values to test
)
coef(lasso.fit$finalModel,lasso.fit$finalModel$lambdaOpt)
40 x 1 sparse Matrix of class "dgCMatrix"
s=0.01
(Intercept) 3.498543544
priorfracYes .
weight .
height -0.035913319
premenoYes 0.343514036
momfracYes 1.058698672
armassistYes .
smokeYes -0.614588877
rateriskSame .
rateriskGreater .
fracscore 0.164349002
bonemedYes .
bonemed_fuYes .
priorfracYes:premenoYes -0.718306487
priorfracYes:momfracYes 0.005806747
priorfracYes:bonemedYes -0.313078405
priorfracYes:bonemed_fuYes .
priorfracYes:fracscore .
priorfracYes:weight 0.009992061
momfracYes:armassistYes .
momfracYes:rateriskSame 0.049235185
momfracYes:rateriskGreater .
momfracYes:bonemedYes -1.280908089
momfracYes:bonemed_fuYes .
momfracYes:smokeYes -0.390256510
height:smokeYes .
smokeYes:rateriskSame .
smokeYes:rateriskGreater .
smokeYes:bonemedYes 3.325398975
smokeYes:bonemed_fuYes .
premenoYes:rateriskSame .
premenoYes:rateriskGreater .
rateriskSame:bonemedYes .
rateriskGreater:bonemedYes -0.127412189
rateriskSame:bonemed_fuYes 0.607607673
rateriskGreater:bonemed_fuYes .
weight:rateriskSame .
weight:rateriskGreater 0.004563678
weight:bonemedYes .
weight:bonemed_fuYes 0.003436180
# height + premeno + momfrac + smoke + fracscore + priorfrac:premeno + priorfrac:momfrac + priorfrac:bonemed + priorfrac:weight + momfrac:raterisk + momfrac:bonemed + momfrac:smoke + smoke:bonemed + raterisk:bonemed + raterisk:bonemed_fu + weight:raterisk + weight:bonemed_fu
#STEPWISE
set.seed(1234)
# Step.fit<-train(formula,
# data=training,
# method="glmStepAIC",
# trControl=fitControl,
# metric="logLoss")
# coef(Step.fit$finalModel)
# priorfrac, height, premeno, momfrac, smoke, fracscore, bonemed_fu, priorfrac:premeno, priorfrac:bonemed, momfrac:bonemed, height:smoke, smoke:raterisk, smoke:bonemed, raterisk:bonemed, raterisk:bonemed_fu, weight:bonemed
Here are the variables in frequency of appearance in feature
selection models:
3: height, momfrac, smoke, fracscore, momfrac:bonemed, smoke:bonemed,
raterisk:bonemed_fu
2: premeno, priorfrac:premeno, priorfrac:bonemed, priorfrac:weight,
weight:raterisk, weight:bonemed_fu, raterisk:bonemed
1: priorfrac, bonemed_fu, priorfrac:momfrac, momfrac:raterisk,
momfrac:smoke, height:smoke, smoke:raterisk, weight:bonemed
Here, I do the same thing as above including polynomial terms also. I refer back to EDA for possible polynomial trends (in weight), add those to the final model from Objective 1, and then perform feature selection. Again, the stepwise feature selection code is commented out to keep the output shorter.
# Objective 1 model with added interactions and polynomial terms based on EDA
formula2 = fracture ~ priorfrac + weight + poly(weight, 2) + height + premeno + momfrac + armassist + smoke + raterisk + fracscore + bonemed + bonemed_fu + priorfrac:premeno + priorfrac:momfrac + priorfrac:bonemed + priorfrac:bonemed_fu + priorfrac:fracscore + priorfrac:weight + priorfrac:poly(weight, 2) + momfrac:armassist + momfrac:raterisk + momfrac:bonemed + momfrac:bonemed_fu + momfrac:smoke + smoke:height + smoke:raterisk + smoke:bonemed + smoke:bonemed_fu + raterisk:premeno + raterisk:bonemed + raterisk:bonemed_fu + raterisk:weight + raterisk:poly(weight, 2) + weight:bonemed + poly(weight, 2):bonemed + weight:bonemed_fu + poly(weight, 2):bonemed_fu
# Here, I use penalized regression for feature selection and stepwise feature selection after including the added interactions above and adding polynomial terms for weight. I'd like to get a sense of the important predictors.
# logLoss error metric
fitControl<-trainControl(method="repeatedcv",number=10,repeats=1,classProbs=TRUE, summaryFunction=mnLogLoss)
# GLMNET
set.seed(1234)
glmnet.fit<-train(formula2,
data=training,
method="glmnet",
trControl=fitControl,
metric="logLoss"
)
coef(glmnet.fit$finalModel,glmnet.fit$finalModel$lambdaOpt)
52 x 1 sparse Matrix of class "dgCMatrix"
s=0.02283267
(Intercept) 2.0731617473
priorfracYes .
weight .
poly(weight, 2)1 .
poly(weight, 2)2 -0.0009882277
height -0.0249281452
premenoYes .
momfracYes 0.6755778494
armassistYes .
smokeYes -0.0962459861
rateriskSame .
rateriskGreater .
fracscore 0.1385583904
bonemedYes .
bonemed_fuYes .
priorfracYes:premenoYes .
priorfracYes:momfracYes .
priorfracYes:bonemedYes .
priorfracYes:bonemed_fuYes .
priorfracYes:fracscore .
priorfracYes:weight 0.0056124645
priorfracYes:poly(weight, 2)1 .
priorfracYes:poly(weight, 2)2 .
momfracYes:armassistYes .
momfracYes:rateriskSame .
momfracYes:rateriskGreater .
momfracYes:bonemedYes -0.3756747469
momfracYes:bonemed_fuYes .
momfracYes:smokeYes .
height:smokeYes .
smokeYes:rateriskSame .
smokeYes:rateriskGreater .
smokeYes:bonemedYes 1.8299195816
smokeYes:bonemed_fuYes .
premenoYes:rateriskSame .
premenoYes:rateriskGreater .
rateriskSame:bonemedYes .
rateriskGreater:bonemedYes .
rateriskSame:bonemed_fuYes 0.2622511787
rateriskGreater:bonemed_fuYes .
weight:rateriskSame .
weight:rateriskGreater 0.0008896317
poly(weight, 2)1:rateriskSame .
poly(weight, 2)2:rateriskSame .
poly(weight, 2)1:rateriskGreater 3.1471920371
poly(weight, 2)2:rateriskGreater .
weight:bonemedYes .
poly(weight, 2)1:bonemedYes .
poly(weight, 2)2:bonemedYes 0.0169743280
weight:bonemed_fuYes 0.0023961222
poly(weight, 2)1:bonemed_fuYes .
poly(weight, 2)2:bonemed_fuYes .
# includes poly(weight, 2), height, momfrac, smoke, fracscore, priorfrac:weight, momfrac:bonemed, smoke:bonemed, raterisk:bonemed_fu, weight:raterisk, poly(weight, 2):raterisk, poly(weight, 2):bonemed, weight_bonemed_fu
# LASSO
set.seed(1234)
lasso.fit<-train(formula2,
data=training,
method="glmnet",
trControl=fitControl,
metric="logLoss",
tuneGrid = expand.grid(alpha = 1, # Lasso penalty
lambda = seq(0.01, 10, length = 200)) # Range of lambda values to test
)
coef(lasso.fit$finalModel,lasso.fit$finalModel$lambdaOpt)
52 x 1 sparse Matrix of class "dgCMatrix"
s=0.06020101
(Intercept) -1.525625420
priorfracYes .
weight .
poly(weight, 2)1 .
poly(weight, 2)2 .
height .
premenoYes .
momfracYes .
armassistYes .
smokeYes .
rateriskSame .
rateriskGreater .
fracscore 0.096590670
bonemedYes .
bonemed_fuYes .
priorfracYes:premenoYes .
priorfracYes:momfracYes .
priorfracYes:bonemedYes .
priorfracYes:bonemed_fuYes .
priorfracYes:fracscore .
priorfracYes:weight 0.002845022
priorfracYes:poly(weight, 2)1 .
priorfracYes:poly(weight, 2)2 .
momfracYes:armassistYes .
momfracYes:rateriskSame .
momfracYes:rateriskGreater .
momfracYes:bonemedYes .
momfracYes:bonemed_fuYes .
momfracYes:smokeYes .
height:smokeYes .
smokeYes:rateriskSame .
smokeYes:rateriskGreater .
smokeYes:bonemedYes 0.100878082
smokeYes:bonemed_fuYes .
premenoYes:rateriskSame .
premenoYes:rateriskGreater .
rateriskSame:bonemedYes .
rateriskGreater:bonemedYes .
rateriskSame:bonemed_fuYes .
rateriskGreater:bonemed_fuYes .
weight:rateriskSame .
weight:rateriskGreater .
poly(weight, 2)1:rateriskSame .
poly(weight, 2)2:rateriskSame .
poly(weight, 2)1:rateriskGreater .
poly(weight, 2)2:rateriskGreater .
weight:bonemedYes .
poly(weight, 2)1:bonemedYes .
poly(weight, 2)2:bonemedYes .
weight:bonemed_fuYes .
poly(weight, 2)1:bonemed_fuYes .
poly(weight, 2)2:bonemed_fuYes .
# fracscore, priorfrac:weight, smoke:bonemed
#STEPWISE
# set.seed(1234)
# #note CV and error metric are not really used here, but logLoss is reported for the final model.
# Step.fit<-train(formula2,
# data=training,
# method="glmStepAIC",
# trControl=fitControl,
# metric="logLoss")
# coef(Step.fit$finalModel)
# poly(weight, 2), height, premeno, smoke, fracscore, bonemed_fu, priorfrac:premeno, priorfrac:bonemed, priorfrac:weight, priorfrac:poly(weight, 2), momfrac:raterisk, momfrac:bonemed, height:smoke, smoke:raterisk, smoke:bonemed, raterisk:bonemed, raterisk:bonemed_fu, weight:bonemed, poly(weight, 2):bonemed
Here, I fit a bunch of models based on the feature selection steps above. I compare the number of predictors, logloss error from both the k-fold cross-validation and validation set, AUROC, sensitivity, specificity, PPV, NPV, and prevalence based on a threshold calculated to achieve a set PPV to make comparison of the other metrics easier.
# List of model formulas
models <- list(
obj1fit = fracture ~ priorfrac + weight + height + premeno + momfrac + armassist + smoke + raterisk + fracscore + bonemed + bonemed_fu,
# intfit1 = fracture ~ height + momfrac + smoke + fracscore + priorfrac:weight + momfrac:bonemed + smoke:bonemed + raterisk:bonemed_fu + weight:raterisk + weight:bonemed_fu,
# intfit2_orig = fracture ~ height + premeno + momfrac + smoke + fracscore + priorfrac:premeno + priorfrac:momfrac + priorfrac:bonemed + priorfrac:weight + momfrac:raterisk + momfrac:bonemed + momfrac:smoke + smoke:bonemed + raterisk:bonemed + raterisk:bonemed_fu + weight:raterisk + weight:bonemed_fu,
# intfit2_modA = fracture ~ height + premeno + momfrac + smoke + fracscore + priorfrac:premeno + priorfrac:momfrac + priorfrac:bonemed + priorfrac:weight + momfrac:raterisk + momfrac:bonemed + momfrac:smoke + smoke:bonemed + raterisk:bonemed,
# intfit2_modB = fracture ~ height + premeno + momfrac + smoke + fracscore + priorfrac:premeno + priorfrac:bonemed + priorfrac:weight + momfrac:raterisk + momfrac:bonemed + momfrac:smoke + smoke:bonemed + raterisk:bonemed,
# intfit2b = fracture ~ height + premeno + momfrac + smoke + fracscore + priorfrac:bonemed + priorfrac:weight + momfrac:raterisk + momfrac:bonemed + momfrac:smoke + smoke:bonemed + raterisk:bonemed_fu,
intfit2 = fracture ~ priorfrac + bonemed + bonemed_fu + height + weight + raterisk + premeno + momfrac + smoke + fracscore + priorfrac:bonemed + priorfrac:weight + momfrac:raterisk + momfrac:bonemed + momfrac:smoke + smoke:bonemed + raterisk:bonemed_fu,
# intfit3 = fracture ~ priorfrac + height + premeno + momfrac + smoke + fracscore + bonemed_fu + priorfrac:premeno + priorfrac:bonemed + momfrac:bonemed + height:smoke + smoke:raterisk + smoke:bonemed + raterisk:bonemed + raterisk:bonemed_fu + weight:bonemed, # borderline hoslem
# intfit4 = fracture ~ height + momfrac + smoke + fracscore + premeno + priorfrac + bonemed_fu + momfrac:bonemed + priorfrac:premeno + priorfrac:bonemed + priorfrac:momfrac + momfrac:smoke + height:smoke, # low hoslem
# polyfit1 = fracture ~ poly(weight, 2) + height + momfrac + smoke + fracscore + priorfrac:weight + momfrac:bonemed + smoke:bonemed + raterisk:bonemed_fu + weight:raterisk + poly(weight, 2):raterisk + poly(weight, 2):bonemed + weight:bonemed_fu, # low AUROC
# polyfit2 = fracture ~ fracscore + priorfrac:weight + smoke:bonemed, # low AUROC
# polyfit3 = fracture ~ poly(weight, 2) + height + premeno + smoke + fracscore + bonemed_fu + priorfrac:premeno + priorfrac:bonemed + priorfrac:weight + priorfrac:poly(weight, 2) + momfrac:raterisk + momfrac:bonemed + height:smoke + smoke:raterisk + smoke:bonemed + raterisk:bonemed + raterisk:bonemed_fu + weight:bonemed + poly(weight, 2):bonemed, # high logloss
polyfit4 = fracture ~ fracscore + weight + poly(weight, 2) + height + smoke + momfrac + premeno + bonemed + priorfrac + priorfrac:premeno + priorfrac:bonemed + priorfrac:weight + priorfrac:poly(weight, 2) + smoke:bonemed + momfrac:bonemed + weight:bonemed + poly(weight, 2):bonemed
)
# Set seed for reproducibility
set.seed(1234)
# Set up trainControl for repeated CV
fitControl <- trainControl(
method = "repeatedcv",
number = 10,
repeats = 1,
classProbs = TRUE,
summaryFunction = mnLogLoss
)
# Create a list to store CV results for each model
cv_results <- lapply(names(models), function(model_name) {
model_formula <- models[[model_name]]
train_result <- train(
model_formula,
data = training,
method = "glm",
family = "binomial",
trControl = fitControl,
metric = "logLoss"
)
# Return the train object and minimum log loss
list(model = train_result, min_logloss_cv = min(train_result$results$logLoss, na.rm = TRUE))
})
# Safely calculate confusion matrix metrics
calc_confusion_metrics <- function(confusion) {
sensitivity <- ifelse("Yes" %in% colnames(confusion),
confusion["Yes", "Yes"] / sum(confusion[, "Yes"]), NA)
specificity <- ifelse("No" %in% colnames(confusion),
confusion["No", "No"] / sum(confusion[, "No"]), NA)
ppv <- ifelse("Yes" %in% rownames(confusion),
confusion["Yes", "Yes"] / sum(confusion["Yes", ]), NA)
npv <- ifelse("No" %in% rownames(confusion),
confusion["No", "No"] / sum(confusion["No", ]), NA)
accuracy <- sum(diag(confusion)) / sum(confusion)
misclassification_rate <- 1 - accuracy
f1_score <- ifelse(!is.na(ppv) & !is.na(sensitivity) & (ppv + sensitivity) > 0,
2 * (ppv * sensitivity) / (ppv + sensitivity), NA)
list(
sensitivity = sensitivity,
specificity = specificity,
ppv = ppv,
npv = npv,
f1_score = f1_score,
misclassification_rate = misclassification_rate
)
}
# Fit models and calculate metrics
results <- lapply(seq_along(models), function(i) {
model_name <- names(models)[i]
model_formula <- models[[model_name]]
# Fit the model using glm
model <- glm(model_formula, data = training, family = "binomial")
# Predict probabilities on the validation set
pred_probs <- predict(model, newdata = validate, type = "response")
# Calculate ROC and AUC
roc_obj <- pROC::roc(response = validate$fracture, predictor = pred_probs, levels = c("No", "Yes"))
auc_value <- as.numeric(pROC::auc(roc_obj))
# Calculate Log Loss on the validation set
logloss_validation <- Metrics::logLoss(actual = ifelse(validate$fracture == "Yes", 1, 0), predicted = pred_probs)
# Retrieve Log Loss from CV
cv_logloss <- cv_results[[i]]$min_logloss_cv
# Determine the best threshold for a desired PPV
desired_ppv <- 0.50
all_thresholds <- seq(0, 1, by = 0.01)
threshold_metrics <- sapply(all_thresholds, function(threshold) {
predicted_class <- ifelse(pred_probs >= threshold, "Yes", "No")
confusion <- table(Predicted = predicted_class, Actual = validate$fracture)
ppv <- ifelse("Yes" %in% rownames(confusion),
confusion["Yes", "Yes"] / sum(confusion["Yes", ]), NA)
abs(ppv - desired_ppv)
})
best_threshold <- all_thresholds[which.min(threshold_metrics)]
# Metrics at the best threshold
predicted_class <- ifelse(pred_probs >= best_threshold, "Yes", "No")
confusion <- table(Predicted = predicted_class, Actual = validate$fracture)
metrics <- calc_confusion_metrics(confusion)
# Perform the Hosmer-Lemeshow test
n_groups <- min(10, floor(nrow(training) / 5))
hoslem_test <- tryCatch(hoslem.test(model$y, fitted(model), g = n_groups), error = function(e) NULL)
hoslem_pval <- if (!is.null(hoslem_test)) hoslem_test$p.value else NA
# Return all metrics
list(
Model = model_name,
AUROC = auc_value,
LogLoss_Validation = logloss_validation,
LogLoss_CV = cv_logloss,
HoslemPValue = hoslem_pval,
Sensitivity = metrics$sensitivity,
Specificity = metrics$specificity,
PPV = metrics$ppv,
NPV = metrics$npv,
F1_Score = metrics$f1_score,
MisclassificationRate = metrics$misclassification_rate,
Threshold = best_threshold
)
})
# Combine results into a data frame
results_df <- do.call(rbind, lapply(results, as.data.frame))
# Sort and display results
results_df_sorted <- results_df[order(-results_df$AUROC, results_df$LogLoss_Validation), ]
print(results_df_sorted)
Model AUROC LogLoss_Validation LogLoss_CV HoslemPValue Sensitivity
2 intfit2 0.7625483 0.5000117 0.6159578 0.3592725 0.6756757
1 obj1fit 0.7434846 0.5111676 0.5471671 0.9027294 0.5135135
3 polyfit4 0.7379344 0.5339562 0.5336331 0.4554668 0.7567568
Specificity PPV NPV F1_Score MisclassificationRate Threshold
2 0.7767857 0.5 0.8787879 0.5747126 0.2483221 0.29
1 0.8303571 0.5 0.8378378 0.5066667 0.2483221 0.32
3 0.7500000 0.5 0.9032258 0.6021505 0.2483221 0.27
After adjusting the model’s prediction thresholds based on a constant PPV, two complex models seems to perform pretty well, with a comparatively low logloss, high AUROC, high sensitivity. Additionally, the Hosmer-Lemeshow test doesn’t suggest evidence against goodness of fit.
Objective 1 Model:
fracture ~ priorfrac + weight + height + premeno + momfrac + armassist + smoke + raterisk + fracscore + bonemed + bonemed_fu
Obj 2 Model with Interactions (intfit2):
fracture ~ priorfrac + weight + height + premeno + momfrac + smoke + raterisk + fracscore + bonemed + bonemed_fu + priorfrac:bonemed + priorfrac:weight + momfrac:raterisk + momfrac:bonemed + momfrac:smoke + smoke:bonemed + raterisk:bonemed_fu
Obj 2 Model with Interactions and Polynomials (polyfit4):
fracture ~ fracscore + weight + poly(weight, 2) + height + smoke + momfrac + premeno + bonemed + priorfrac + priorfrac:premeno + priorfrac:bonemed + priorfrac:weight + priorfrac:poly(weight, 2) + smoke:bonemed + momfrac:bonemed + weight:bonemed + poly(weight, 2):bonemed
Here, I plot ROC curves of the Objective 1 model and the Objective 2 candidate models.
# Fit the models
# Tue's
tue <- glm(
formula = fracture ~ priorfrac + age + weight + height + momfrac + armassist + raterisk + bonemed,
data = training,
family = "binomial"
)
# Tom's
tom <- glm(
formula = fracture ~ priorfrac + fracscore + weight + height + momfrac + armassist + raterisk + bonemed_fu,
data = training,
family = "binomial"
)
obj1fit = glm(fracture ~ priorfrac + weight + height + premeno + momfrac + armassist + smoke + raterisk + fracscore + bonemed + bonemed_fu, data = training, family = "binomial")
intfit2 = glm(fracture ~ priorfrac + weight + height + premeno + momfrac + smoke + raterisk + fracscore + bonemed + bonemed_fu + priorfrac:bonemed + priorfrac:weight + momfrac:raterisk + momfrac:bonemed + momfrac:smoke + smoke:bonemed + raterisk:bonemed_fu, data = training, family = "binomial")
polyfit4 = glm(fracture ~ fracscore + weight + poly(weight, 2) + height + smoke + momfrac + premeno + bonemed + priorfrac + priorfrac:premeno + priorfrac:bonemed + priorfrac:weight + priorfrac:poly(weight, 2) + smoke:bonemed + momfrac:bonemed + weight:bonemed + poly(weight, 2):bonemed, data = training, family = "binomial")
# Predicting probabilities on the validation data
tue.predprobs <- predict(tue, validate, type = "response")
tom.predprobs <- predict(tom, validate, type = "response")
obj1fit.predprobs <- predict(obj1fit, validate, type = "response")
intfit2.predprobs <- predict(intfit2, validate, type = "response")
polyfit4.predprobs <- predict(polyfit4, validate, type = "response")
# Calculate ROC and AUROC
tue.roc <- roc(response = validate$fracture, predictor = tom.predprobs, levels = c("No", "Yes"))
tom.roc <- roc(response = validate$fracture, predictor = tue.predprobs, levels = c("No", "Yes"))
obj1fit.roc <- roc(response = validate$fracture, predictor = obj1fit.predprobs, levels = c("No", "Yes"))
intfit2.roc <- roc(response = validate$fracture, predictor = intfit2.predprobs, levels = c("No", "Yes"))
polyfit4.roc <- roc(response = validate$fracture, predictor = polyfit4.predprobs, levels = c("No", "Yes"))
# Extract AUROC values
tue.auc <- round(auc(obj1fit.roc), 3)
tom.auc <- round(auc(obj1fit.roc), 3)
obj1fit.auc <- round(auc(obj1fit.roc), 3)
intfit2.auc <- round(auc(intfit2.roc), 3)
polyfit4.auc <- round(auc(polyfit4.roc), 3)
# Save plot to a PNG file
# png("figures/ROC_Curves_with_AUROC.png", width = 800, height = 600)
# Plot ROC curves
plot(obj1fit.roc, col = "gray40", lwd = 2, lty = 1, main = "ROC Curves for Candidate Models") #print.thres="best"
plot(tue.roc, col = "purple3", lwd = 2, lty = 1, add = TRUE)
plot(tom.roc, col = "green4", lwd = 2, lty = 1, add = TRUE)
plot(intfit2.roc, col = "red3", lwd = 2, lty = 1, add = TRUE)
plot(polyfit4.roc, col = "blue2", lwd = 2, lty = 1, add = TRUE)
# Add legend with AUROC values
legend("bottomright",
legend = c(
paste("Obj1 (AUROC =", obj1fit.auc, ")"),
paste("Tue (AUROC =", tue.auc, ")"),
paste("Tom (AUROC =", tom.auc, ")"),
paste("Int (AUROC =", intfit2.auc, ")"),
paste("Poly (AUROC =", polyfit4.auc, ")")
),
col = c("gray40", "purple3","green4", "red3", "blue2"),
lwd = 2, lty = c(1, 1, 1, 1, 1), cex = 1, xpd = TRUE, horiz = FALSE)
# Close the graphics device
# dev.off()
Here as before, I want to optimize the threshold of the chosen model(s) to achieve the highest PPV possible with a specificity of at least 0.80. I want to ensure that at least 80% of the patients at high risk of fractures receive treatment and minimize the number of women at low risk that are treated unnecessarily.
# Fit the model on the training data
final_model <- glm(
formula = fracture ~ priorfrac + weight + height + premeno + momfrac + smoke + raterisk + fracscore + bonemed + bonemed_fu + priorfrac:bonemed + priorfrac:weight + momfrac:raterisk + momfrac:bonemed + momfrac:smoke + smoke:bonemed + raterisk:bonemed_fu,
data = training,
family = "binomial"
)
# Predict probabilities on the validation data
pred_probs <- predict(final_model, newdata = validate, type = "response")
# Define a target Sensitivity threshold
target_sensitivity <- 0.8 # Adjust to 0.75, 0.85, or desired value
# Function to calculate performance metrics for a given threshold
calculate_metrics <- function(threshold, pred_probs, actual) {
# Predicted classes based on threshold
pred_class <- ifelse(pred_probs > threshold, "Yes", "No")
# Ensure factors have the same levels
actual <- factor(actual, levels = c("No", "Yes"))
pred_class <- factor(pred_class, levels = c("No", "Yes"))
# Generate the confusion matrix
cm <- table(Predicted = pred_class, Actual = actual)
# Extract counts from the confusion matrix
tp <- cm["Yes", "Yes"] # True Positives
fp <- cm["Yes", "No"] # False Positives
tn <- cm["No", "No"] # True Negatives
fn <- cm["No", "Yes"] # False Negatives
# Safely calculate metrics
ppv <- ifelse((tp + fp) > 0, tp / (tp + fp), NA) # Positive Predictive Value (Precision)
sensitivity <- ifelse((tp + fn) > 0, tp / (tp + fn), NA) # Sensitivity (Recall)
specificity <- ifelse((tn + fp) > 0, tn / (tn + fp), NA) # Specificity
accuracy <- ifelse((tp + tn + fp + fn) > 0, (tp + tn) / (tp + tn + fp + fn), NA) # Accuracy
# Return metrics as a data frame
metrics <- data.frame(
Threshold = threshold,
Sensitivity = sensitivity,
Specificity = specificity,
PPV = ppv,
Accuracy = accuracy
)
return(metrics)
}
# Generate metrics for a refined range of thresholds
thresholds <- seq(0.1, 0.4, by = 0.01) # Narrowed range
metrics_list <- lapply(thresholds, calculate_metrics, pred_probs = pred_probs, actual = validate$fracture)
metrics_df <- do.call(rbind, metrics_list)
# Filter metrics for Sensitivity >= target_sensitivity
valid_metrics <- metrics_df[metrics_df$Sensitivity >= target_sensitivity, ]
# Ensure there are valid thresholds
if (nrow(valid_metrics) == 0) {
stop("No thresholds meet the target Sensitivity. Try lowering target_sensitivity.")
}
# Find the best threshold (maximizing PPV)
best_threshold <- valid_metrics$Threshold[which.max(valid_metrics$PPV)]
best_metrics <- valid_metrics[valid_metrics$Threshold == best_threshold, ]
# Output the best threshold and its metrics
cat("Best threshold with Sensitivity >=", target_sensitivity, ":\n")
Best threshold with Sensitivity >= 0.8 :
print(best_metrics)
Threshold Sensitivity Specificity PPV Accuracy
11 0.2 0.8108108 0.6428571 0.4285714 0.6845638
# Plot metrics to visualize trade-offs
metrics_df_long <- melt(metrics_df, id.vars = "Threshold", variable.name = "Metric", value.name = "Value")
plot_metrics <- ggplot(metrics_df_long, aes(x = Threshold, y = Value, color = Metric)) +
geom_line() +
geom_vline(xintercept = best_threshold, linetype = "dashed", color = "gray50") +
geom_text(aes(x = best_threshold, y = 0.5,
label = paste("Best Threshold =", round(best_threshold, 2))),
color = "gray50", angle = 90, hjust = -0.2, size = 2.5) +
labs(title = "Model Performance Metrics by Threshold",
x = "Threshold", y = "Metric Value") +
theme_bw() +
scale_color_brewer(palette = "Set1")
print(plot_metrics)
# Save the plots and metrics if needed
# ggsave("figures/Obj2model_metrics_plotV2.png", plot_metrics)
# write.csv(metrics_df, "figures/Obj2model_metrics_by_thresholdV2.csv", row.names = FALSE)
I found a threshold (0.20), which allowed the model to achieve the largest PPV once a minimum Sensitivity of 0.80 was reached.
There is a trade-off between over-treating or under-treating. We have to consider the cost of someone at high risk having a fracture because they were not diagnosed and treated. Additionally we need to weigh that against the cost of treating unnecessarily, e.g. inconvenience to the patient, cost of the treatment, and treatment side effects. If we optimize sensitivity with a lower threshold, then we err on the side of over-treating. It seems best to choose a minimum proportion of those at risk that we want to be certain receive treatment and after reaching that minimum try to minimize the cost of treating unnecessarily by making our PPV as high as possible to ensure our positive predictions are correct.
This looks at the bias-variance trade-off.
# Define models of increasing complexity
models <- list(
"Simple" = fracture ~ priorfrac + momfrac + bonemed_fu,
"Additive" = fracture ~ priorfrac + weight + height + premeno + momfrac + armassist + smoke + raterisk + fracscore + bonemed + bonemed_fu,
"Tue" = fracture ~ priorfrac + age + weight + height + momfrac + armassist + raterisk + bonemed,
"Tom" = fracture ~ priorfrac + fracscore + weight + height + momfrac + armassist + raterisk + bonemed_fu,
"Multiplicative" = fracture ~ priorfrac + weight + height + premeno + momfrac + smoke + raterisk + fracscore + bonemed + bonemed_fu + priorfrac:bonemed + priorfrac:weight + momfrac:raterisk + momfrac:bonemed + momfrac:smoke + smoke:bonemed + raterisk:bonemed_fu,
"Quadratic" = fracture ~ fracscore + weight + poly(weight, 2) + height + smoke + momfrac + premeno + bonemed + priorfrac + priorfrac:premeno + priorfrac:bonemed + priorfrac:weight + priorfrac:poly(weight, 2) + smoke:bonemed + momfrac:bonemed + weight:bonemed + poly(weight, 2):bonemed
)
# Initialize an empty data frame to store errors and metrics
errors <- data.frame()
# Train models and calculate errors
set.seed(1234)
for (i in seq_along(models)) {
model_name <- names(models)[i]
model_formula <- models[[i]]
# Fit model
model <- glm(model_formula, data = training, family = "binomial")
# Predict on training and validation sets
train_preds <- predict(model, newdata = training, type = "response")
val_preds <- predict(model, newdata = validate, type = "response")
# Compute log-loss (or classification error)
train_error <- Metrics::logLoss(actual = ifelse(training$fracture == "Yes", 1, 0), predicted = train_preds)
val_error <- Metrics::logLoss(actual = ifelse(validate$fracture == "Yes", 1, 0), predicted = val_preds)
# Calculate ROC metric for the validation set
roc_obj <- pROC::roc(response = validate$fracture, predictor = val_preds, levels = c("No", "Yes"))
roc_auc <- as.numeric(pROC::auc(roc_obj))
# Store results
errors <- rbind(errors, data.frame(
Model = model_name,
Complexity = i, # Higher index -> more complex
TrainingError = train_error,
ValidationError = val_error,
ROC = roc_auc
))
}
# Plot training and validation errors vs complexity
ggplot(errors, aes(x = Complexity)) +
geom_line(aes(y = TrainingError, color = "Training Logloss Error"), linewidth = 1) +
geom_line(aes(y = ValidationError, color = "Validation Logloss Error"), linewidth = 1) +
geom_line(aes(y = ROC, color = "ROC"), linewidth = 1, linetype = "dotted") +
geom_point(aes(y = TrainingError, color = "Training Logloss Error"), size = 1.5) +
geom_point(aes(y = ValidationError, color = "Validation Logloss Error"), size = 1.5) +
geom_point(aes(y = ROC, color = "ROC"), size = 1) +
scale_x_continuous(breaks = 1:length(models), labels = names(models)) +
labs(
title = "Bias-Variance Tradeoff",
x = "Model Complexity",
y = "Metric",
color = "Metric Type"
) +
theme_bw() +
scale_color_manual(values = c("gray30", "turquoise4", "darkorange3"))
Here, I will get the coefficients for the final Objective 2 model for prediction and set a threshold balancing the trade-offs between misclassifications, over-treating, and undertreating to get a confusion matrix. Optimizing the threshold is very subjective, and it is something we may want to revisit. I have choosen to achieve at least 80% sensitivity to avoid missing high risk patients, and then get as high a PPV as possible to unnecessarily treat as few as possible.
# Fit the model on the training data
final_model <- glm(
fracture ~ priorfrac + weight + height + premeno + momfrac + smoke + raterisk + fracscore + bonemed + bonemed_fu + priorfrac:bonemed + priorfrac:weight + momfrac:raterisk + momfrac:bonemed + momfrac:smoke + smoke:bonemed + raterisk:bonemed_fu,
data = training,
family = "binomial"
)
# Predict probabilities on the validation data
pred_probs <- predict(final_model, newdata = validate, type = "response")
# Get the model summary
# summary(final_model)
# Calculate coefficients and confidence intervals via odds ratios
# Align the confidence intervals with the coefficients
conf_intervals <- confint(final_model)
coefficients <- coef(summary(final_model))
# Find common row names to align them
common_rows <- intersect(rownames(coefficients), rownames(conf_intervals))
# Subset both to only include rows present in both
aligned_coefficients <- coefficients[common_rows, , drop = FALSE]
aligned_conf_intervals <- conf_intervals[common_rows, , drop = FALSE]
# Construct the summary table
summary_table <- cbind(
OR = exp(aligned_coefficients[, 1]), # Odds Ratios
`Lower CI` = exp(aligned_conf_intervals[, 1]), # Lower 95% CI
`Upper CI` = exp(aligned_conf_intervals[, 2]), # Upper 95% CI
`p-value` = aligned_coefficients[, 4] # p-values
)
# Convert to data frame for readability
summary_df <- as.data.frame(summary_table)
summary_df <- cbind(Variable = rownames(summary_table), summary_df)
rownames(summary_df) <- NULL
# Print the summary table
Obj2Model_table = gt(summary_df)
Obj2Model_table
| Variable | OR | Lower CI | Upper CI | p-value |
|---|---|---|---|---|
| (Intercept) | 1.429096e+02 | 8.796145e-02 | 2.786799e+05 | 0.19211779 |
| priorfracYes | 8.735511e-01 | 5.094300e-02 | 1.477655e+01 | 0.92520717 |
| weight | 1.003792e+00 | 9.786629e-01 | 1.028368e+00 | 0.76319893 |
| height | 9.541155e-01 | 9.097152e-01 | 9.990308e-01 | 0.04843128 |
| premenoYes | 1.492206e+00 | 7.157343e-01 | 3.050440e+00 | 0.27664087 |
| momfracYes | 2.327037e+00 | 3.043107e-01 | 1.248026e+01 | 0.34962474 |
| smokeYes | 2.989384e-01 | 4.276989e-02 | 1.227005e+00 | 0.14115928 |
| rateriskSame | 6.613486e-01 | 2.779474e-01 | 1.526743e+00 | 0.33810101 |
| rateriskGreater | 8.848926e-01 | 3.243645e-01 | 2.298111e+00 | 0.80527645 |
| fracscore | 1.226205e+00 | 1.081536e+00 | 1.395238e+00 | 0.00163925 |
| bonemedYes | 9.499833e-01 | 2.583419e-01 | 3.410782e+00 | 0.93747251 |
| bonemed_fuYes | 6.828775e-01 | 8.545905e-02 | 3.648911e+00 | 0.67934261 |
| priorfracYes:bonemedYes | 4.832539e-01 | 1.307311e-01 | 1.780721e+00 | 0.27339477 |
| priorfracYes:weight | 1.013018e+00 | 9.767862e-01 | 1.051070e+00 | 0.48671485 |
| momfracYes:rateriskSame | 2.337376e+00 | 2.909083e-01 | 2.415223e+01 | 0.43910281 |
| momfracYes:rateriskGreater | 2.331703e+00 | 2.746159e-01 | 2.543676e+01 | 0.45259513 |
| momfracYes:bonemedYes | 1.005446e-01 | 1.488388e-02 | 5.672243e-01 | 0.01201033 |
| momfracYes:smokeYes | 7.081387e-07 | NA | 9.234488e+121 | 0.99223692 |
| smokeYes:bonemedYes | 4.694597e+07 | 1.511664e-37 | NA | 0.98225338 |
| rateriskSame:bonemed_fuYes | 6.155011e+00 | 1.021168e+00 | 5.348160e+01 | 0.06332783 |
| rateriskGreater:bonemed_fuYes | 2.612131e+00 | 4.356860e-01 | 2.262182e+01 | 0.32547658 |
# Save the table as a png for ppt
# gtsave(Obj2Model_table, "figures/Obj2Model_tableV2.png")
# Change the threshold for at least 0.8 sensitivity and highest PPV and create the confusion matrix
threshold <- 0.20 # overtreat 40, undertreat 7 -> incorrect 47, correct 102
# Ensure consistent factor levels
obj2.preds <- factor(ifelse(pred_probs > threshold, "Yes", "No"), levels = c("No", "Yes"))
validate$fracture <- factor(validate$fracture, levels = c("No", "Yes"))
# Confusion matrix with explicit positive class
conf_matrix <- confusionMatrix(data = obj2.preds, reference = validate$fracture, positive = "Yes")
print(conf_matrix)
Confusion Matrix and Statistics
Reference
Prediction No Yes
No 72 7
Yes 40 30
Accuracy : 0.6846
95% CI : (0.6035, 0.7582)
No Information Rate : 0.7517
P-Value [Acc > NIR] : 0.9745
Kappa : 0.3493
Mcnemar's Test P-Value : 3.046e-06
Sensitivity : 0.8108
Specificity : 0.6429
Pos Pred Value : 0.4286
Neg Pred Value : 0.9114
Prevalence : 0.2483
Detection Rate : 0.2013
Detection Prevalence : 0.4698
Balanced Accuracy : 0.7268
'Positive' Class : Yes
The formula of the final Objective 2 logistic regression model
is:
fracture ~ priorfrac + weight + height + premeno + momfrac + smoke + raterisk + fracscore + bonemed + bonemed_fu + priorfrac:bonemed + priorfrac:weight + momfrac:raterisk + momfrac:bonemed + momfrac:smoke + smoke:bonemed + raterisk:bonemed_fu.
With a threshold of 0.20 to prioritize treating high risk individuals
while maintain reasonable avoidance of unnecessary intervention for low
risk individuals, the model achieves the following metrics: - AUROC:
0.76 - Sensitivity: 0.81
- Specificity: 0.64 - PPV: 0.43
- NPV: 0.91
- Prevalence: 0.25
- Logloss Validation: 0.50 - Logloss CV: 0.62
- Hoslem p-value: 0.36
Here, I try different predictors in both LDA and QDA. The caret package essentially dummetizes the categorical predictors, which may weaken the assumption of MVN.
set.seed(1234)
# Define trainControl
fitControl <- trainControl(method = "repeatedcv",
number = 10,
repeats = 1,
classProbs = TRUE,
summaryFunction = mnLogLoss)
# Ensure levels of 'fracture' are consistent in both datasets
training$fracture <- factor(training$fracture, levels = c("No", "Yes"))
validate$fracture <- factor(validate$fracture, levels = c("No", "Yes"))
# Define predictor combinations
model_formulas <- list(
obj1fit = fracture ~ priorfrac + weight + height + premeno + momfrac + armassist + smoke + raterisk + fracscore + bonemed + bonemed_fu,
fit_num1 = fracture ~ weight + height + fracscore,
# fit_numInt1 = fracture ~ weight + height + weight:height + fracscore,
# fit_numPoly1 = fracture ~ weight + poly(weight, 2) + height + fracscore,
# fit_num2 = fracture ~ weight + height + fracscore + age,
fit_num3 = fracture ~ bmi + fracscore + age + weight + height,
# fit_numPoly3 = fracture ~ bmi + fracscore + age + weight + poly(weight, 2) + height,
# fit3 = fracture ~ priorfrac + weight + height + premeno + momfrac + armassist + raterisk + fracscore + bonemed_fu,
# fit5 = fracture ~ priorfrac + weight + height + premeno + momfrac + armassist + smoke + raterisk + fracscore + bonemed_fu,
fit2b = fracture ~ priorfrac + weight + height + momfrac + smoke + raterisk + fracscore + bonemed_fu,
fit4b = fracture ~ priorfrac + weight + height + momfrac + raterisk + fracscore + bonemed_fu
# Multiplicative = fracture ~ height + premeno + momfrac + smoke + fracscore + priorfrac:bonemed + priorfrac:weight + momfrac:raterisk + momfrac:bonemed + momfrac:smoke + smoke:bonemed + raterisk:bonemed_fu
# Quadratic = fracture ~ fracscore + weight + poly(weight, 2) + height + smoke + momfrac + premeno + bonemed + priorfrac + priorfrac:premeno + priorfrac:bonemed + priorfrac:weight + priorfrac:poly(weight, 2) + smoke:bonemed + momfrac:bonemed + weight:bonemed + poly(weight, 2):bonemed
)
# Define thresholds to test
thresholds <- seq(0.1, 0.9, by = 0.01)
# Function to calculate log loss manually
calculate_logloss <- function(obs, pred_probs) {
epsilon <- 1e-15 # Avoid log(0)
pred_probs <- pmax(pmin(pred_probs, 1 - epsilon), epsilon)
-mean(ifelse(obs == "Yes", log(pred_probs), log(1 - pred_probs)))
}
# Function to evaluate a single model/formula
evaluate_model <- function(model, formula_name, formula, training, validate, thresholds) {
# Try-catch to handle model fitting errors
fit <- tryCatch({
train(
formula,
data = training,
method = model,
trControl = fitControl,
metric = "logLoss"
)
}, error = function(e) return(NULL))
# Skip evaluation if model training failed
if (is.null(fit)) return(NULL)
# Extract CV log loss
cv_logloss <- min(fit$results$logLoss, na.rm = TRUE)
# Compute predicted probabilities on validation set
predictions <- predict(fit, validate, type = "prob")[, "Yes"]
# Compute validation log loss
validation_logloss <- calculate_logloss(validate$fracture, predictions)
# Initialize result storage
best_result <- NULL
# Loop through thresholds
for (threshold in thresholds) {
preds <- factor(ifelse(predictions > threshold, "Yes", "No"), levels = c("No", "Yes"))
cm <- confusionMatrix(data = preds, reference = validate$fracture, positive = "Yes")
# Calculate additional metrics
auroc <- auc(validate$fracture == "Yes", predictions)
ppv <- cm$byClass["Pos Pred Value"]
sensitivity <- cm$byClass["Sensitivity"]
specificity <- cm$byClass["Specificity"]
# Filter for models meeting Sensitivity >= 0.8
if (sensitivity >= 0.8) {
result <- data.frame(
Model = model,
FormulaName = formula_name, # Use formula name instead of the formula itself
Threshold = threshold,
AUROC = auroc,
CV_LogLoss = cv_logloss,
Validation_LogLoss = validation_logloss,
Sensitivity = sensitivity,
Specificity = specificity,
PPV = ppv,
NPV = cm$byClass["Neg Pred Value"],
Prevalence = cm$byClass["Prevalence"]
)
# Keep only the best (highest PPV) model for this formula
if (is.null(best_result) || result$PPV > best_result$PPV) {
best_result <- result
}
}
}
return(best_result)
}
# Initialize results storage
lda_results <- data.frame()
# Test models with LDA
for (formula_name in names(model_formulas)) {
formula <- model_formulas[[formula_name]]
result <- evaluate_model("lda", formula_name, formula, training, validate, thresholds)
if (!is.null(result)) {
lda_results <- rbind(lda_results, result)
}
}
# Ensure AUROC is numeric before sorting
lda_results$AUROC <- as.numeric(lda_results$AUROC)
# Organize results by AUROC (desc), then Sensitivity (desc), then PPV (desc)
lda_results <- lda_results[order(-lda_results$AUROC, -lda_results$Sensitivity, -lda_results$PPV), ]
top_models <- head(lda_results, n = 3) # Select top 3 models
print(lda_results)
Model FormulaName Threshold AUROC CV_LogLoss
Sensitivity3 lda fit2b 0.20 0.7471042 0.5389635
Sensitivity4 lda fit4b 0.20 0.7451737 0.5366474
Sensitivity lda obj1fit 0.17 0.7415541 0.5515056
Sensitivity2 lda fit_num3 0.21 0.7094595 0.5407943
Sensitivity1 lda fit_num1 0.18 0.6982384 0.5329027
Validation_LogLoss Sensitivity Specificity PPV NPV
Sensitivity3 0.5144550 0.8108108 0.6517857 0.4347826 0.9125000
Sensitivity4 0.5151788 0.8378378 0.6428571 0.4366197 0.9230769
Sensitivity 0.5157334 0.8108108 0.5535714 0.3750000 0.8985507
Sensitivity2 0.5123398 0.8108108 0.5982143 0.4000000 0.9054054
Sensitivity1 0.5170010 0.8108108 0.4642857 0.3333333 0.8813559
Prevalence
Sensitivity3 0.2483221
Sensitivity4 0.2483221
Sensitivity 0.2483221
Sensitivity2 0.2483221
Sensitivity1 0.2483221
# # Test top models with QDA
# qda_results <- data.frame()
# for (i in 1:nrow(top_models)) {
# # Retrieve the formula using the formula name
# formula_name <- top_models$FormulaName[i]
# formula <- model_formulas[[formula_name]]
#
# # Evaluate the model with QDA
# result <- evaluate_model("qda", formula_name, formula, training, validate, thresholds)
# if (!is.null(result)) {
# qda_results <- rbind(qda_results, result)
# }
# }
#
# # Organize QDA results by AUROC (desc), Sensitivity (desc), and PPV (desc)
# qda_results$AUROC <- as.numeric(qda_results$AUROC)
# qda_results <- qda_results[order(-qda_results$AUROC, -qda_results$Sensitivity, -qda_results$PPV), ]
# print(qda_results)
# Additional considerations
# Interaction terms: Consider them if domain knowledge suggests potential interactions.
# Composite variables: Create if logical combinations or transformations improve interpretability or predictive power.
The LDA models performed better than the QDA models.
Here, I explored how LDA would perform with the PCs rather than original numerical predictors.
# Explore LDA with the PCs from PCA
set.seed(1234)
# Split pc.scores into training and validation sets
trainIndex <- createDataPartition(pc.scores$fracture, p = 0.7, list = FALSE) # 70% training, 30% validation
pca.training <- pc.scores[trainIndex, ]
pca.validation <- pc.scores[-trainIndex, ]
# Check structure of training and validation datasets
str(pca.training)
'data.frame': 351 obs. of 6 variables:
$ fracture: Factor w/ 2 levels "No","Yes": 1 1 1 1 1 1 1 1 1 1 ...
$ PC1 : num -1.452 1.595 0.209 2.748 -0.786 ...
$ PC2 : num 0.457 0.438 -0.234 0.422 -2.034 ...
$ PC3 : num 0.5601 -0.1976 0.0262 1.1689 -0.3956 ...
$ PC4 : num 0.27 0.657 -0.229 0.472 0.146 ...
$ PC5 : num 0.01149 -0.0098 0.00583 -0.08762 0.01833 ...
str(pca.validation)
'data.frame': 149 obs. of 6 variables:
$ fracture: Factor w/ 2 levels "No","Yes": 1 1 1 1 1 1 1 1 1 1 ...
$ PC1 : num -0.717 3.756 -0.577 3.435 2.187 ...
$ PC2 : num -0.872 1.451 -0.802 -0.233 1.271 ...
$ PC3 : num 0.823 -0.351 1.791 0.535 0.247 ...
$ PC4 : num 0.2202 -0.5973 0.1291 -0.0211 0.421 ...
$ PC5 : num -0.00188 -0.02011 0.01323 -0.11383 -0.01166 ...
# Define trainControl for LDA
fitControl <- trainControl(method = "repeatedcv",
number = 10,
repeats = 1,
classProbs = TRUE,
summaryFunction = mnLogLoss)
# Fit the LDA model using PC1 and PC2 (adjust as needed)
fit.pca <- train(
fracture ~ PC1 + PC2, # Use relevant PCs
data = pca.training,
method = "lda",
trControl = fitControl,
metric = "logLoss"
)
# Predict probabilities on validation set
pca.predictions <- predict(fit.pca, pca.validation, type = "prob")[, "Yes"]
# Apply a threshold to classify
threshold <- 0.5 # Adjust threshold as needed
pca.preds <- factor(ifelse(pca.predictions > threshold, "Yes", "No"), levels = c("No", "Yes"))
# Compute confusion matrix
pca.cm <- confusionMatrix(data = pca.preds, reference = pca.validation$fracture, positive = "Yes")
print(pca.cm)
Confusion Matrix and Statistics
Reference
Prediction No Yes
No 109 35
Yes 3 2
Accuracy : 0.745
95% CI : (0.6672, 0.8128)
No Information Rate : 0.7517
P-Value [Acc > NIR] : 0.6175
Kappa : 0.0384
Mcnemar's Test P-Value : 4.934e-07
Sensitivity : 0.05405
Specificity : 0.97321
Pos Pred Value : 0.40000
Neg Pred Value : 0.75694
Prevalence : 0.24832
Detection Rate : 0.01342
Detection Prevalence : 0.03356
Balanced Accuracy : 0.51363
'Positive' Class : Yes
# Print the model and important metrics
print(fit.pca)
Linear Discriminant Analysis
351 samples
2 predictor
2 classes: 'No', 'Yes'
No pre-processing
Resampling: Cross-Validated (10 fold, repeated 1 times)
Summary of sample sizes: 315, 316, 316, 316, 316, 316, ...
Resampling results:
logLoss
0.5371827
# Predict probabilities on validation set
pca.predictions <- predict(fit.pca, pca.validation, type = "prob")[, "Yes"]
# Compute AUROC
pca.roc <- roc(pca.validation$fracture, pca.predictions, levels = c("No", "Yes"))
auroc <- auc(pca.roc)
print(paste("AUROC:", round(auroc, 3)))
[1] "AUROC: 0.677"
# Initialize a data frame to store metrics at different thresholds
thresholds <- seq(0.01, 0.99, by = 0.01)
results <- data.frame(
Threshold = thresholds,
Sensitivity = NA,
Specificity = NA,
PPV = NA,
NPV = NA
)
# Loop through thresholds to calculate metrics
for (i in seq_along(thresholds)) {
threshold <- thresholds[i]
pca.preds <- factor(ifelse(pca.predictions > threshold, "Yes", "No"), levels = c("No", "Yes"))
cm <- confusionMatrix(data = pca.preds, reference = pca.validation$fracture, positive = "Yes")
# Store metrics
results$Sensitivity[i] <- cm$byClass["Sensitivity"]
results$Specificity[i] <- cm$byClass["Specificity"]
results$PPV[i] <- cm$byClass["Pos Pred Value"]
results$NPV[i] <- cm$byClass["Neg Pred Value"]
}
# Find threshold with at least 0.8 sensitivity and maximum PPV
optimal_threshold <- results %>%
filter(Sensitivity >= 0.8) %>%
arrange(desc(PPV)) %>%
slice(1)
print(optimal_threshold)
Threshold Sensitivity Specificity PPV NPV
1 0.2 0.8108108 0.5089286 0.3529412 0.890625
# Apply the optimal threshold to classify
final_threshold <- optimal_threshold$Threshold
pca.final.preds <- factor(ifelse(pca.predictions > final_threshold, "Yes", "No"), levels = c("No", "Yes"))
final_cm <- confusionMatrix(data = pca.final.preds, reference = pca.validation$fracture, positive = "Yes")
# Print final confusion matrix and metrics
print(final_cm)
Confusion Matrix and Statistics
Reference
Prediction No Yes
No 57 7
Yes 55 30
Accuracy : 0.5839
95% CI : (0.5004, 0.664)
No Information Rate : 0.7517
P-Value [Acc > NIR] : 1
Kappa : 0.2229
Mcnemar's Test P-Value : 2.387e-09
Sensitivity : 0.8108
Specificity : 0.5089
Pos Pred Value : 0.3529
Neg Pred Value : 0.8906
Prevalence : 0.2483
Detection Rate : 0.2013
Detection Prevalence : 0.5705
Balanced Accuracy : 0.6599
'Positive' Class : Yes
print(paste("Final Threshold:", final_threshold))
[1] "Final Threshold: 0.2"
There didn’t seem to be an improvement.
Here, I will fit the final LDA model, adjust the threshold, and get the confusion matrix and metrics.
fitControl<-trainControl(method="repeatedcv",number=10,repeats=1,classProbs=TRUE, summaryFunction=mnLogLoss)
set.seed(1234)
lda.fit<-train(fracture ~ priorfrac + weight + height + momfrac + raterisk + fracscore + bonemed_fu,
data=training,
method="lda",
trControl=fitControl,
metric="logLoss")
#Computing predicted probabilities on the validation set
predictions <- predict(lda.fit, validate, type = "prob")[,"Yes"]
# Change the threshold to achieve high PPV with sensitivity over 0.8
threshold=0.20
lda.preds<-factor(ifelse(predictions>threshold,"Yes","No"),levels=c("Yes","No"))
confusionMatrix(data = lda.preds, reference = validate$fracture, positive="Yes")
Confusion Matrix and Statistics
Reference
Prediction No Yes
No 72 6
Yes 40 31
Accuracy : 0.6913
95% CI : (0.6105, 0.7643)
No Information Rate : 0.7517
P-Value [Acc > NIR] : 0.9617
Kappa : 0.3676
Mcnemar's Test P-Value : 1.141e-06
Sensitivity : 0.8378
Specificity : 0.6429
Pos Pred Value : 0.4366
Neg Pred Value : 0.9231
Prevalence : 0.2483
Detection Rate : 0.2081
Detection Prevalence : 0.4765
Balanced Accuracy : 0.7403
'Positive' Class : Yes
LDA is likely not the best tool for this dataset with only 5 numerical predictors. Dummetizing the categorical predictors improves the performance. However, it weakens the assumption of multivariate normality.
The formula of the final Objective 2 LDA model is:
fracture ~ priorfrac + weight + height + momfrac + raterisk + fracscore + bonemed_fu
(fit4b).
With a threshold of 0.20 to prioritize treating high risk individuals
while maintain reasonable avoidance of unnecessary intervention for low
risk individuals, the model achieves the following metrics:
- AUROC: 0.75 - Sensitivity: 0.84
- Specificity: 0.64 - PPV: 0.44
- NPV: 0.92
- Prevalence: 0.25
- Logloss Validation: 0.52 - Logloss CV: 0.54
Sticking only with numerical predictors, using all numerical
predictors from the full dataset results in a model that performs nearly
as well as dummetizing variables. That model is:
fracture ~ bmi + fracscore + age + weight + height.
With a threshold of 0.21, the model achieves the following
metrics:
- AUROC: 0.71 - Sensitivity: 0.81
- Specificity: 0.60 - PPV: 0.40
- NPV: 0.91
- Prevalence: 0.25
- Logloss Validation: 0.51 - Logloss CV: 0.54
Here I experimented with plotting the decision boundary for two of the more important predictors just out of curiousity.
# Set up control and validation grid
fitControl <- trainControl(method = "repeatedcv", number = 10, repeats = 1, classProbs = TRUE, summaryFunction = mnLogLoss)
set.seed(1234)
# Model with dummetized variables
threshold = 0.2
lda.fit<-train(fracture ~ priorfrac + weight + height + momfrac + raterisk + fracscore + bonemed_fu,
data=training,
method="lda",
trControl=fitControl,
metric="logLoss")
# Model with only numerical variables
# threshold = 0.21
# lda.fit<-train(fracture ~ bmi + fracscore + age + weight + height,
# data=training,
# method="lda",
# trControl=fitControl,
# metric="logLoss")
#Computing predicted probabilities on the validation set
predictions <- predict(lda.fit, validate, type = "prob")[,"Yes"]
class_pred <- as.factor(ifelse(predictions > threshold, "1", "0"))
color_array <- c("red3", "blue3")[as.numeric(class_pred)]
# Plot data and decision boundary
plot(training$height, training$fracscore, col = training$fracture, asp = 1, pch = 16, cex = 0.7, xlab = "height", ylab = "fracscore")
Here, I fit a random forest model with all of my original variables to get the variable importance scores. I will use them with previous model information to test for the best performance.
# Build a random forest model to look at variable importance
set.seed(1234)
train_control = trainControl(method = "repeatedcv", number = 10, repeats = 1)
# Define the grid for tuning 'mtry'
tune_grid = expand.grid(
.mtry = c(2:6) # Number of variables randomly sampled as candidates at each split (good start p/3)
)
# str(bone_subset %>% dplyr::select(-site_id, -phy_id, -bmi, -bonetreat))
df = training %>% dplyr::select(-site_id, -phy_id, -bmi, -bonetreat)
# Fit a random forest model (bagging is inherent in random forest)
model_rf = train(
fracture ~ .,
data = df,
method = "rf", # Use 'randomForest' method
trControl = train_control,
tuneGrid = tune_grid,
# tunelength = 3,
ntree = 100, # Specify the number of trees here
importance = TRUE # Compute variable importance
)
# Print the best model parameters
print(model_rf)
Random Forest
351 samples
12 predictor
2 classes: 'No', 'Yes'
No pre-processing
Resampling: Cross-Validated (10 fold, repeated 1 times)
Summary of sample sizes: 316, 317, 315, 315, 316, 316, ...
Resampling results across tuning parameters:
mtry Accuracy Kappa
2 0.7266293 0.01657723
3 0.7238515 0.07137503
4 0.7181373 0.06325066
5 0.7211578 0.07931540
6 0.7211531 0.07738378
Accuracy was used to select the optimal model using the largest value.
The final value used for the model was mtry = 2.
# Plot the performance across different parameter combinations
plot(model_rf)
# Get the importance of variables
var_importance = varImp(model_rf)
# Print the importance scores
print(var_importance)
rf variable importance
Importance
fracscore 100.00
priorfracYes 92.72
momfracYes 88.13
bonemed_fuYes 87.63
age 83.75
height 77.43
bonemedYes 72.63
rateriskSame 59.29
smokeYes 52.33
weight 44.35
rateriskGreater 30.62
premenoYes 11.88
armassistYes 0.00
# Plot the variable importance
plot(var_importance)
Here, I try different predictors in a random forest model to test for the best performance.
set.seed(1234)
# Define training control for cross-validation
train_control <- trainControl(
method = "repeatedcv",
number = 10,
repeats = 1,
classProbs = TRUE,
summaryFunction = mnLogLoss
)
# Define the grid for tuning 'mtry'
tune_grid <- expand.grid(
.mtry = 2:6 # Number of variables randomly sampled as candidates at each split
)
# Define predictor combinations
predictor_sets <- list(
obj1fit = fracture ~ priorfrac + weight + height + premeno + momfrac + armassist + smoke + raterisk + fracscore + bonemed + bonemed_fu,
fit2b = fracture ~ priorfrac + weight + height + momfrac + smoke + raterisk + fracscore + bonemed_fu,
fit4b = fracture ~ priorfrac + weight + height + momfrac + raterisk + fracscore + bonemed_fu, #ldafit
Multiplicative = fracture ~ height + premeno + momfrac + smoke + fracscore + priorfrac:bonemed + priorfrac:weight + momfrac:raterisk + momfrac:bonemed + momfrac:smoke + smoke:bonemed + raterisk:bonemed_fu,
obj2fit = fracture ~ priorfrac + weight + height + premeno + momfrac + smoke + raterisk + fracscore + bonemed + bonemed_fu + priorfrac:bonemed + priorfrac:weight + momfrac:raterisk + momfrac:bonemed + momfrac:smoke + smoke:bonemed + raterisk:bonemed_fu,
Quadratic = fracture ~ fracscore + weight + poly(weight, 2) + height + smoke + momfrac + premeno + bonemed + priorfrac + priorfrac:premeno + priorfrac:bonemed + priorfrac:weight + priorfrac:poly(weight, 2) + smoke:bonemed + momfrac:bonemed + weight:bonemed + poly(weight, 2):bonemed,
fit_imp1 = fracture ~ priorfrac + fracscore + bonemed + height + bonemed_fu + weight,
fit_imp2 = fracture ~ priorfrac + fracscore + bonemed + height + bonemed_fu + age + weight,
fit_imp3 = fracture ~ priorfrac + fracscore + bonemed + height + bonemed_fu + weight + smoke + raterisk, # chosen
fit_imp3b = fracture ~ priorfrac + fracscore + bonemed + height + bonemed_fu + weight + smoke + raterisk + priorfrac:bonemed + priorfrac:weight + smoke:bonemed + raterisk:bonemed_fu
# fit_imp4 = fracture ~ priorfrac + fracscore + bonemed + height + bonemed_fu + weight + smoke + raterisk + momfrac
)
# Initialize results storage
results <- data.frame(
Predictor_Set = character(),
Threshold = numeric(),
Sensitivity = numeric(),
Specificity = numeric(),
PPV = numeric(),
NPV = numeric(),
AUROC = numeric(),
LogLoss_Validation = numeric(),
LogLoss_CV = numeric(),
stringsAsFactors = FALSE
)
# Loop through predictor combinations
for (predictor_name in names(predictor_sets)) {
formula <- predictor_sets[[predictor_name]]
# Train the Random Forest model
model_rf <- train(
formula,
data = training,
method = "rf",
trControl = train_control,
tuneGrid = tune_grid,
metric = "logLoss",
ntree = 100,
importance = TRUE
)
# Compute cross-validation log loss
cv_logloss <- min(model_rf$results$logLoss)
# Predict probabilities on validation set
pred_probs <- predict(model_rf, validate, type = "prob")[, "Yes"]
# Compute validation log loss
calculate_logloss <- function(obs, pred_probs) {
epsilon <- 1e-15
pred_probs <- pmax(pmin(pred_probs, 1 - epsilon), epsilon)
-mean(ifelse(obs == "Yes", log(pred_probs), log(1 - pred_probs)))
}
validation_logloss <- calculate_logloss(validate$fracture, pred_probs)
# Compute AUROC
roc_curve <- roc(validate$fracture, pred_probs, levels = c("No", "Yes"))
auroc <- auc(roc_curve)
# Optimize threshold for Sensitivity ≥ 0.8 and highest PPV
thresholds <- seq(0, 1, by = 0.01)
stats <- data.frame(
Threshold = numeric(),
Sensitivity = numeric(),
Specificity = numeric(),
PPV = numeric(),
NPV = numeric()
)
for (threshold in thresholds) {
class_pred <- factor(ifelse(pred_probs > threshold, "Yes", "No"), levels = c("No", "Yes"))
cm <- confusionMatrix(class_pred, validate$fracture, positive = "Yes")
stats <- rbind(stats, data.frame(
Threshold = threshold,
Sensitivity = cm$byClass["Sensitivity"],
Specificity = cm$byClass["Specificity"],
PPV = cm$byClass["Pos Pred Value"],
NPV = cm$byClass["Neg Pred Value"]
))
}
# Filter for Sensitivity ≥ 0.8 and find the best threshold
best_stats <- stats %>% filter(Sensitivity >= 0.8) %>% arrange(desc(PPV))
if (nrow(best_stats) > 0) {
best_threshold <- best_stats$Threshold[1]
best_row <- best_stats[1, ]
# Store the results for this predictor set
results <- rbind(results, data.frame(
Predictor_Set = predictor_name,
Threshold = best_threshold,
Sensitivity = best_row$Sensitivity,
Specificity = best_row$Specificity,
PPV = best_row$PPV,
NPV = best_row$NPV,
AUROC = auroc,
LogLoss_Validation = validation_logloss,
LogLoss_CV = cv_logloss
))
} else {
# If no threshold meets the sensitivity requirement, store NA for this predictor set
results <- rbind(results, data.frame(
Predictor_Set = predictor_name,
Threshold = NA,
Sensitivity = NA,
Specificity = NA,
PPV = NA,
NPV = NA,
AUROC = auroc,
LogLoss_Validation = validation_logloss,
LogLoss_CV = cv_logloss
))
}
}
# Print the results
results <- results[order(-as.numeric(results$AUROC), -results$Sensitivity, -results$PPV), ]
print(results)
Predictor_Set Threshold Sensitivity Specificity PPV NPV
5 obj2fit 0.13 0.8648649 0.4642857 0.3478261 0.9122807
4 Multiplicative 0.11 0.8108108 0.4196429 0.3157895 0.8703704
10 fit_imp3b 0.12 0.8378378 0.4196429 0.3229167 0.8867925
3 fit4b 0.11 0.9189189 0.3303571 0.3119266 0.9250000
2 fit2b 0.12 0.8648649 0.4017857 0.3232323 0.9000000
9 fit_imp3 0.16 0.8108108 0.4285714 0.3191489 0.8727273
1 obj1fit 0.18 0.8648649 0.4285714 0.3333333 0.9056604
6 Quadratic 0.10 0.8378378 0.3839286 0.3100000 0.8775510
8 fit_imp2 0.16 0.8108108 0.4553571 0.3296703 0.8793103
7 fit_imp1 0.15 0.8108108 0.4107143 0.3125000 0.8679245
AUROC LogLoss_Validation LogLoss_CV
5 0.7098214 0.7234368 0.6670635
4 0.6951014 0.5368528 0.6637025
10 0.6794160 0.5489315 0.5581826
3 0.6755550 0.5437283 0.5860024
2 0.6651786 0.5517397 0.6682615
9 0.6637307 0.5445528 0.5513502
1 0.6619208 0.7488199 0.5657350
6 0.6403234 0.7856626 0.7681333
8 0.6323600 0.5712087 0.6611917
7 0.6164334 0.5766410 0.6884587
finalAUROC = results$AUROC
I chose the predictors for my random forest model using the predictors with an importance over 10% (with the exception of age which was highly correlated with fracscore). The chosen model results in a comparatively low validation logloss that is also similar to its cv logloss. The AUROC and PPV are also comparatively high with a sensitivity of over .80 with a threshold of 0.16.
The formula of the final Objective 2 non-parametric, Random Forest
model is:
fracture ~ priorfrac + fracscore + bonemed + height + bonemed_fu + weight + smoke + raterisk
(fit_imp3).
With a threshold of 0.16 to prioritize treating high risk individuals
while maintain reasonable avoidance of unnecessary intervention for low
risk individuals, the model achieves the following metrics:
- AUROC: 0.68 (This value might change even with a random seed, because
RF makes different decisions trees each time.) - Sensitivity: 0.81
- Specificity: 0.50 - PPV: 0.35
- NPV: 0.89
- Prevalence: 0.25
- Logloss Validation: 0.54 - Logloss CV: 0.58
This plots an ROC curve for all the models.
# Fit the models first
set.seed(1234)
logistic_obj1 <- glm(
formula = fracture ~ priorfrac + weight + height + premeno + momfrac + armassist + smoke + raterisk + fracscore + bonemed + bonemed_fu,
data = training,
family = "binomial"
)
logistic_obj2 <- glm(
formula = fracture ~ priorfrac + weight + height + premeno + momfrac + smoke + raterisk + fracscore + bonemed + bonemed_fu + priorfrac:bonemed + priorfrac:weight + momfrac:raterisk + momfrac:bonemed + momfrac:smoke + smoke:bonemed + raterisk:bonemed_fu,
data = training,
family = "binomial"
)
lda_model <- train(
fracture ~ priorfrac + weight + height + momfrac + raterisk + fracscore + bonemed_fu,
data = training,
method = "lda",
trControl = trainControl(method = "cv", number = 10, classProbs = TRUE, summaryFunction = mnLogLoss),
metric = "logLoss"
)
rf_model <- train(
fracture ~ priorfrac + fracscore + bonemed + height + bonemed_fu + weight + smoke + raterisk,
data = training,
method = "rf",
trControl = trainControl(method = "cv", number = 10, classProbs = TRUE, summaryFunction = mnLogLoss),
tuneGrid = expand.grid(.mtry = 2:6),
ntree = 100,
metric = "logLoss"
)
# Predicting probabilities on the validation data
obj1.predprobs <- predict(logistic_obj1, validate, type = "response")
obj2_log.predprobs <- predict(logistic_obj2, validate, type = "response")
lda.predprobs <- predict(lda_model, validate, type = "prob")[, "Yes"]
rf.predprobs <- predict(rf_model, validate, type = "prob")[, "Yes"]
# Calculate ROC and AUROC
obj1.roc <- roc(response = validate$fracture, predictor = obj1.predprobs, levels = c("No", "Yes"))
obj2_log.roc <- roc(response = validate$fracture, predictor = obj2_log.predprobs, levels = c("No", "Yes"))
lda.roc <- roc(response = validate$fracture, predictor = lda.predprobs, levels = c("No", "Yes"))
rf.roc <- roc(response = validate$fracture, predictor = rf.predprobs, levels = c("No", "Yes"))
# Extract AUROC values
obj1.auc <- round(auc(obj1fit.roc), 3)
obj2_log.auc <- round(auc(obj2_log.roc), 3)
lda.auc <- round(auc(lda.roc), 3)
rf.auc <- round(auc(rf.roc), 3)
finalAUROC = c(obj1.auc, obj2_log.auc, lda.auc, rf.auc)
# Save plot to a PNG file
# png("figuresKH/ROC_Curves_with_AUROC.png", width = 800, height = 600)
# Plot ROC curves
plot(obj1.roc, col = "purple3", lwd = 2, lty = 1, main = "ROC Curves for Candidate Models") #print.thres="best"
plot(obj2_log.roc, col = "green4", lwd = 2, lty = 1, add = TRUE)
plot(lda.roc, col = "red3", lwd = 2, lty = 1, add = TRUE)
plot(rf.roc, col = "blue2", lwd = 2, lty = 1, add = TRUE)
# Add legend with AUROC values
legend("bottomright",
legend = c(
paste("Log Obj1 (AUROC =", finalAUROC[1], ")"),
paste("Log Obj2 (AUROC =", finalAUROC[2], ")"),
paste("LDA Obj2 (AUROC =", finalAUROC[3], ")"),
paste("RF Obj2 (AUROC =", finalAUROC[4], ")")
),
col = c("purple3","green4", "red3", "blue2"),
lwd = 2, lty = c(1, 1, 1, 1), cex = 1, xpd = TRUE, horiz = FALSE)
# Close the graphics device
# dev.off()
Here, I use the above fitted models to plot a bias-variance curve.
# Bias-Variance Tradeoff
# Define pre-fitted models
models <- list(
"Logistic (additive)" = logistic_obj1,
"Logistic (with interactions)" = logistic_obj2,
"LDA" = lda_model,
"Random Forest" = rf_model
)
# Initialize an empty data frame to store errors and metrics
errors <- data.frame()
# Compute errors and metrics
errors <- data.frame()
# Iterate over model indices
for (i in seq_along(models)) {
model_name <- names(models)[i]
model <- models[[model_name]]
# Predict on training and validation sets
if (inherits(model, "train")) {
train_preds <- predict(model, training, type = "prob")[, "Yes"]
val_preds <- predict(model, validate, type = "prob")[, "Yes"]
} else {
train_preds <- predict(model, training, type = "response")
val_preds <- predict(model, validate, type = "response")
}
# Compute metrics
train_error <- logLoss(actual = ifelse(training$fracture == "Yes", 1, 0), predicted = train_preds)
val_error <- logLoss(actual = ifelse(validate$fracture == "Yes", 1, 0), predicted = val_preds)
roc_obj <- roc(response = validate$fracture, predictor = val_preds, levels = c("No", "Yes"))
roc_auc <- as.numeric(auc(roc_obj))
# Store results
errors <- rbind(errors, data.frame(
Model = model_name,
TrainingError = train_error,
ValidationError = val_error,
ROC = finalAUROC[i] # Use the corresponding value from finalAUROC
))
}
# Set factor levels for the models to match the order in the list
errors$Model <- factor(errors$Model, levels = names(models))
# Plot Bias-Variance Tradeoff
bv_plot <- ggplot(errors, aes(x = Model)) +
geom_line(aes(y = TrainingError, color = "Training Log Loss"), group = 1, linewidth = 1) +
geom_line(aes(y = ValidationError, color = "Validation Log Loss"), group = 1, linewidth = 1) +
geom_point(aes(y = TrainingError, color = "Training Log Loss"), size = 2) +
geom_point(aes(y = ValidationError, color = "Validation Log Loss"), size = 2) +
geom_line(aes(y = ROC, color = "AUROC"), group = 1, linewidth = 1, linetype = "dotted") +
geom_point(aes(y = ROC, color = "AUROC"), size = 2) +
labs(
title = "Bias-Variance Tradeoff Across Models",
x = "Model",
y = "Metric",
color = "Metric Type"
) +
scale_x_discrete(labels = function(x) gsub(" \\(", "\n(", x)) +
scale_color_manual(values = c("gray30", "turquoise4", "darkorange3")) +
theme_bw()
print(bv_plot)
ggsave(plot = bv_plot + theme(text = element_text(size = 14)),
filename = "~/Desktop/bv_v2_3by2.png",
width = 9, height = 6, dpi = 200)
This produces a table with the metrics for each model for comparison.
# Function to calculate metrics
calculate_metrics <- function(model_name, model_fit, data_train, data_validate, threshold) {
# Predict probabilities on the validation data
pred_probs <- predict(model_fit, newdata = data_validate, type = "prob")[, "Yes"]
# Generate class predictions based on the threshold
class_preds <- factor(ifelse(pred_probs > threshold, "Yes", "No"), levels = c("No", "Yes"))
# Compute confusion matrix
cm <- confusionMatrix(data = class_preds, reference = data_validate$fracture, positive = "Yes")
# Calculate AUROC
roc_curve <- roc(data_validate$fracture, pred_probs, levels = c("No", "Yes"))
auroc <- auc(roc_curve)
# Compute validation log loss
calculate_logloss <- function(obs, pred_probs) {
epsilon <- 1e-15
pred_probs <- pmax(pmin(pred_probs, 1 - epsilon), epsilon)
-mean(ifelse(obs == "Yes", log(pred_probs), log(1 - pred_probs)))
}
val_logloss <- calculate_logloss(data_validate$fracture, pred_probs)
# Compute CV log loss if available
cv_logloss <- if (!is.null(model_fit$resample$logLoss)) {
mean(model_fit$resample$logLoss, na.rm = TRUE)
} else {
NA
}
# Compute additional metrics
accuracy <- cm$overall["Accuracy"]
sensitivity <- cm$byClass["Sensitivity"]
specificity <- cm$byClass["Specificity"]
ppv <- cm$byClass["Pos Pred Value"]
npv <- cm$byClass["Neg Pred Value"]
f1 <- ifelse(sensitivity + ppv > 0, 2 * (sensitivity * ppv) / (sensitivity + ppv), NA)
prevalence <- cm$byClass["Prevalence"]
# Return metrics as a data frame row
data.frame(
Model = model_name,
AUROC = round(auroc, 3),
Sensitivity = round(sensitivity, 3),
Specificity = round(specificity, 3),
PPV = round(ppv, 3),
NPV = round(npv, 3),
Accuracy = round(accuracy, 3),
F1_Score = round(f1, 3),
Threshold = threshold,
Prevalence = round(prevalence, 3),
Validation_LogLoss = round(val_logloss, 3),
CV_LogLoss = round(cv_logloss, 3)
)
}
# Define models manually
set.seed(1234)
logistic_obj1 <- train(
fracture ~ priorfrac + age + weight + height + momfrac + armassist + raterisk + bonemed,
data = training,
method = "glm",
family = "binomial",
trControl = trainControl(
method = "cv",
number = 10,
classProbs = TRUE,
summaryFunction = mnLogLoss
),
metric = "logLoss"
)
logistic_obj2 <- train(
fracture ~ priorfrac + weight + height + premeno + momfrac + smoke + raterisk + fracscore + bonemed + bonemed_fu +
priorfrac:bonemed + priorfrac:weight + momfrac:raterisk + momfrac:bonemed + momfrac:smoke + smoke:bonemed + raterisk:bonemed_fu,
data = training,
method = "glm",
family = "binomial",
trControl = trainControl(
method = "cv",
number = 10,
classProbs = TRUE,
summaryFunction = mnLogLoss
),
metric = "logLoss"
)
lda_model <- train(
fracture ~ priorfrac + weight + height + momfrac + raterisk + fracscore + bonemed_fu,
data = training,
method = "lda",
trControl = trainControl(
method = "cv",
number = 10,
classProbs = TRUE,
summaryFunction = mnLogLoss
),
metric = "logLoss"
)
rf_model <- train(
fracture ~ priorfrac + fracscore + bonemed + height + bonemed_fu + weight + smoke + raterisk,
data = training,
method = "rf",
trControl = trainControl(
method = "cv",
number = 10,
classProbs = TRUE,
summaryFunction = mnLogLoss
),
tuneGrid = expand.grid(.mtry = 2:6),
ntree = 100,
metric = "logLoss"
)
# Define thresholds for each model
thresholds <- list(
"Logistic Obj1" = 0.19,
"Logistic Obj2" = 0.20,
"LDA" = 0.20,
"Random Forest" = 0.16
)
# Collect metrics for all models
results <- rbind(
calculate_metrics("Logistic Obj1", logistic_obj1, training, validate, thresholds[["Logistic Obj1"]]),
calculate_metrics("Logistic Obj2", logistic_obj2, training, validate, thresholds[["Logistic Obj2"]]),
calculate_metrics("LDA", lda_model, training, validate, thresholds[["LDA"]]),
calculate_metrics("Random Forest", rf_model, training, validate, thresholds[["Random Forest"]])
)
results
Model AUROC Sensitivity Specificity PPV NPV Accuracy
Sensitivity Logistic Obj1 0.735 0.865 0.580 0.405 0.929 0.651
Sensitivity1 Logistic Obj2 0.763 0.811 0.643 0.429 0.911 0.685
Sensitivity2 LDA 0.745 0.838 0.643 0.437 0.923 0.691
Sensitivity3 Random Forest 0.679 0.784 0.446 0.319 0.862 0.530
F1_Score Threshold Prevalence Validation_LogLoss CV_LogLoss
Sensitivity 0.552 0.19 0.248 0.517 0.533
Sensitivity1 0.561 0.20 0.248 0.500 0.616
Sensitivity2 0.574 0.20 0.248 0.515 0.540
Sensitivity3 0.453 0.16 0.248 0.549 0.576
# # Format the results into a gt table
# results_table <- results %>%
# gt() %>%
# tab_header(
# title = "Comparison of Model Performance"
# ) %>%
# cols_label(
# Model = "Model Name",
# AUROC = "AUROC",
# Sensitivity = "Sensitivity",
# Specificity = "Specificity",
# PPV = "PPV",
# NPV = "NPV",
# Accuracy = "Accuracy",
# F1_Score = "F1 Score",
# Threshold = "Threshold",
# Prevalence = "Prevalence",
# Validation_LogLoss = "Validation Log Loss",
# CV_LogLoss = "CV Log Loss"
# )
#
# # Display the table
# print(results_table)
Summary of the findings from Objective 1
Fracture score and maternal history of fracture are statistically and
practically significant predictors of fracture risk. Prior fracture
history and height are also practically significant.
Summary of the findings and recommendations in Objective
2
Complex logistic regression is the best model, with the highest AUROC
(0.76) and lowest validation logloss (0.50). It the best choice for
prediction.
Additive logistic regression offers slightly lower performance but
better interpretability. LDA and Random Forest underperform and are not
recommended without further tuning.
Additional discussion points:
Scope of inference?
This study includes 500 women aged 55 and older, who were attended by
325 physicians across six study sites. It appears that this dataset is a
subset of patients from a larger study, the Global Longitudinal Study of
Osteoporosis in Women (GLOW), which included over 16,000 women at sites
worldwide. Based on the study design, it is likely an observational
study rather than a randomized controlled experiment or a retrospective
study with a controlled distribution of response classes. Due to ethical
and practical considerations, we cannot randomly assign participants to
specific treatments or medical histories. Therefore, we can infer
associations between features and the odds of experiencing a fracture,
but we cannot establish causation or quantify changes in the probability
of fracture.
Additionally, it is unclear whether this sample of women was randomly selected or if participants volunteered. This limitation restricts the generalizability of the findings. The results should be interpreted within the context of the study population, women at the participating study sites within the specified age range. Despite these limitations, the findings offer valuable insights and provide a foundation for further research.
Future Work
Validation: Test models on new, independent data to assess performance
and revalidate at future time points.
Model Updates: Investigate potential modifications if model performance
declines over time.
Leverage GLOW: Explore findings from the Global Longitudinal Study of
Osteoporosis in Women (GLOW) for insights into data and statistical
approaches.
Enhance Random Forest: Further optimize tuning parameters to improve
performance.
Why one method performs better than the other for objective
2?
LDA is primarily useful for numeric data as it assumes the predictors
are multivariate normal. Though categorical variables can be converted
to dummy variables, the metrics may not be reliable when the assumptions
aren’t met. Though Random Forest and other non-parametric methods may be
ultimately useful, I may need to increase both my breadth and depth of
knowledge to both choose the best tool and to tune it optimally. Random
Forest for this small dataset may not be the best choice also because it
may tend to overfit to the limited training data. Logistic regression
seems like our best choice.
Why are one or two metrics more important/informative
compared to the others?
Prioritizing sensitivity ensures identifying most women at risk, while
PPV minimizes unnecessary treatments by ensuring positive
classifications are accurate.
AUROC and logloss are valuable for comparing model performance across
all thresholds, providing a threshold-independent assessment of model
quality.