Project2_Individual

Introduction and Objective Summary

This project aims to understand the risk of bone fractures for women with osteoporosis.

The first objective is to perform an exploratory data analysis and produce a logistic regression model to identify and interpret factors important in determining the probability of suffering a fracture in the first year of study enrollment.

The second objective is to produce three additional models that prioritize prediction performance, including complex logistic regression, discriminant analysis, and non-parametric models.

The models will be compared using appropriate metrics, such as sensitivity, specificity, prevalence, PPV, NPV, AUROC and the logloss error.

This loads the necessary libraries.

# install.packages("aplore3")
library(aplore3) #data set
library(tidyverse) #included ggplot2,dplyr,tidyr,stringr,forcats
library(gt) #tables
library(DataExplorer) #plot_missing()
library(naniar) #vis_miss()
library(Hmisc)
library(RColorBrewer)
library(gridExtra)
library(GGally)
library(corrplot)
library(pheatmap)
library(caret)
library(car) #vif()
library(Metrics)
library(pROC)
library(ResourceSelection)  # For hoslem.test
library(reshape2)
library(sjPlot) #effects plots
library(sjmisc) #effects plots
library(dplyr) #loading last so select() works

This imports and quickly views the data.

# Import the data
bone = glow_bonemed

# View the data
str(bone)

'data.frame':   500 obs. of  18 variables:
 $ sub_id    : int  1 2 3 4 5 6 7 8 9 10 ...
 $ site_id   : int  1 4 6 6 1 5 5 1 1 4 ...
 $ phy_id    : int  14 284 305 309 37 299 302 36 8 282 ...
 $ priorfrac : Factor w/ 2 levels "No","Yes": 1 1 2 1 1 2 1 2 2 1 ...
 $ age       : int  62 65 88 82 61 67 84 82 86 58 ...
 $ weight    : num  70.3 87.1 50.8 62.1 68 68 50.8 40.8 62.6 63.5 ...
 $ height    : int  158 160 157 160 152 161 150 153 156 166 ...
 $ bmi       : num  28.2 34 20.6 24.3 29.4 ...
 $ premeno   : Factor w/ 2 levels "No","Yes": 1 1 1 1 1 1 1 1 1 1 ...
 $ momfrac   : Factor w/ 2 levels "No","Yes": 1 1 2 1 1 1 1 1 1 1 ...
 $ armassist : Factor w/ 2 levels "No","Yes": 1 1 2 1 1 1 1 1 1 1 ...
 $ smoke     : Factor w/ 2 levels "No","Yes": 1 1 1 1 1 2 1 1 1 1 ...
 $ raterisk  : Factor w/ 3 levels "Less","Same",..: 2 2 1 1 2 2 1 2 2 1 ...
 $ fracscore : int  1 2 11 5 1 4 6 7 7 0 ...
 $ fracture  : Factor w/ 2 levels "No","Yes": 1 1 1 1 1 1 1 1 1 1 ...
 $ bonemed   : Factor w/ 2 levels "No","Yes": 1 1 1 1 1 1 1 2 1 1 ...
 $ bonemed_fu: Factor w/ 2 levels "No","Yes": 1 1 1 1 1 1 1 2 1 1 ...
 $ bonetreat : Factor w/ 2 levels "No","Yes": 1 1 1 1 1 1 1 2 1 1 ...

# ?glow_bonemed

Data Description / Processing Summary

The data set consists of 18 variables collected on each of 500 patients. The binary categorical response variable fracture, indicates if a patient had a fracture within the first year of enrollment. Each patient has a unique ID and is assigned to a study site and physician. The other 14 variables relate to the medical history or treatment of the patient. Aside from the patient ID, five of the variables are numeric and ten are categorical. The remaining two, site and physician are integer coded categorical variables. A more detailed data description is in the table below.

Variable ID	Variable Name	Data Type	Description
Bone Health Data Description
sub_id	ID	numeric identifier	Unique Identification Code
site_id	Study Site	categorical integer	Study Site (1 - 6)
phy_id	Physician ID	categorical integer	Physician ID Code (128 unique codes)
priorfrac	Prior Fracture	categorical	History of Prior Fracture (1: No, 2: Yes)
age	Age	numeric	Age at Enrollment (years)
weight	Weight	numeric	Weight at Enrollment (kg)
height	Height	numeric	Height at Enrollment (cm)
bmi	BMI	numeric	Body Mass Index (kg/m^2)
premeno	Early Menopause	categorical	Menopause before Age 45 (1: No, 2: Yes)
momfrac	Mom Fracture	categorical	Mother had Hip Fracture (1: No, 2: Yes)
armassist	Arms Assist	categorical	Arms are needed to stand from a chair (1: No, 2: Yes)
smoke	Smoker	categorical	Former or current smoker (1: No, 2: Yes)
raterisk	Self Reported Risk	categorical	Self-reported risk of fracture (1: Less than others, 2: Same as others, 3: Greater than others)
fracscore	Fracture Risk	numeric	Fracture Risk Score (Composite Risk Score)
fracture	Fracture First Year	categorical	Any fracture in first year (1: No, 2: Yes)
bonemed	Meds at Enrollment	categorical	Bone medications at enrollment (1: No, 2: Yes)
bonemed_fu	Meds at Follow-Up	categorical	Bone medications at follow-up (1: No, 2: Yes)
bonetreat	Meds at Both	categorical	Bone medications both at enrollment and follow-up (1: No, 2: Yes)

This looks for missing data.

# Check for missing data
# If any, what kind of missing are they: mcar, mar, nmar?
missing = bone

# Check for empty strings and if any convert to NA
sum(missing == "")

[1] 0

# Check for NAs in each column
colSums(is.na(missing))

    sub_id    site_id     phy_id  priorfrac        age     weight     height 
         0          0          0          0          0          0          0 
       bmi    premeno    momfrac  armassist      smoke   raterisk  fracscore 
         0          0          0          0          0          0          0 
  fracture    bonemed bonemed_fu  bonetreat 
         0          0          0          0

# Plot the percent missing for PPT
miss_plot = plot_missing(missing)

# Save the plot
# ggsave("missing_data.png", miss_plot)

The data set is complete. There are no missing values.

Exploratory Data Analysis

This outputs the summary statistics, including the five number plus mean summaries of the numerical variables and count summaries of the categorical variable levels. It also notes the prevalence of the positive response class.

# Get summary statistics
summary_stats = summary(bone)
print(summary_stats)

     sub_id         site_id          phy_id       priorfrac      age       
 Min.   :  1.0   Min.   :1.000   Min.   :  1.00   No :374   Min.   :55.00  
 1st Qu.:125.8   1st Qu.:2.000   1st Qu.: 57.75   Yes:126   1st Qu.:61.00  
 Median :250.5   Median :3.000   Median :182.50             Median :67.00  
 Mean   :250.5   Mean   :3.436   Mean   :178.55             Mean   :68.56  
 3rd Qu.:375.2   3rd Qu.:5.000   3rd Qu.:298.00             3rd Qu.:76.00  
 Max.   :500.0   Max.   :6.000   Max.   :325.00             Max.   :90.00  
     weight           height           bmi        premeno   momfrac   armassist
 Min.   : 39.90   Min.   :134.0   Min.   :14.88   No :403   No :435   No :312  
 1st Qu.: 59.90   1st Qu.:157.0   1st Qu.:23.27   Yes: 97   Yes: 65   Yes:188  
 Median : 68.00   Median :161.5   Median :26.42                                
 Mean   : 71.82   Mean   :161.4   Mean   :27.55                                
 3rd Qu.: 81.30   3rd Qu.:165.0   3rd Qu.:30.79                                
 Max.   :127.00   Max.   :199.0   Max.   :49.08                                
 smoke        raterisk     fracscore      fracture  bonemed   bonemed_fu
 No :465   Less   :167   Min.   : 0.000   No :375   No :371   No :361   
 Yes: 35   Same   :186   1st Qu.: 2.000   Yes:125   Yes:129   Yes:139   
           Greater:147   Median : 3.000                                 
                         Mean   : 3.698                                 
                         3rd Qu.: 5.000                                 
                         Max.   :11.000                                 
 bonetreat
 No :382  
 Yes:118

prop_frac = (sum(bone$fracture == 'Yes') / length(bone$fracture)) * 100
print(prop_frac)

[1] 25

The prevalence of the positive response class is 25%.

Here is a table of the summary statistics.

Variable	Min	1st Quartile	Median	Mean	3rd Quartile	Max
Summary Statistics for Numeric Variables
sub_id	1.00	125.80	250.50	250.50	375.20	500.00
site_id	1.00	2.00	3.00	3.44	5.00	6.00
phy_id	1.00	57.75	182.50	178.55	298.00	325.00
age	55.00	61.00	67.00	68.56	76.00	90.00
weight	39.90	59.90	68.00	71.82	81.30	127.00
height	134.00	157.00	161.50	161.40	165.00	199.00
bmi	14.88	23.27	26.42	27.55	30.79	49.08
fracscore	0.00	2.00	3.00	3.70	5.00	11.00

Variable	Binary Variables		Multicategory Variable
Summary Statistics for Categorical Variables
Variable	No	Yes	Less	Same	Greater
priorfrac	374	126
premeno	403	97
momfrac	435	65
armassist	312	188
smoke	465	35
fracture	375	125
bonemed	371	129
bonemed_fu	361	139
bonetreat	382	118
raterisk			167	186	147

This makes separate dataframes for the numerical variables, bone_num and for the categorical variables, bone_cat.

# Filter numerical variables
numVars = sapply(bone, is.numeric)  # outputs a logical
numVars["sub_id"] = FALSE  # set patient ID column to FALSE to exclude it
bone_num = bone[, numVars]

# Filter categorical variables (factors)
bone_cat = bone %>%
  select(where(is.factor))

This converts the response levels to numerical values and creates both box plots of the distribution for each level and LOESS plots of the relationship between the predictor and the response.

# Convert the binary response to No = 0, Yes = 1.
bone$fracture.num = ifelse(bone$fracture == "Yes", 1, 0)
# Sanity check
table(bone$fracture.num, bone$fracture)

# Plots of the response versus each predictor variable
# NUMERICAL VARIABLES

# Function to create LOESS plots for numerical variables and response
make_loess = function(numvar, response, data) {
  data %>% 
    ggplot(aes(x = .data[[numvar]], y = .data[[response]])) +
    geom_point(position = position_jitter(height = 0.05), alpha = 0.6) +
    geom_smooth(formula = "y~x", method = "loess", span = 0.8) +
    ylim(-.2,1.2) +
    labs(title = paste(numvar, "vs fracture"),
         y = "proportion with fracture") +
    theme_bw()
}

# Function to create box plots for numerical variables by response category
make_boxplot <- function(numvar, response, data) {
  data %>%
    ggplot(aes(x = .data[[numvar]], y = .data[[response]], fill = .data[[response]])) +
    geom_boxplot(alpha = 0.7) +
    labs(title = paste(numvar, "vs", response)) +
    scale_fill_manual(values = c("blue3", "red3")) + 
    theme_bw() +
    theme(legend.position = 'none')
}

# Create LOESS plots and boxplots for numerical variables vs "fracture"
for (numvar in names(bone_num)) {
    loess_plt <- make_loess(numvar, "fracture.num", bone)
    boxplt = make_boxplot(numvar, "fracture", bone)
    print(loess_plt)
    print(boxplt)
    # ggsave(filename = paste0("figures/loessPlt_", numvar, ".png"), loess_plt)
    # ggsave(filename = paste0("figures/boxPlt_", numvar, ".png"), boxplt)
}

EDA LOESS and box plots suggest that several numerical predictors have separation and informative trends between the response classes.
- age: Probability of fracture appears to increase with age.
- fracscore: Probability of fracture appears to increase with fracscore.
- height: This variable may be influenced by outliers, a short individual in the fracture group and a tall one in the no fracture group. There is a trend of decreasing fracture rate with increasing height but it appears heavily influenced by those two data points.
- bmi: There doesn’t appear to be a trend in this variable.
- weight: The trend in fracture probability appears flat and the distributions of those with and without fractures are similar.
- site_id and phy_id: These categorical variables coded by integers intuitively seem more like metadata than relevant predictors. If being treated as predictors, they likely should be converted to categorical. However, the box plots suggest little to no difference between the distribution in response classes and a flat trend at 25% probability of fracture, which is the prevalence in the response.

This takes a quick look at the potential height outliers/high leverage points and replots the LOESS and boxplot without those values to re-examine the trend.

# Find indices of rows where height < 140 or height > 190
ht_outlier_idx <- which(bone$height < 140 | bone$height > 190)

# View the rows that match the condition
bone[ht_outlier_idx, ]

    sub_id site_id phy_id priorfrac age weight height      bmi premeno momfrac
281    281       4    284        No  57   70.3    199 17.75208      No      No
430    430       2     76        No  77   74.8    134 41.65739      No      No
    armassist smoke raterisk fracscore fracture bonemed bonemed_fu bonetreat
281        No    No  Greater         0       No      No         No        No
430       Yes    No     Same         6      Yes     Yes        Yes       Yes
    fracture.num
281            0
430            1

bone_noHtOut = bone[-ht_outlier_idx, ]

# Create LOESS plots and boxplots
loess_plt = make_loess("height", "fracture.num", bone_noHtOut)
boxplt = make_boxplot("height", "fracture", bone_noHtOut)
print(loess_plt)

print(boxplt)

# ggsave(filename = paste0("figures/loessPlt_heightNoOut.png"), loess_plt)
# ggsave(filename = paste0("figures/boxPlt_heightNoOut.png"), boxplt)

This creates bar plots for each categorical variable with the proportion of fractures for level.

# CATEGORICAL VARIABLES

# Function to get proportions of each response class for categorical variable levels
summarize_props <- function(catvar, response, data) {
  data %>% 
    group_by(.data[[catvar]], .data[[response]]) %>%
    summarise(cnt = n(), .groups = "drop") %>%
    group_by(.data[[catvar]]) %>%  # Group again to calculate proportions within each level of `catvar`
    mutate(perc = round(cnt / sum(cnt), 4)) %>% 
    ungroup() %>%  # Ensure no lingering groupings
    arrange(.data[[response]], .data[[catvar]])
}

# Function to create bar plots displaying proportion of positive response class for each categorical variable level
make_barplot <- function(catvar, response, prop_data) {
  # Filter rows with "Yes" response for standard variables
  if (catvar == "raterisk") {
    plot_data <- prop_data %>% filter(.data[[response]] == "Yes")
  } else {
    plot_data <- prop_data %>% filter(.data[[response]] == "Yes")
  }
  # Create bar plots
  plot_data %>%
    ggplot(aes(x = reorder(.data[[catvar]], -perc), y = perc, fill = .data[[catvar]])) +
    geom_bar(stat = "identity", show.legend = TRUE) +
    ylab(paste("Proportion with", capitalize(response))) +
    xlab(catvar) +
    scale_fill_brewer(palette = "Dark2") +
    theme_bw() +
    theme(legend.position = "none")
}

# Call the functions to calculate proportions and make bar plots for categorical variables
for (catvar in names(bone_cat)[names(bone_cat) != "fracture"]) {
  # Get proportion summaries
  prop_summary <- summarize_props(catvar, "fracture", bone)
  # Create bar plot
  barplt <- make_barplot(catvar, "fracture", prop_summary)
  # Display the plot
  print(barplt)
  # Save the plot
  # ggsave(filename = paste0("figures/barPlt_", catvar, ".png"), barplt)
}

# For debugging:
# prop.summary<-bone %>% 
#   group_by(priorfrac,fracture) %>%
#   summarise(cnt=n()) %>%
#   mutate(perc=round(cnt/sum(cnt),4))%>%
#   arrange(desc(perc))
# prop.summary

EDA bar plots illustrate the categorical predictors that have different proportions of fractures between levels include the following variables.
- priorfrac: Those with a history of prior fracture have approximately a 40% chance of fracture, while those without have less than half the chance at ~ 20%.
- armassist: Those who need support getting out of a chair have over ~30% chance compared to ~20% for those who don’t.
- momfrac: Those whose mother suffered from a hip fracture have over ~35% chance of fracture compared to less than ~25% for others.
- bonemed, bonemed_fu, bonetreat: Those on bone medications in various phases of the study have roughly a 35% chance of fracture versus roughly 20% for those not on meds. This variable is a bit counterintuitive. I suspect it is somewhat confounding. One shouldn’t assume that the meds may by themselves be associated with an increased risk, but rather they are at increased risk likely because of other factors, which is the reason they were prescribed the medications.
- raterisk: The proportion with fracture decreases with decreasing self perceived risk.
- smoke: There is only ~5% difference in percent fracture of those who smoke and those who don’t.
- premeno: The fracture rates look nearly the same regardless of the age at menopause onset.

Based on graphical evidence, the variables that I think are the most likely to influence the probability of fracture are age, fracscore, and priorfrac. Variables which may be useful are armassist, momfrac, raterisk, and height. bonemed, bonemed_fu, and bonetreat may be useful but confounding. The variables which seem unimportant are site_id, phy_id, weight, bmi, premeno, and smoke.

This runs a PCA to examine the influential explanatory variables in each of the PCs and for use in a scatterplot matrix. Also, it is just for practice.

str(bone_num)

'data.frame':   500 obs. of  7 variables:
 $ site_id  : int  1 4 6 6 1 5 5 1 1 4 ...
 $ phy_id   : int  14 284 305 309 37 299 302 36 8 282 ...
 $ age      : int  62 65 88 82 61 67 84 82 86 58 ...
 $ weight   : num  70.3 87.1 50.8 62.1 68 68 50.8 40.8 62.6 63.5 ...
 $ height   : int  158 160 157 160 152 161 150 153 156 166 ...
 $ bmi      : num  28.2 34 20.6 24.3 29.4 ...
 $ fracscore: int  1 2 11 5 1 4 6 7 7 0 ...

# Run PCA on the numeric variables
pc.result<-prcomp(bone_num[, -(1:2)],scale.=TRUE) # remove site_id and phy_id

#Eigen Vectors
pc.result$rotation

                 PC1         PC2         PC3         PC4          PC5
age        0.4947219  0.46742140 -0.15246583  0.71654567 -0.009160237
weight    -0.5273035  0.46578775 -0.08840991  0.03240244 -0.704362523
height    -0.2345770 -0.08196149 -0.93823245  0.01885633  0.240042129
bmi       -0.4741030  0.51615173  0.24533677  0.05137380  0.667820563
fracscore  0.4442985  0.53984137 -0.16872342 -0.69463484  0.013601399

# Append the diagnosis to the new dataset of PCs for each observation
pc.scores = pc.result$x
pc.scores = data.frame(fracture=bone$fracture, pc.scores)
head(pc.scores)

  fracture        PC1        PC2         PC3        PC4          PC5
1       No -0.7166719 -0.8721532  0.82346719  0.2202093 -0.001884542
2       No -1.4516177  0.4570325  0.56013019  0.2704474  0.011489794
3       No  3.7564133  1.4509041 -0.35122087 -0.5973179 -0.020109479
4       No  1.5951482  0.4377183 -0.19760955  0.6571483 -0.009795756
5       No -0.5773744 -0.8020883  1.79077290  0.1290996  0.013233417
6       No  0.2086149 -0.2335424  0.02618618 -0.2285344  0.005830377

# Plot PC1 and PC2 color coded by diagnosis
ggplot(data = pc.scores, aes(x = PC1, y = PC2, color = fracture)) + 
  geom_point() +
  scale_color_manual(values = c("blue","red"))

# Plot the first few PCs in a scatterplot matrix
ggpairs(
  pc.scores,
  aes(color = fracture, alpha = 0.5), 
  columns = c(2:6)) +
  scale_color_manual(values = c("blue","red")) +
  scale_fill_manual(values = c("blue", "red"))

I can’t take much away from this PCA. However, what it seems to suggest is that, unsurprisingly, the categorical predictors are very important in separating the response classes. Also, though I haven’t looked very deeply into the possible outliers in the height variable, height seems unimportant in likelihood of fracture. PC3 is heavily influenced by height, and both PC3 and PC5 have virtually no separation in fracture class distributions. Across the other PCs, height has small loadings and seems unimportant.

Here, I’d like to make some plots to get a sense of potential multicollinearity and/or interactions. First, I plot each numerical and categorical variable combination.

# Look at co-variation between numerical and categorical variables

# Function to create LOESS plots with color split by categorical predictor
make_loess_split <- function(numvar, catvar, response, data) {
  data %>%
    ggplot(aes(x = .data[[numvar]], y = .data[[response]], colour = .data[[catvar]])) +
    geom_point(position = position_jitter(height = 0.05), alpha = 0.6) +  # Add jitter to points
    geom_smooth(method = "loess", formula = "y ~ x", linewidth = 1, span = 0.9) +  # Add LOESS curve
    ylim(-0.3, 1.3) +  # Set consistent y-limits
    labs(title = paste(numvar, "vs", response, "by", catvar),
         x = numvar,
         y = response,
         colour = catvar) +
    scale_color_brewer(palette = "Dark2") +
    theme_bw()
}

# Loop through numerical and categorical predictors
for (numvar in names(bone_num)) {
  for (catvar in names(bone_cat)[names(bone_cat) != "fracture"]) {
    # Create LOESS plot
    loess_plt <- make_loess_split(numvar, catvar, "fracture.num", bone)
    
    # Display the plot
    print(loess_plt)
    
    # Save the plot
    # ggsave(filename = paste0("figures/loessPltCombo_", numvar, "_by_", catvar, ".png"), loess_plt)
  }
}

LOESS plots looking at potential multicollinearity and/or interactions and relationships between pairs of numerical and categorical predictors and the response highlight the following.
- age: Generally, there is not an obvious difference in trends in relationship of age to fracture rate among most of the categorical predictors. There is potentially some interaction in priorfrac. Those with no history of fracture have an increasing trend in rate of fracture with increasing age, but those with a prior fracture have a stable rate of fracture across all ages. Also of note, there are no subjects above 75 who started menopause before they were 45. All of the women between 75 and 90 were over 45 entering menopause. This suggests some multicollinearity, because the age range determines the premeno group.
- fracscore: age and fracscore are highly correlated. Possible interaction between priorfrac and fracscore. With history of fracture, little change in fracscore. With no prior history, increasing rate with increasing fracscore. Seems intuitive. No women with with normal menopause age with fracscore > 7. Fracscore < 2 all no armassist; >7 all with armassist. This suggests multicollinearity between fracscore and premeno and fracscore and armassist.
- height: Possible interaction between smoking and height. Smoking alone doesn’t seem like it has much influence on fracture probability. Height minus the two possible ouliers also doesn’t look very important.
- bmi: Several categorical variables may have differing trend in fracture probability between levels across different bmis. Particularly in bone med variables, raterisk, and priorfrac.
- weight: Weight and BMI are highly correlated. Several categorical variables may have differing trend in fracture probability between levels across different weights. Particularly in bone med variables and priorfrac.
- site_id and phy_id: The lack of distinct trend divergence suggests no interaction between site id or physician id and any of the categorical variables. Any differences in the rate of fractures between categorical levels seem independent of site or physician.

Based on graphical evidence, the variables that I think are safe to drop from the analysis are site_id and phy_id. There are quite a number of potential interactions between numerical and categorical variables to explore: age/fracscore:priorfrac, weight/bmi with other categorical (meds/treatments, raterisk, priorfrac), height:smoke.

Additionally, there are multicollinearity concerns between age/fracscore/premeno and fracscore/armassist.

Two variables are essential interaction terms of other combinations of variable. bonetreat is a combination of bonemed and bonemed_fu. bmi is weight (\(kg\)) divided by height squared (\(m^2\)).

Next, I plot the proportion of fractures in each pairwise combination of categorical predictors.

make_barchart <- function(var1, var2) {
  bone_cat %>%
    group_by(.data[[var1]], .data[[var2]], fracture) %>%  # Group by predictors and response
    summarise(n = n(), .groups = "drop") %>%  # Count occurrences for each group
    group_by(.data[[var1]], .data[[var2]]) %>%  # Group by predictors only
    mutate(total = sum(n),  # Calculate the total count for each group
           prop = n / total,  # Calculate the proportion
           label = paste0(n, "/", total)) %>%  # Create the label as "count/total"
    filter(fracture == "Yes") %>%  # Keep only proportions for "Yes" fractures
    ggplot(aes(x = .data[[var2]], y = prop, fill = .data[[var1]])) +  # x-axis: var2, fill by var1
    geom_bar(stat = "identity", position = "dodge") +  # Use dodge position for comparison
    geom_text(aes(label = label),  # Add the "count/total" label
              position = position_dodge(width = 0.9),
              vjust = -0.5, size = 3) +  # Adjust vertical position and size
    facet_wrap(~fracture, scales = "free", labeller = labeller(fracture = c("Yes" = "Yes Fracture"))) +
    labs(title = paste("Proportion of Fractures (Yes) by", var1, "and", var2),
         x = var2,
         y = "Proportion with Fractures (Yes)",
         fill = var1) +
    theme_bw() +
    theme(axis.text.x = element_text(angle = 45, hjust = 1)) +
    scale_fill_brewer(palette = "Dark2")
}

catpredictorNames <- names(bone_cat %>% select(-fracture))  # Exclude the response variable

# Loop through all pairs of categorical predictors
for (i in 1:(length(catpredictorNames) - 1)) {
  var1 <- catpredictorNames[i]
    
  for (j in (i + 1):length(catpredictorNames)) {
    var2 <- catpredictorNames[j]
    
    # Generate the bar plot
    barplt <- make_barchart(var1, var2)
    
    # Display the plot
    print(barplt)
    
    # Save the plot
    filename <- paste0("figures/barpltCombo_", var1, "_vs_", var2, ".png")
    # ggsave(filename, plot = barplt, device = "png")
  }
}

For potential interactions, I looked at the patterns between the bars compared across category levels.

Interactions seem likely in the following variables.
- momfrac: armassist, raterisk, bone medication variables
- smoke: raterisk, bone medication variables
- raterisk: bone medication variables
- priorfrac: premeno
Though the visual evidence is not super convincing, it is worth considering interactions between these variables.
- priorfrac: momfrac, bone medication variables
- premeno: raterisk
- momfrac: smoke
Interactions seem unlikely:
- priorfrac: armassist, smoke, raterisk
- premeno: momfrac, armassist, smoke, bone medication variables
- armassist: smoke, raterisk, bone medication variables

There are some variables that may confound one another. Based on the visual evidence, membership in one variables group tells you something about the relationship with the other variable.

Consider these for potential confounders:
- raterisk & smoke
- bone medication variables
Considerations for bonemed/bonemed_fu/bonetreat: Proportions of fracture aren’t the same across levels of bonemed and bonemed_fu. If bonemed_fu had no effect you would expect the no/yes bar pattern of bonemed to be the same for the levels of bonemed_fu. And if bonemed had no effect, you would expect the no bonemed_fu bars to be the same and the yes bars to be the same.
- This suggests that women who never took meds have lower prop of fractures (or lower risk). If they didn’t take meds before but started, their rate is highest. If they took meds but stopped, they also have a high proportion. Consistent medication use before and after is a moderate rate.

Here, I create pairwise scatterplots of the numerical predictors split by response group.

# Function to create scatter plots for numerical variables with grouping by fracture
make_scatterplot <- function(var1, var2, group_var) {
  bone %>%
    ggplot(aes(x = .data[[var1]], y = .data[[var2]], color = .data[[group_var]])) +
    geom_point(alpha = 0.5, position = position_jitter(width = 0.2, height = 0.2)) +  # Jitter for visibility
    geom_smooth(formula = "y~x", method = "loess", se = TRUE) +  # Add LOESS smoothing line
    labs(
      title = paste(var1, "vs", var2, "by", group_var),
      x = var1,
      y = var2,
      color = group_var
    ) +
    scale_color_manual(values = c("blue3", "red3")) + 
    theme_bw() +
    
    theme(legend.position = "bottom")  # Adjust legend position for clarity
}

# Get the names of numerical variables
num_var_names <- names(bone_num)

# Loop through each pair of numerical variables
for (i in 1:(length(num_var_names) - 1)) {
  var1 <- num_var_names[i]
  
  for (j in (i + 1):length(num_var_names)) {
    var2 <- num_var_names[j]
    
    # Create scatter plot
    scatter <- make_scatterplot(var1, var2, group_var = "fracture")
    
    # Display the plot
    print(scatter)
    
    # Save the plot
    filename <- paste0("figures/scatterCombo_", var1, "_vs_", var2, ".png")
    # ggsave(filename, plot = scatter, device = "png")
  }
}

From the paired scatterplots, there is no strong evidence of interactions between numerical variables. The trend between the fracture groups look consistent. However, there are some multicollinearity concerns between age and fracscore and weight and bmi.

Lastly, I confirm multicollinearity suspicions among numerical predictors by creating a correlation and scatterplot matrix.

# This makes a correlation matrix of the numerical variables to confirm suspicions of multicollinearity issues indicated by the scatterplot matrices.  

# Calculate correlation matrix for numeric variables
cor_matrix = cor(bone_num)

# Save the correlation plot
# png("figures/correlation_matrix.png", width = 10, height = 8, units = "in", res = 300)

# Visualize the correlation matrix
corrplot(cor_matrix, method = "color", type = "upper", 
         tl.col = "black", tl.srt = 45, 
         addCoef.col = "black", number.cex = 0.7)

# dev.off()  # Close the device to finalize the file for saving


# Combine numerical predictors with the fracture variable
bone_num_with_fracture <- cbind(bone_num, fracture = bone$fracture)
bone_num_with_fracture$fracture <- factor(bone_num_with_fracture$fracture, levels = c("Yes", "No"))

# Create scatterplot matrix
scatterplot_matrix <- ggpairs(
  bone_num_with_fracture,
  aes(color = fracture),
  lower = list(
    continuous = wrap("smooth_loess", se = FALSE, alpha = 0.4)  # Scatterplot with LOESS lines
  ),
  diag = list(
    continuous = wrap("densityDiag", alpha = 0.6)  # Density plots on the diagonal
  )) +
  scale_color_brewer(palette = 'Set1') + 
  scale_fill_brewer(palette = 'Set1')
scatterplot_matrix

# Optionally save the plot
# ggsave(filename = "figures/scatterplot_matrix.png", plot = scatterplot_matrix, width = 10, height = 8, dpi = 300)

Multicollinearity concerns:
There is high correlation between bmi and weight and between age and fracscore. Consider the VIFs or consider only including one variable from each pair in the model.

Possible interactions:
The pairwise scatterplot of numerical variable do not suggest there is different behavior between the response classes. The relationships look very similar.

Summarizing EDA:

Variables likely to influence the probability of fracture:
- age, fracscore, and priorfrac
Variables which may be useful:
- armassist, momfrac, raterisk, and height; bonemed, bonemed_fu, and bonetreat may be useful but confounding
Variables which seem unimportant:
- site_id, phy_id, weight, bmi, premeno, and smoke
Variables to drop from the analysis:
- site_id and phy_id
Potential interactions to explore:
- momfrac: armassist, raterisk, bone medication variables
- smoke: height, momfrac,raterisk, bone medication variables
- raterisk: premeno, bone medication variables
- priorfrac: premeno, momfrac, bone medication variables
- age/fracscore: priorfrac
- weight/bmi: meds/treatments, raterisk, priorfrac
Consider these for potential confounders:
- raterisk & smoke
- bone medication variables: bonetreat is an interaction of bonemed and bonemed_fu
- age, fracscore, premeno
- fracscore & armassist
- bmi & weight: bmi is weight (\(kg\)) divided by height squared (\(m^2\)).
Also, consider extreme values in height when building models.

Objective 1: Logistic Regression Model for Interpretation

For the rest of the model fitting and analysis, I will split the data into a training and validation set.

bone_subset = bone %>% select(-sub_id, -fracture.num) # remove columns prior to feature selection

#Train/Validation Split (team decision to use 70/30 split and seed `1234`)
set.seed(1234) #consider trying different seeds/splits to get a feel for how stable the models are
trainIndex<-createDataPartition(bone$fracture,p=.7,list=F) #p: proportion of data in train; original models built with .8

training<-bone_subset[trainIndex,]
validate<-bone_subset[-trainIndex,]

Here, I use penalized regression for feature selection and stepwise feature selection. I’d like to get a sense of the important predictors. I chose to remove the metadata variables site_id and phy_id and the interaction variables bmi and bonetreat. I commented out the code for stepwise selection, so the output would be less overwhelming.

# logLoss error metric
fitControl<-trainControl(method="repeatedcv",number=10,repeats=1,classProbs=TRUE, summaryFunction=mnLogLoss)

# GLMNET
set.seed(1234)
glmnet.fit<-train(fracture ~ . -site_id -phy_id -bmi -bonetreat,
                    data=training,
                    method="glmnet",
                    trControl=fitControl,
                    metric="logLoss"
)
coef(glmnet.fit$finalModel,glmnet.fit$finalModel$lambdaOpt)

14 x 1 sparse Matrix of class "dgCMatrix"
                s=0.02283267
(Intercept)       0.74937631
priorfracYes      0.44011362
age               .         
weight            .         
height           -0.01645137
premenoYes        .         
momfracYes        0.51513183
armassistYes      .         
smokeYes          .         
rateriskSame      .         
rateriskGreater   .         
fracscore         0.13853960
bonemedYes        .         
bonemed_fuYes     0.17244372

# includes priorfrac, height, momfrac, fracscore, bonemed_fu

# LASSO
set.seed(1234)
lasso.fit<-train(fracture ~ . -site_id -phy_id -bmi -bonetreat,
                    data=training,
                    method="glmnet",
                    trControl=fitControl,
                    metric="logLoss",
                    tuneGrid = expand.grid(alpha = 1,  # Lasso penalty
                                          lambda = seq(0.01, 10, length = 200))  # Range of lambda values to test
)
coef(lasso.fit$finalModel,lasso.fit$finalModel$lambdaOpt)

14 x 1 sparse Matrix of class "dgCMatrix"
                s=0.06020101
(Intercept)       -1.5253090
priorfracYes       0.1442626
age                .        
weight             .        
height             .        
premenoYes         .        
momfracYes         .        
armassistYes       .        
smokeYes           .        
rateriskSame       .        
rateriskGreater    .        
fracscore          0.1009059
bonemedYes         .        
bonemed_fuYes      .

# includes only priorfrac and fracscore

#STEPWISE
# set.seed(1234)
# #note CV and error metric are not really used here, but logLoss is reported for the final model.
# Step.fit<-train(fracture ~ . -site_id -phy_id -bmi -bonetreat,
#                     data=training,
#                     method="glmStepAIC",
#                     trControl=fitControl,
#                     metric="logLoss")
# coef(Step.fit$finalModel)
# # priorfrac, age, height, momfrac, armassist


# AUROC Curve metric code: twoClassSummary and "ROC" options
fitControl2<-trainControl(method="repeatedcv",number=10,repeats=1,classProbs=TRUE, summaryFunction=twoClassSummary)

# LASSO
set.seed(1234)
lassoroc.fit<-train(fracture ~ . -site_id -phy_id -bmi -bonetreat,
                    data=training,
                    method="glmnet",
                    trControl=fitControl2,
                    metric="ROC",
                    tuneGrid = expand.grid(alpha = 1,   # Lasso penalty
                                          lambda = seq(0.01, 1, length = 200))  # Range of lambda values to test
)
coef(lassoroc.fit$finalModel,lassoroc.fit$finalModel$lambdaOpt)

14 x 1 sparse Matrix of class "dgCMatrix"
                s=0.07964824
(Intercept)     -1.371835977
priorfracYes     0.002748291
age              .          
weight           .          
height           .          
premenoYes       .          
momfracYes       .          
armassistYes     .          
smokeYes         .          
rateriskSame     .          
rateriskGreater  .          
fracscore        0.072403610
bonemedYes       .          
bonemed_fuYes    .

# priorfrac, fracscore

# STEPWISE
# set.seed(1234)
# Steproc.fit<-train(fracture ~ . -site_id -phy_id -bmi -bonetreat,
#                     data=training,
#                     method="glmStepAIC",
#                     trControl=fitControl2,
#                     metric="ROC")
# coef(Step.fit$finalModel)
# # priorfrac, age, height, momfrac, armassist

Originally, I ran the feature selection code above with all variables except the patient id. All variables in the methods above, 3 using a logloss error metric and 2 using AUROC produced these models:

GLMNET which includes all variables except phy_id, weight, fracscore
LASSO (logloss) removes phy_id, weight, smokeYes, bonemedYes, bonetreatYes
LASSO (ROC) includes priorfracYes, height, momfracYes, fracscore, bonemed_fuYes
Stepwise (logloss) includes priorfracYes, age, weight, bmi, momfracYes, bonemedYes, bonemed_fuYes, bonetreatYes
Stepwise (ROC) includes priorfracYes, age, weight, bmi, momfracYes, bonemedYes, bonemed_fuYes, bonetreatYes
Frequency of variables in the feature selection models:
- 5, common to all: priorfrac, momfrac and bonemed_fu
- 4: age, bmi
- 3: height, bonemed, bonetreat
- 2: weight, premeno, armassist raterisk, fracscore
- 1: smoke

After additional thought, I removed the site and physician id variables and bmi and bonetreat and reran the feature selection.

GLMNET includes priorfrac, height, momfrac, fracscore, bonemed_fu
LASSO (logloss & ROC) includes only priorfrac and fracscore
Stepwise (logloss & ROC) includes priorfrac, age, height, momfrac, armassist
Frequency of variables in the feature selection models:
- 3, common to all: priorfrac
- 2: height, momfrac, fracscore
- 1: bonemed_fu, age, armassist

Here, I explore heatmap clustering mostly just for practice. With so few numerical predictors, I don’t expect it to be very useful for this analysis.

# Just playing around with making a heatmap of the numerical variables

str(bone_num_with_fracture)

'data.frame':   500 obs. of  8 variables:
 $ site_id  : int  1 4 6 6 1 5 5 1 1 4 ...
 $ phy_id   : int  14 284 305 309 37 299 302 36 8 282 ...
 $ age      : int  62 65 88 82 61 67 84 82 86 58 ...
 $ weight   : num  70.3 87.1 50.8 62.1 68 68 50.8 40.8 62.6 63.5 ...
 $ height   : int  158 160 157 160 152 161 150 153 156 166 ...
 $ bmi      : num  28.2 34 20.6 24.3 29.4 ...
 $ fracscore: int  1 2 11 5 1 4 6 7 7 0 ...
 $ fracture : Factor w/ 2 levels "Yes","No": 2 2 2 2 2 2 2 2 2 2 ...

set.seed(1234)
trainIndex<-createDataPartition(bone_num_with_fracture$fracture,p=.8,list=F)  #p: proportion of data in train
dat.train<-bone_num_with_fracture[trainIndex,]
dat.val<-bone_num_with_fracture[-trainIndex,]

#Seperating predictors from response
dat.train.x <- dat.train[,1:7]
dat.train.y <- dat.train$fracture

#Using smaller set of predictors to show clustering results changing
pheatmap(dat.train.x[,3:7],fontsize_row=3,annotation_row=data.frame(fracture=dat.train.y),scale="column",legend=T,color=colorRampPalette(c("blue3","white", "red3"), space = "rgb")(100),annotation_colors=list(fracture=c("No"="white","Yes"="darkgreen")))

Here, I fit a bunch of models based on the EDA, feature selection, and intuition. I compare the number of predictors, logloss error from both the k-fold cross-validation and validation set, AUROC, sensitivity, specificity, PPV, NPV, and prevalence based on a threshold calculated to achieve the highest sum of sensitivity and specificity.

# List of model formulas
models <- list(
  # these are the models I worked on with my first feature selection
  # fit5 = fracture ~ priorfrac + momfrac + bonemed_fu,
  # fit54a = fracture ~ priorfrac + momfrac + bonemed_fu + age,
  # fit54b = fracture ~ priorfrac + momfrac + bonemed_fu + bmi,
  # fit54 = fracture ~ priorfrac + momfrac + bonemed_fu + age + bmi,
  fit543a = fracture ~ priorfrac + momfrac + bonemed_fu + age + bmi + height,
  # fit543b = fracture ~ priorfrac + momfrac + bonemed_fu + age + bmi + bonemed,
  # fit543c = fracture ~ priorfrac + momfrac + bonemed_fu + age + bmi + bonetreat,
  # fit543ca = fracture ~ priorfrac + momfrac + bonemed_fu + age + bmi + height + bonetreat,
  fit5432c1 = fracture ~ priorfrac + momfrac + bonemed_fu + age + bmi + height + bonetreat + armassist,
  # fit5432e2 = fracture ~ priorfrac + momfrac + bonemed_fu + age + bmi + bonetreat + fracscore,
  # fit5432ce2 = fracture ~ priorfrac + momfrac + bonemed_fu + age + bmi + bonetreat + armassist + fracscore, # high VIF
  # fit5432be2 = fracture ~ priorfrac + momfrac + bonemed_fu + age + bmi + bonetreat + premeno + fracscore,
  fit5432de2 = fracture ~ priorfrac + momfrac + bonemed_fu + age + bmi + bonetreat + raterisk + fracscore,
  fit5432bde2 = fracture ~ priorfrac + momfrac + bonemed_fu + age + bmi + bonetreat + premeno + raterisk + fracscore,
  # fit5432bcde2 = fracture ~ priorfrac + momfrac + bonemed_fu + age + bmi + bonetreat + premeno + armassist + raterisk + fracscore,
  # fit54321abcde2 = fracture ~ priorfrac + momfrac + bonemed_fu + age + bmi + bonetreat + premeno + armassist + raterisk + fracscore + smoke, # high VIF
  # fitglm = fracture ~ priorfrac + momfrac + bonemed_fu + age + bmi + height + bonemed + bonetreat + premeno + armassist + raterisk  + smoke,
  # fitstep = fracture ~ priorfrac + age + weight + bmi + momfrac + bonemed + bonemed_fu + bonetreat,
  fitlassoroc = fracture ~ priorfrac + height + momfrac + fracscore + bonemed_fu,
  fitlasso = fracture ~ priorfrac + momfrac + bonemed_fu + age + bmi + height + premeno + armassist + raterisk + smoke,
  # these I decided to try after thinking more about the variables that should be excluded
  # fit_all = fracture ~ . -site_id -phy_id -bmi -bonetreat, # high VIF
  fit_noage = fracture ~ . -site_id -phy_id -bmi -bonetreat -age, # reduce multicollinearity
  fit_noscore = fracture ~ . -site_id -phy_id -bmi -bonetreat -fracscore, # reduce multicollinearity
  fit_fsnoage = fracture ~ priorfrac + height + momfrac + fracscore + bonemed_fu + armassist, # 2nd feature sel model no age,
  fit_fsnoscore = fracture ~ priorfrac + height + momfrac + bonemed_fu + age + armassist # 2nd feature sel model no age
)

 # Set seed for reproducibility
set.seed(1234)

# Create a list to store CV results for each model
cv_results <- lapply(names(models), function(model_name) {
  model_formula <- models[[model_name]]
  train(
    model_formula,
    data = training,
    method = "glm",
    family = "binomial",
    trControl = fitControl,
    metric = "logLoss"
  )
})

# Predict probabilities and calculate metrics for validation set
results <- lapply(seq_along(models), function(i) {
  model_name <- names(models)[i]
  model_formula <- models[[model_name]]

  # Count the number of predictors in the model
  num_predictors <- length(all.vars(model_formula)) - 1  # Exclude the response variable

  # Fit the model on the training data
  model <- glm(model_formula, data = training, family = "binomial")

  # Predict probabilities on the validation set
  pred_probs <- predict(model, newdata = validate, type = "response")

  # Calculate AUC
  roc_obj <- pROC::roc(response = validate$fracture, predictor = pred_probs, levels = c("No", "Yes"))
  auc_value <- as.numeric(pROC::auc(roc_obj))

  # Calculate LogLoss for validation
  logloss <- Metrics::logLoss(actual = ifelse(validate$fracture == "Yes", 1, 0), predicted = pred_probs)

  # Extract CV LogLoss
  cv_logloss <- min(cv_results[[i]]$results$logLoss)

  # Calculate VIF
  vifs <- vif(model)
  max_vif <- max(vifs, na.rm = TRUE)

  # Perform Hosmer-Lemeshow test with dynamically adjusted groups
  n_groups <- min(10, floor(nrow(training) / 5))  # Ensure at least 5 observations per group
  hoslem_test <- tryCatch(
    hoslem.test(model$y, fitted(model), g = n_groups),
    error = function(e) NULL
  )
  hoslem_pval <- if (!is.null(hoslem_test)) hoslem_test$p.value else NA

  # Get the best thresholds using Youden's method
  youden_coords <- tryCatch(
    coords(roc_obj, "best", ret = c("threshold", "sensitivity", "specificity"), best.method = "youden", transpose = TRUE),
    error = function(e) NULL
  )

  # Initialize thresholds and metrics
  best_threshold <- NA
  sensitivity <- NA
  specificity <- NA

  # Process Youden's Index result
  if (!is.null(youden_coords)) {
    if (is.atomic(youden_coords)) {
      # Single result
      best_threshold <- youden_coords["threshold"]
      sensitivity <- youden_coords["sensitivity"]
      specificity <- youden_coords["specificity"]
    } else if (nrow(youden_coords) > 1) {
      # Tied thresholds: Select the most balanced sensitivity/specificity
      youden_coords$balance <- abs(youden_coords$sensitivity - youden_coords$specificity)
      best_coords <- youden_coords[which.min(youden_coords$balance), ]
      best_threshold <- best_coords["threshold"]
      sensitivity <- best_coords["sensitivity"]
      specificity <- best_coords["specificity"]
    }
  }

  # Fallback to closest.topleft if Youden's fails
  if (is.na(best_threshold)) {
    closest_coords <- tryCatch(
      coords(roc_obj, "best", ret = c("threshold", "sensitivity", "specificity"), best.method = "closest.topleft", transpose = TRUE),
      error = function(e) NULL
    )
    if (!is.null(closest_coords)) {
      best_threshold <- closest_coords["threshold"]
      sensitivity <- closest_coords["sensitivity"]
      specificity <- closest_coords["specificity"]
    }
  }

  # Calculate additional metrics: PPV, NPV, Prevalence
  ppv <- npv <- prevalence <- NA
  if (!is.na(best_threshold)) {
    predicted_class <- ifelse(pred_probs >= best_threshold, "Yes", "No")
    confusion <- table(Predicted = predicted_class, Actual = validate$fracture)
    ppv <- ifelse("Yes" %in% rownames(confusion), confusion["Yes", "Yes"] / sum(confusion["Yes", ]), NA)
    npv <- ifelse("No" %in% rownames(confusion), confusion["No", "No"] / sum(confusion["No", ]), NA)
  }
  prevalence <- mean(validate$fracture == "Yes")

  # Return metrics
  list(
    Model = model_name,
    AUROC = auc_value,
    LogLoss_Validation = logloss,
    LogLoss_CV = cv_logloss,
    MaxVIF = max_vif,
    HoslemPValue = hoslem_pval,  # Add Hosmer-Lemeshow p-value
    Sensitivity = sensitivity,
    Specificity = specificity,
    PPV = ppv,
    NPV = npv,
    BestThreshold = best_threshold,
    NumPredictors = num_predictors,
    Prevalence = prevalence
  )
})

Warning in coords.roc(roc_obj, "best", ret = c("threshold", "sensitivity", :
'transpose=TRUE' is deprecated. Only 'transpose=FALSE' will be allowed in a
future version.
Warning in coords.roc(roc_obj, "best", ret = c("threshold", "sensitivity", :
'transpose=TRUE' is deprecated. Only 'transpose=FALSE' will be allowed in a
future version.
Warning in coords.roc(roc_obj, "best", ret = c("threshold", "sensitivity", :
'transpose=TRUE' is deprecated. Only 'transpose=FALSE' will be allowed in a
future version.
Warning in coords.roc(roc_obj, "best", ret = c("threshold", "sensitivity", :
'transpose=TRUE' is deprecated. Only 'transpose=FALSE' will be allowed in a
future version.
Warning in coords.roc(roc_obj, "best", ret = c("threshold", "sensitivity", :
'transpose=TRUE' is deprecated. Only 'transpose=FALSE' will be allowed in a
future version.
Warning in coords.roc(roc_obj, "best", ret = c("threshold", "sensitivity", :
'transpose=TRUE' is deprecated. Only 'transpose=FALSE' will be allowed in a
future version.
Warning in coords.roc(roc_obj, "best", ret = c("threshold", "sensitivity", :
'transpose=TRUE' is deprecated. Only 'transpose=FALSE' will be allowed in a
future version.
Warning in coords.roc(roc_obj, "best", ret = c("threshold", "sensitivity", :
'transpose=TRUE' is deprecated. Only 'transpose=FALSE' will be allowed in a
future version.
Warning in coords.roc(roc_obj, "best", ret = c("threshold", "sensitivity", :
'transpose=TRUE' is deprecated. Only 'transpose=FALSE' will be allowed in a
future version.
Warning in coords.roc(roc_obj, "best", ret = c("threshold", "sensitivity", :
'transpose=TRUE' is deprecated. Only 'transpose=FALSE' will be allowed in a
future version.

# Convert the results into a data frame for easier reading
results_df <- do.call(rbind, lapply(results, as.data.frame))

# Sort the results
results_df_sorted <- results_df[order(-results_df$AUROC, results_df$LogLoss_Validation), ]

# Print the sorted data frame
print(results_df_sorted)

                     Model     AUROC LogLoss_Validation LogLoss_CV   MaxVIF
sensitivity6     fit_noage 0.7434846          0.5111676  0.5449074 3.253960
sensitivity5      fitlasso 0.7393822          0.5108552  0.5421378 2.000000
sensitivity8   fit_fsnoage 0.7388996          0.5150674  0.5255116 2.025420
sensitivity7   fit_noscore 0.7362452          0.5124483  0.5424825 3.239427
sensitivity9 fit_fsnoscore 0.7322635          0.5163471  0.5309244 1.153747
sensitivity4   fitlassoroc 0.7322635          0.5180295  0.5234153 1.298430
sensitivity2    fit5432de2 0.7258687          0.5145238  0.5310958 6.205696
sensitivity        fit543a 0.7229730          0.5212556  0.5275913 1.149200
sensitivity3   fit5432bde2 0.7188707          0.5161157  0.5377821 6.128962
sensitivity1     fit5432c1 0.7171815          0.5187222  0.5208754 3.767293
             HoslemPValue Sensitivity Specificity       PPV       NPV
sensitivity6    0.9027294   0.8378378   0.5892857 0.4025974 0.9166667
sensitivity5    0.5321824   0.8648649   0.6071429 0.4210526 0.9315068
sensitivity8    0.6696229   0.8378378   0.6339286 0.4305556 0.9220779
sensitivity7    0.7138617   0.8378378   0.5892857 0.4025974 0.9166667
sensitivity9    0.7410397   0.8378378   0.6071429 0.4133333 0.9189189
sensitivity4    0.2657679   0.7837838   0.6696429 0.4393939 0.9036145
sensitivity2    0.8514311   0.8648649   0.5892857 0.4102564 0.9295775
sensitivity     0.2202417   0.7837838   0.6785714 0.4461538 0.9047619
sensitivity3    0.8454629   0.8918919   0.5446429 0.3928571 0.9384615
sensitivity1    0.3174713   0.8108108   0.6428571 0.4285714 0.9113924
             BestThreshold NumPredictors Prevalence
sensitivity6     0.1903572             6  0.2483221
sensitivity5     0.2048897            10  0.2483221
sensitivity8     0.1992032             6  0.2483221
sensitivity7     0.1918706             6  0.2483221
sensitivity9     0.1962048             6  0.2483221
sensitivity4     0.2338080             5  0.2483221
sensitivity2     0.1941760             8  0.2483221
sensitivity      0.2358142             6  0.2483221
sensitivity3     0.1926877             9  0.2483221
sensitivity1     0.1984304             8  0.2483221

# Specify the models you want to plot
models_to_plot <- c("fit_noage", "fit_fsnoage", "fit_noscore", "fit_fsnoscore", "fit5432bde2", "fit5432c1", "fitlasso", "fitlassoroc")

# Filter results for the specified models
filtered_results <- results_df_sorted[results_df_sorted$Model %in% models_to_plot, ]

# Create a plot with different colors for each model
ggplot(filtered_results, aes(x = Sensitivity, y = PPV, color = Model, label = Model)) +
  geom_point(size = 3) +
  geom_text(vjust = -0.5, hjust = 0.5, size = 3) +
  labs(
    title = "Sensitivity vs. PPV for Selected Models",
    x = "Sensitivity",
    y = "Positive Predictive Value (PPV)",
    color = "Model"
  ) +
  theme_bw() +
  scale_color_brewer(palette = "Dark2")

# My first best models for getting a reasonably high sensitivity >.6, followed by PPV and the other metrics were:
# fit5432bde2, lassoroc (nearly as good, much smaller model), fitlasso (for optimizing PPV, with .6 specificity)
# I had picked one: finalObj1Model = fracture ~ priorfrac + momfrac + bonemed_fu + age + bmi + bonetreat + premeno + raterisk + fracscore #fit5432bde2

# After reconsidering variable selection good candidates look like: fit_fsnoage, fitlasso, fit_fsnoscore, fitnoage

Dr. Turner suggested that I find the threshold for each model at for a particular value for one metric, for example PPV of 0.85, so that it would be easier to compare the other metrics for each model. These models couldn’t reach a PPV of 0.85, so I choose 0.5. I repeated the model fitting above but instead of choosing a threshold based on the highest combined sensitivity and specificity, choosing it for a PPV of 0.5.

# List of model formulas
models <- list(
  tue = fracture ~ priorfrac + age + weight + height + momfrac + armassist +  raterisk + bonemed, 
  tom = fracture ~ priorfrac + fracscore + weight + height + momfrac + armassist + raterisk + bonemed_fu,
  fit5432bde2 = fracture ~ priorfrac + momfrac + bonemed_fu + age + bmi + bonetreat + premeno + raterisk + fracscore,
  # fitlassoroc = fracture ~ priorfrac + height + momfrac + fracscore + bonemed_fu,
  # fitlasso = fracture ~ priorfrac + momfrac + bonemed_fu + age + bmi + height + premeno + armassist + raterisk + smoke,
  # fitlasso_nobmi = fracture ~ priorfrac + momfrac + bonemed_fu + age + height + premeno + armassist + raterisk + smoke, #10
  # these I decided to try after thinking more about the variables that should be excluded
  # fit_all = fracture ~ . -site_id -phy_id -bmi -bonetreat, # high VIF
  fit_noage = fracture ~ priorfrac + weight + height + premeno + momfrac + armassist + smoke + raterisk + fracscore + bonemed + bonemed_fu,
  fit_noage2 = fracture ~ priorfrac + weight + height + momfrac + armassist + smoke + raterisk + fracscore + bonemed + bonemed_fu,
  fit_noagea = fracture ~ priorfrac + weight + height + premeno + momfrac + armassist + smoke + raterisk + fracscore + bonemed + bonemed_fu,
  fit_noage2b = fracture ~ priorfrac + weight + height + momfrac + armassist + smoke + raterisk + fracscore + bonemed,
  fit_noscore2b = fracture ~ priorfrac + weight + height + momfrac + armassist + smoke + raterisk + age + bonemed,
  fit_noage4a = fracture ~ priorfrac + weight + height + momfrac + armassist + raterisk + fracscore + bonemed_fu,
  fit_noscore3a = fracture ~ priorfrac + weight + height + premeno + momfrac + armassist + raterisk + age + bonemed_fu
  # fit_noscore = fracture ~ priorfrac + weight + height + premeno + momfrac + armassist + smoke + raterisk + age + bonemed + bonemed_fu,
  # fit_noage3 = fracture ~ priorfrac + weight + height + premeno + momfrac + armassist + raterisk + fracscore + bonemed + bonemed_fu, #9
  # fit_noscore3 = fracture ~ priorfrac + weight + height + premeno + momfrac + armassist + raterisk + age + bonemed + bonemed_fu,
  # fit_noage4 = fracture ~ priorfrac + weight + height + momfrac + armassist + raterisk + fracscore + bonemed + bonemed_fu,
  # fit_noscore4 = fracture ~ priorfrac + weight + height + momfrac + armassist + raterisk + age + bonemed + bonemed_fu,
  # fit_noage2a = fracture ~ priorfrac + weight + height + momfrac + armassist + smoke + raterisk + fracscore + bonemed_fu,
  # fit_noscore4a = fracture ~ priorfrac + weight + height + momfrac + armassist + raterisk + age + bonemed_fu,
  # fit_noageb = fracture ~ priorfrac + weight + height + premeno + momfrac + armassist + smoke + raterisk + fracscore + bonemed,
  # fit_noage3b = fracture ~ priorfrac + weight + height + premeno + momfrac + armassist + raterisk + fracscore + bonemed,
  # fit_noage4b = fracture ~ priorfrac + weight + height + momfrac + armassist + raterisk + fracscore + bonemed,
  # fit_fsnoage = fracture ~ priorfrac + height + momfrac + fracscore + bonemed_fu + armassist # 2nd feature sel model no age
  # fit_fsnoscore = fracture ~ priorfrac + height + momfrac + age + bonemed_fu + armassist # 2nd feature sel model no fracscore
)

# Set seed for reproducibility
set.seed(1234)

# Set up trainControl for repeated CV
fitControl <- trainControl(
  method = "repeatedcv",
  number = 10,
  repeats = 1,
  classProbs = TRUE,
  summaryFunction = mnLogLoss
)

# Create a list to store CV results for each model
cv_results <- lapply(names(models), function(model_name) {
  model_formula <- models[[model_name]]
  train(
    model_formula,
    data = training,
    method = "glm",
    family = "binomial",
    trControl = fitControl,
    metric = "logLoss"
  )
})

results <- lapply(seq_along(models), function(i) {
  model_name <- names(models)[i]
  model_formula <- models[[model_name]]

  # Count the number of predictors in the model
  num_predictors <- length(all.vars(model_formula)) - 1  # Exclude the response variable

  # Fit the model on the training data
  model <- glm(model_formula, data = training, family = "binomial")

  # Predict probabilities on the validation set
  pred_probs <- predict(model, newdata = validate, type = "response")

  # Calculate AUC
  roc_obj <- pROC::roc(response = validate$fracture, predictor = pred_probs, levels = c("No", "Yes"))
  auc_value <- as.numeric(pROC::auc(roc_obj))

  # Calculate LogLoss for validation
  logloss <- Metrics::logLoss(actual = ifelse(validate$fracture == "Yes", 1, 0), predicted = pred_probs)

  # Extract CV LogLoss
  cv_logloss <- min(cv_results[[i]]$results$logLoss)

  # Calculate VIF
  vifs <- vif(model)
  max_vif <- max(vifs, na.rm = TRUE)

  # Perform Hosmer-Lemeshow test with dynamic bin adjustment
  n_groups <- min(10, floor(nrow(training) / 5))  # At least 5 observations per group
  hoslem_test <- tryCatch(
    hoslem.test(model$y, fitted(model), g = n_groups),
    error = function(e) NULL
  )
  hoslem_pval <- if (!is.null(hoslem_test)) hoslem_test$p.value else NA

  # Find the threshold for desired PPV (e.g. 0.85)
  desired_ppv <- 0.50 # Adjusted for achievable PPV
  all_thresholds <- seq(0, 1, by = 0.01)  # Search thresholds from 0 to 1
  ppv_threshold <- NA
  closest_ppv <- Inf
  best_threshold <- NA

  for (threshold in all_thresholds) {
    predicted_class <- ifelse(pred_probs >= threshold, "Yes", "No")
    confusion <- table(Predicted = predicted_class, Actual = validate$fracture)

    # Calculate PPV
    current_ppv <- ifelse("Yes" %in% rownames(confusion), confusion["Yes", "Yes"] / sum(confusion["Yes", ]), NA)
    
    if (!is.na(current_ppv)) {
      # Check if the current threshold matches the desired PPV within a margin
      if (abs(current_ppv - desired_ppv) < 0.01) {
        ppv_threshold <- threshold
        break
      }
      
      # Track the closest PPV if no exact match is found
      if (abs(current_ppv - desired_ppv) < abs(closest_ppv - desired_ppv)) {
        closest_ppv <- current_ppv
        best_threshold <- threshold
      }
    }
  }

  # If no exact PPV match, use the closest threshold
  if (is.na(ppv_threshold)) {
    ppv_threshold <- best_threshold
  }

  # Calculate Sensitivity, Specificity, PPV, and NPV at the PPV threshold
  sensitivity <- specificity <- ppv <- npv <- NA
  if (!is.na(ppv_threshold)) {
    predicted_class <- ifelse(pred_probs >= ppv_threshold, "Yes", "No")
    confusion <- table(Predicted = predicted_class, Actual = validate$fracture)

    sensitivity <- confusion["Yes", "Yes"] / sum(confusion[, "Yes"])
    specificity <- confusion["No", "No"] / sum(confusion[, "No"])
    ppv <- confusion["Yes", "Yes"] / sum(confusion["Yes", ])
    npv <- confusion["No", "No"] / sum(confusion["No", ])
  }
  prevalence <- mean(validate$fracture == "Yes")

  # Return metrics
  list(
    Model = model_name,
    NumPredictors = num_predictors,
    AUROC = auc_value,
    LogLoss_Validation = logloss,
    LogLoss_CV = cv_logloss,
    MaxVIF = max_vif,
    HoslemPValue = hoslem_pval,  # Hosmer-Lemeshow p-value
    Sensitivity = sensitivity,
    Specificity = specificity,
    PPV = ppv,
    NPV = npv,
    PPV_Threshold = ppv_threshold
  )
})

# Convert the results into a data frame for easier reading
results_df <- do.call(rbind, lapply(results, as.data.frame))

# Sort the results
results_df_sorted <- results_df[order(-results_df$Sensitivity, -results_df$AUROC, results_df$LogLoss_Validation), ]

# Print the sorted data frame
print(results_df_sorted)

           Model NumPredictors     AUROC LogLoss_Validation LogLoss_CV   MaxVIF
5     fit_noage2            10 0.7446911          0.5106273  0.5455846 3.216538
4      fit_noage            11 0.7434846          0.5111676  0.5501318 3.253960
6     fit_noagea            11 0.7434846          0.5111676  0.5450384 3.253960
7    fit_noage2b             9 0.7388996          0.5155075  0.5374377 2.615609
8  fit_noscore2b             9 0.7355212          0.5163885  0.5362377 2.000000
3    fit5432bde2             9 0.7188707          0.5161157  0.5339928 6.128962
2            tom             8 0.7468629          0.5106481  0.5331435 2.594040
9    fit_noage4a             8 0.7468629          0.5106481  0.5359266 2.594040
10 fit_noscore3a             9 0.7398649          0.5119081  0.5403481 2.000000
1            tue             8 0.7345560          0.5172054  0.5330209 2.000000
   HoslemPValue Sensitivity Specificity PPV       NPV PPV_Threshold
5     0.9142989   0.5135135   0.8303571 0.5 0.8378378          0.32
4     0.9027294   0.5135135   0.8303571 0.5 0.8378378          0.32
6     0.9027294   0.5135135   0.8303571 0.5 0.8378378          0.32
7     0.3991309   0.4324324   0.8571429 0.5 0.8205128          0.36
8     0.4142213   0.4324324   0.8571429 0.5 0.8205128          0.34
3     0.8454629   0.2972973   0.9017857 0.5 0.7952756          0.43
2     0.8162808   0.2432432   0.9196429 0.5 0.7862595          0.45
9     0.8162808   0.2432432   0.9196429 0.5 0.7862595          0.45
10    0.3103749   0.2432432   0.9196429 0.5 0.7862595          0.45
1     0.3341375   0.1891892   0.9375000 0.5 0.7777778          0.50

# I think the best model for getting a reasonably high sensitivity, PPV and the other metrics is fit_noage = fracture ~ priorfrac + weight + height + premeno + momfrac + armassist + smoke + raterisk + fracscore + bonemed + bonemed_fu.

After adjusting the model’s prediction thresholds based on a constant PPV, this model seems to perform the best, with a comparatively high AUROC and high sensitivity at a fixed PPV. The VIF values are low and the Hosmer-Lemeshow test doesn’t suggest evidence against goodness of fit. It excludes site_id, phy_id, bmi and bonetreat (they are interactions with variables I kept), and age (it is highly correlated with fracscore, which performed better). fracture ~ priorfrac + weight + height + premeno + momfrac + armassist + smoke + raterisk + fracscore + bonemed + bonemed_fu

Here, I want to rerun the models which include height without the extreme value to compare their performance before and after.

#Train/Validation split without low or high heights
training2<-bone_subset %>% filter(!(bone$height < 140 | bone$height > 190))
validate2<-bone_subset %>% filter(!(bone$height < 140 | bone$height > 190))

# List of model formulas
models <- list(
  tue = fracture ~ priorfrac + age + weight + height + momfrac + armassist +  raterisk + bonemed, 
  tom = fracture ~ priorfrac + fracscore + weight + height + momfrac + armassist + raterisk + bonemed_fu,
  fit543a = fracture ~ priorfrac + momfrac + bonemed_fu + age + bmi + height,
  # fit5432bde2 = fracture ~ priorfrac + momfrac + bonemed_fu + age + bmi + bonetreat + premeno + raterisk + fracscore, # no height
  fitlassoroc = fracture ~ priorfrac + height + momfrac + fracscore + bonemed_fu,
  fitlasso = fracture ~ priorfrac + momfrac + bonemed_fu + age + bmi + height + premeno + armassist + raterisk + smoke,
  fitlasso_nobmi = fracture ~ priorfrac + momfrac + bonemed_fu + age + height + premeno + armassist + raterisk + smoke,
  # these I decided to try after thinking more about the variables that should be excluded
  # fit_all = fracture ~ . -site_id -phy_id -bmi -bonetreat, # high VIF
  fit_noage = fracture ~ priorfrac + weight + height + premeno + momfrac + armassist + smoke + raterisk + fracscore + bonemed + bonemed_fu, # -age, reduce multicollinearity
  # fit_noscore = fracture ~ . -site_id -phy_id -bmi -bonetreat -fracscore, # reduce multicollinearity
  fit_fsnoage = fracture ~ priorfrac + height + momfrac + fracscore + bonemed_fu + armassist, # 2nd feature sel model no age,
  fit_fsnoscore = fracture ~ priorfrac + height + momfrac + bonemed_fu + age + armassist # 2nd feature sel model no age
)

# Set seed for reproducibility
set.seed(1234)

# Set up trainControl for repeated CV
fitControl <- trainControl(
  method = "repeatedcv",
  number = 10,
  repeats = 1,
  classProbs = TRUE,
  summaryFunction = mnLogLoss
)

# Create a list to store CV results for each model
cv_results <- lapply(names(models), function(model_name) {
  model_formula <- models[[model_name]]
  train(
    model_formula,
    data = training2,  # Updated to training2
    method = "glm",
    family = "binomial",
    trControl = fitControl,
    metric = "logLoss"
  )
})

results <- lapply(seq_along(models), function(i) {
  model_name <- names(models)[i]
  model_formula <- models[[model_name]]

  # Count the number of predictors in the model
  num_predictors <- length(all.vars(model_formula)) - 1  # Exclude the response variable

  # Fit the model on the training data
  model <- glm(model_formula, data = training2, family = "binomial")  # Updated to training2

  # Predict probabilities on the validation set
  pred_probs <- predict(model, newdata = validate2, type = "response")  # Updated to validate2

  # Calculate AUC
  roc_obj <- pROC::roc(response = validate2$fracture, predictor = pred_probs, levels = c("No", "Yes"))  # Updated to validate2
  auc_value <- as.numeric(pROC::auc(roc_obj))

  # Calculate LogLoss for validation
  logloss <- Metrics::logLoss(actual = ifelse(validate2$fracture == "Yes", 1, 0), predicted = pred_probs)  # Updated to validate2

  # Extract CV LogLoss
  cv_logloss <- min(cv_results[[i]]$results$logLoss)

  # Calculate VIF
  vifs <- vif(model)
  max_vif <- max(vifs, na.rm = TRUE)

  # Perform Hosmer-Lemeshow test with dynamic bin adjustment
  n_groups <- min(10, floor(nrow(training2) / 5))  # Updated to training2
  hoslem_test <- tryCatch(
    hoslem.test(model$y, fitted(model), g = n_groups),
    error = function(e) NULL
  )
  hoslem_pval <- if (!is.null(hoslem_test)) hoslem_test$p.value else NA

  # Find the threshold for desired PPV (e.g. 0.85)
  desired_ppv <- 0.50  # Adjusted for achievable PPV
  all_thresholds <- seq(0, 1, by = 0.01)  # Search thresholds from 0 to 1
  ppv_threshold <- NA
  closest_ppv <- Inf
  best_threshold <- NA

  for (threshold in all_thresholds) {
    predicted_class <- ifelse(pred_probs >= threshold, "Yes", "No")
    confusion <- table(Predicted = predicted_class, Actual = validate2$fracture)  # Updated to validate2

    # Calculate PPV
    current_ppv <- ifelse("Yes" %in% rownames(confusion), confusion["Yes", "Yes"] / sum(confusion["Yes", ]), NA)
    
    if (!is.na(current_ppv)) {
      # Check if the current threshold matches the desired PPV within a margin
      if (abs(current_ppv - desired_ppv) < 0.01) {
        ppv_threshold <- threshold
        break
      }
      
      # Track the closest PPV if no exact match is found
      if (abs(current_ppv - desired_ppv) < abs(closest_ppv - desired_ppv)) {
        closest_ppv <- current_ppv
        best_threshold <- threshold
      }
    }
  }

  # If no exact PPV match, use the closest threshold
  if (is.na(ppv_threshold)) {
    ppv_threshold <- best_threshold
  }

  # Calculate Sensitivity, Specificity, PPV, and NPV at the PPV threshold
  sensitivity <- specificity <- ppv <- npv <- NA
  if (!is.na(ppv_threshold)) {
    predicted_class <- ifelse(pred_probs >= ppv_threshold, "Yes", "No")
    confusion <- table(Predicted = predicted_class, Actual = validate2$fracture)  # Updated to validate2

    sensitivity <- confusion["Yes", "Yes"] / sum(confusion[, "Yes"])
    specificity <- confusion["No", "No"] / sum(confusion[, "No"])
    ppv <- confusion["Yes", "Yes"] / sum(confusion["Yes", ])
    npv <- confusion["No", "No"] / sum(confusion["No", ])
  }
  prevalence <- mean(validate2$fracture == "Yes")  # Updated to validate2

  # Return metrics
  list(
    Model = model_name,
    NumPredictors = num_predictors,
    AUROC = auc_value,
    LogLoss_Validation = logloss,
    LogLoss_CV = cv_logloss,
    MaxVIF = max_vif,
    HoslemPValue = hoslem_pval,  # Hosmer-Lemeshow p-value
    Sensitivity = sensitivity,
    Specificity = specificity,
    PPV = ppv,
    NPV = npv,
    PPV_Threshold = ppv_threshold
  )
})

# Convert the results into a data frame for easier reading
results_df <- do.call(rbind, lapply(results, as.data.frame))

# Sort the results
results_df_sorted <- results_df[order(-results_df$Sensitivity, -results_df$AUROC, results_df$LogLoss_Validation), ]

# Print the sorted data frame
print(results_df_sorted)

           Model NumPredictors     AUROC LogLoss_Validation LogLoss_CV   MaxVIF
5       fitlasso            10 0.7270787          0.5017758  0.5267315 2.000000
2            tom             8 0.7243402          0.5036073  0.5200329 2.601416
7      fit_noage            11 0.7282646          0.5021859  0.5337317 3.384928
1            tue             8 0.7232405          0.5056559  0.5280232 2.000000
9  fit_fsnoscore             6 0.7213968          0.5068034  0.5180951 1.153743
8    fit_fsnoage             6 0.7212136          0.5076248  0.5234969 2.019138
4    fitlassoroc             5 0.7191974          0.5091290  0.5198037 1.322911
6 fitlasso_nobmi             9 0.7307012          0.5032898  0.5303461 2.000000
3        fit543a             6 0.7098823          0.5086573  0.5213068 1.190909
  HoslemPValue Sensitivity Specificity       PPV       NPV PPV_Threshold
5   0.39409681   0.4838710   0.8342246 0.4918033 0.8297872          0.33
2   0.09267098   0.4838710   0.8342246 0.4918033 0.8297872          0.33
7   0.13340208   0.4596774   0.8422460 0.4913793 0.8246073          0.34
1   0.41711263   0.4354839   0.8502674 0.4909091 0.8195876          0.34
9   0.78307274   0.4032258   0.8609626 0.4901961 0.8131313          0.36
8   0.50459409   0.3064516   0.9010695 0.5066667 0.7966903          0.40
4   0.38981163   0.2903226   0.9064171 0.5070423 0.7939110          0.40
6   0.01578518   0.2822581   0.9090909 0.5072464 0.7925408          0.43
3   0.22735424   0.2500000   0.9197861 0.5081967 0.7871854          0.44

Removing the two extreme height observations does not seem to improve the performance of the models that include height. It seems unlikely that 2 points out of 500 would have a large impact, so I think for now sticking with the previously chosen model would be a good bet.

Here, I want to optimize the threshold of the chosen model(s) to achieve the highest PPV possible with a specificity of at least 0.80. I want to ensure that at least 80% of the patients at high risk of fractures receive treatment and minimize the number of women at low risk that are treated unnecessarily.

# # Tue's
# final_model <- glm(
#   formula = fracture ~ priorfrac + age + weight + height + momfrac + armassist +  raterisk + bonemed,
#   data = training,
#   family = "binomial"
# )

# # Tom's
# final_model <- glm(
#   formula = fracture ~ priorfrac + fracscore + weight + height + momfrac + armassist + raterisk + bonemed_fu,
#   data = training,
#   family = "binomial"
# )

# Fit the model on the training data
final_model <- glm(
  formula = fracture ~ priorfrac + weight + height + premeno + momfrac + armassist + smoke + raterisk + fracscore + bonemed + bonemed_fu,
  data = training,
  family = "binomial"
)

# Predict probabilities on the validation data
pred_probs <- predict(final_model, newdata = validate, type = "response")

# Define a target Sensitivity threshold
target_sensitivity <- 0.8  # Adjust to 0.75, 0.85, or desired value

# Function to calculate performance metrics for a given threshold
calculate_metrics <- function(threshold, pred_probs, actual) {
  # Predicted classes based on threshold
  pred_class <- ifelse(pred_probs > threshold, "Yes", "No")
  
  # Ensure factors have the same levels
  actual <- factor(actual, levels = c("No", "Yes"))
  pred_class <- factor(pred_class, levels = c("No", "Yes"))
  
  # Generate the confusion matrix
  cm <- table(Predicted = pred_class, Actual = actual)
  
  # Extract counts from the confusion matrix
  tp <- cm["Yes", "Yes"]  # True Positives
  fp <- cm["Yes", "No"]   # False Positives
  tn <- cm["No", "No"]    # True Negatives
  fn <- cm["No", "Yes"]   # False Negatives
  
  # Safely calculate metrics
  ppv <- ifelse((tp + fp) > 0, tp / (tp + fp), NA)  # Positive Predictive Value (Precision)
  sensitivity <- ifelse((tp + fn) > 0, tp / (tp + fn), NA)  # Sensitivity (Recall)
  specificity <- ifelse((tn + fp) > 0, tn / (tn + fp), NA)  # Specificity
  accuracy <- ifelse((tp + tn + fp + fn) > 0, (tp + tn) / (tp + tn + fp + fn), NA)  # Accuracy

  # Return metrics as a data frame
  metrics <- data.frame(
    Threshold = threshold,
    Sensitivity = sensitivity,
    Specificity = specificity,
    PPV = ppv,
    Accuracy = accuracy
  )
  return(metrics)
}

# Generate metrics for a refined range of thresholds
thresholds <- seq(0.1, 0.4, by = 0.01)  # Narrowed range
metrics_list <- lapply(thresholds, calculate_metrics, pred_probs = pred_probs, actual = validate$fracture)
metrics_df <- do.call(rbind, metrics_list)

# Filter metrics for Sensitivity >= target_sensitivity
valid_metrics <- metrics_df[metrics_df$Sensitivity >= target_sensitivity, ]

# Ensure there are valid thresholds
if (nrow(valid_metrics) == 0) {
  stop("No thresholds meet the target Sensitivity. Try lowering target_sensitivity.")
}

# Find the best threshold (maximizing PPV)
best_threshold <- valid_metrics$Threshold[which.max(valid_metrics$PPV)]
best_metrics <- valid_metrics[valid_metrics$Threshold == best_threshold, ]

# Output the best threshold and its metrics
cat("Best threshold with Sensitivity >=", target_sensitivity, ":\n")

Best threshold with Sensitivity >= 0.8 :

print(best_metrics)

   Threshold Sensitivity Specificity       PPV  Accuracy
10      0.19   0.8378378   0.5892857 0.4025974 0.6510067

# Plot metrics to visualize trade-offs
metrics_df_long <- melt(metrics_df, id.vars = "Threshold", variable.name = "Metric", value.name = "Value")

plot_metrics <- ggplot(metrics_df_long, aes(x = Threshold, y = Value, color = Metric)) +
  geom_line() +
  geom_vline(xintercept = best_threshold, linetype = "dashed", color = "gray50") +
  geom_text(aes(x = best_threshold, y = 0.5, 
                label = paste("Best Threshold =", round(best_threshold, 2))),
            color = "gray50", angle = 90, hjust = -0.2, size = 2.5) +
  labs(title = "Model Performance Metrics by Threshold", 
       x = "Threshold", y = "Metric Value") +
  theme_bw() +
  scale_color_brewer(palette = "Set1")

print(plot_metrics)

# Save the plots and metrics if needed
# ggsave("figures/Obj1model_metrics_plotV2.png", plot_metrics)
# write.csv(metrics_df, "figures/Obj1model_metrics_by_thresholdV2.csv", row.names = FALSE)

I found a threshold (0.19), which allowed the model to achieve the largest PPV once a minimum Sensitivity of 0.80 was reached.

There is a trade-off between over-treating or under-treating. We have to consider the cost of someone at high risk having a fracture because they were not diagnosed and treated. Additionally, we need to weigh that against the cost of treating unnecessarily, e.g. inconvenience to the patient, cost of the treatment, and treatment side effects. If we optimize sensitivity with a lower threshold, then we err on the side of over-treating. It seems best to choose a minimum proportion of those at risk that we want to be certain receive treatment and after reaching that minimum try to minimize the cost of treating unnecessarily by making our PPV as high as possible to ensure our positive predictions are correct.

Here, I will get the coefficients for the final Objective 1 additive model for interpretation and set the threshold (using a minimum Sensitivity of at least 0.8 and a moderately high PPV) to get a confusion matrix.

# Fit the model on the training data
final_model <- glm(
  formula = fracture ~ priorfrac + weight + height + premeno + momfrac + armassist + smoke + raterisk + fracscore + bonemed + bonemed_fu,
  data = training,
  family = "binomial"
)

# # Tue's
# final_model <- glm(
#   formula = fracture ~ priorfrac + age + weight + height + momfrac + armassist +  raterisk + bonemed,
#   data = training,
#   family = "binomial"
# )

# # Tom's
# final_model <- glm(
#   formula = fracture ~ priorfrac + fracscore + weight + height + momfrac + armassist + raterisk + bonemed_fu,
#   data = training,
#   family = "binomial"
# )


# Predict probabilities on the validation data
pred_probs <- predict(final_model, newdata = validate, type = "response")

# Get the coefficients
coef(summary(final_model))

                    Estimate Std. Error    z value   Pr(>|z|)
(Intercept)      3.726543862 3.51975849  1.0587499 0.28971371
priorfracYes     0.594130932 0.33028411  1.7988481 0.07204272
weight           0.009551798 0.01041919  0.9167505 0.35927340
height          -0.042148335 0.02239631 -1.8819317 0.05984529
premenoYes       0.277711261 0.34974297  0.7940439 0.42716989
momfracYes       0.866747721 0.38674545  2.2411323 0.02501751
armassistYes    -0.060558514 0.41071111 -0.1474480 0.88277845
smokeYes        -0.397622978 0.55300158 -0.7190268 0.47212439
rateriskSame     0.101713150 0.33774209  0.3011563 0.76329533
rateriskGreater  0.134134342 0.37673618  0.3560432 0.72180824
fracscore        0.193679485 0.08791951  2.2029180 0.02760053
bonemedYes      -0.302086003 0.50537118 -0.5977508 0.55000625
bonemed_fuYes    0.594836643 0.49523255  1.2011259 0.22970237

# Exponentiate the coefficients to get the odds ratios
exp(coef(summary(final_model))[,1])

    (Intercept)    priorfracYes          weight          height      premenoYes 
     41.5353081       1.8114560       1.0095976       0.9587276       1.3201050 
     momfracYes    armassistYes        smokeYes    rateriskSame rateriskGreater 
      2.3791606       0.9412387       0.6719153       1.1070659       1.1435464 
      fracscore      bonemedYes   bonemed_fuYes 
      1.2137072       0.7392745       1.8127348

# Get the confidence intervals
exp(confint(final_model))

                     2.5 %       97.5 %
(Intercept)     0.04302943 45026.592845
priorfracYes    0.94609161     3.466895
weight          0.98901171     1.030405
height          0.91686391     1.001397
premenoYes      0.65390473     2.593581
momfracYes      1.10571683     5.072012
armassistYes    0.41801097     2.101255
smokeYes        0.20491608     1.860272
rateriskSame    0.57123593     2.157652
rateriskGreater 0.54511300     2.399972
fracscore       1.02268969     1.445135
bonemedYes      0.27154307     1.995866
bonemed_fuYes   0.68076709     4.814276

# Calculate coefficients and confidence intervals via odds ratios
summary_table <- cbind(
  OR = exp(coef(summary(final_model))[, 1]), # Odds Ratios
  `Lower CI` = exp(confint(final_model)[, 1]), # Lower 95% CI
  `Upper CI` = exp(confint(final_model)[, 2]), # Upper 95% CI
  `p-value` = coef(summary(final_model))[, 4] # p-values
)

# Convert to a data frame for cleaner output
summary_df <- as.data.frame(summary_table)
summary_df <- cbind(Variable = rownames(summary_table), summary_df)
rownames(summary_df) <- NULL

# Print the summary table
Obj1Model_table = gt(summary_df)
Obj1Model_table

Variable	OR	Lower CI	Upper CI	p-value
(Intercept)	41.5353081	0.04302943	45026.592845	0.28971371
priorfracYes	1.8114560	0.94609161	3.466895	0.07204272
weight	1.0095976	0.98901171	1.030405	0.35927340
height	0.9587276	0.91686391	1.001397	0.05984529
premenoYes	1.3201050	0.65390473	2.593581	0.42716989
momfracYes	2.3791606	1.10571683	5.072012	0.02501751
armassistYes	0.9412387	0.41801097	2.101255	0.88277845
smokeYes	0.6719153	0.20491608	1.860272	0.47212439
rateriskSame	1.1070659	0.57123593	2.157652	0.76329533
rateriskGreater	1.1435464	0.54511300	2.399972	0.72180824
fracscore	1.2137072	1.02268969	1.445135	0.02760053
bonemedYes	0.7392745	0.27154307	1.995866	0.55000625
bonemed_fuYes	1.8127348	0.68076709	4.814276	0.22970237

# Save the table as a png for ppt
# gtsave(Obj1Model_table, "figures/Obj1Model_table.png")

# Categorical interpretation
# Exponentiate the priorfracYes coefficient to get the odds ratio
exp(coef(final_model)['momfracYes'])

momfracYes 
  2.379161

# Get the percent change in the odds of those with a history of fracture
(exp(coef(final_model)['momfracYes'])-1)*100

momfracYes 
  137.9161

# Get the confidence interval
exp(confint(final_model)['momfracYes',])

   2.5 %   97.5 % 
1.105717 5.072012

(exp(confint(final_model)['momfracYes',])-1)*100

    2.5 %    97.5 % 
 10.57168 407.20116

# Numerical interpretation
# Exponentiate the age coefficient to get the odds ratio
exp(coef(final_model)['fracscore'])

fracscore 
 1.213707

# Get the percent change in the odds with a 1 year increase in age
(exp(coef(final_model)['fracscore'])-1)*100

fracscore 
 21.37072

# Get the confidence interval
exp(confint(final_model)['fracscore',])

   2.5 %   97.5 % 
1.022690 1.445135

(exp(confint(final_model)['fracscore',])-1)*100

    2.5 %    97.5 % 
 2.268969 44.513502

# Change the threshold for 0.8+ sensitivity and highest PPV and create the confusion matrix
threshold <- 0.19

# Ensure consistent factor levels
obj1.preds <- factor(ifelse(pred_probs > threshold, "Yes", "No"), levels = c("No", "Yes"))
validate$fracture <- factor(validate$fracture, levels = c("No", "Yes"))

# Confusion matrix with explicit positive class
conf_matrix <- confusionMatrix(data = obj1.preds, reference = validate$fracture, positive = "Yes")
print(conf_matrix)

Confusion Matrix and Statistics

          Reference
Prediction No Yes
       No  66   6
       Yes 46  31
                                          
               Accuracy : 0.651           
                 95% CI : (0.5687, 0.7272)
    No Information Rate : 0.7517          
    P-Value [Acc > NIR] : 0.9977          
                                          
                  Kappa : 0.3136          
                                          
 Mcnemar's Test P-Value : 6.362e-08       
                                          
            Sensitivity : 0.8378          
            Specificity : 0.5893          
         Pos Pred Value : 0.4026          
         Neg Pred Value : 0.9167          
             Prevalence : 0.2483          
         Detection Rate : 0.2081          
   Detection Prevalence : 0.5168          
      Balanced Accuracy : 0.7136          
                                          
       'Positive' Class : Yes

Here, I create an effects plot for each of the predictors in the Objective 1 model. Also for curiosity, I create the model with only the significant variables (at the \(\alpha = 0.10\) level) and compare some metrics.

# Just want to compare a model with only the significant predictors
# mymodel.red <- glm(fracture ~ priorfrac + momfrac + bonemed_fu, data=training, family="binomial")
mymodel.red <- glm(fracture ~ priorfrac + momfrac + fracscore + height, data=training, family="binomial")
summary(mymodel.red)


Call:
glm(formula = fracture ~ priorfrac + momfrac + fracscore + height, 
    family = "binomial", data = training)

Coefficients:
             Estimate Std. Error z value Pr(>|z|)   
(Intercept)   4.48004    3.35247   1.336   0.1814   
priorfracYes  0.67701    0.31129   2.175   0.0296 * 
momfracYes    0.92543    0.36526   2.534   0.0113 * 
fracscore     0.16721    0.05956   2.807   0.0050 **
height       -0.04096    0.02077  -1.972   0.0486 * 
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

(Dispersion parameter for binomial family taken to be 1)

    Null deviance: 395.31  on 350  degrees of freedom
Residual deviance: 358.57  on 346  degrees of freedom
AIC: 368.57

Number of Fisher Scoring iterations: 4

vif(final_model)

               GVIF Df GVIF^(1/(2*Df))
priorfrac  1.426703  1        1.194447
weight     1.756250  1        1.325236
height     1.202478  1        1.096576
premeno    1.115866  1        1.056346
momfrac    1.154333  1        1.074399
armassist  2.421225  1        1.556029
smoke      1.066679  1        1.032802
raterisk   1.360374  2        1.079977
fracscore  2.711928  1        1.646793
bonemed    3.154672  1        1.776140
bonemed_fu 3.253960  1        1.803873

vif(mymodel.red)

priorfrac   momfrac fracscore    height 
 1.284425  1.052896  1.282773  1.031656

hoslem.test(final_model$y,fitted(final_model))


    Hosmer and Lemeshow goodness of fit (GOF) test

data:  final_model$y, fitted(final_model)
X-squared = 3.454, df = 8, p-value = 0.9027

hoslem.test(mymodel.red$y,fitted(mymodel.red))


    Hosmer and Lemeshow goodness of fit (GOF) test

data:  mymodel.red$y, fitted(mymodel.red)
X-squared = 11.97, df = 8, p-value = 0.1526

AIC(final_model)

[1] 379.9978

AIC(mymodel.red)

[1] 368.5736

# Make effects plots for each predictor one at a time
# Obj1 model: fracture ~ priorfrac + weight + height + premeno + momfrac + armassist + smoke + raterisk + fracscore + bonemed + bonemed_fu
plot_model(final_model, type="pred",terms=c("priorfrac"))

plot_model(final_model,type="pred",terms=c("weight"))

plot_model(final_model,type="pred",terms=c("height"))

plot_model(final_model,type="pred",terms=c("premeno"))

plot_model(final_model, type="pred",terms=c("momfrac"))

plot_model(final_model, type="pred",terms=c("armassist"))

plot_model(final_model,type="pred",terms=c("smoke"))

plot_model(final_model,type="pred",terms=c("raterisk"))

plot_model(final_model,type="pred",terms=c("fracscore"))

plot_model(final_model,type="pred",terms=c("bonemed"))

plot_model(final_model,type="pred",terms=c("bonemed_fu"))

# Just in case for the two that are being interpreted
plot_model(final_model,type="pred",terms=c("fracscore", "momfrac"))

# # Function to save a single effects plot
# save_effects_plot <- function(model, term, filename) {
#   # Generate the plot
#   effect_plot <- plot_model(model, type = "pred", terms = term)
#   
#   # Save the plot as a PNG file
#   ggsave(filename, plot = effect_plot, width = 8, height = 6, dpi = 300)
# }
# 
# # Save effects plots for each predictor in the final model
# save_effects_plot(final_model, "priorfrac", "figuresKH/effectKH_priorfrac.png")
# save_effects_plot(final_model, "weight", "figuresKH/effectKH_weight.png")
# save_effects_plot(final_model, "height", "figuresKH/effectKH_height.png")
# save_effects_plot(final_model, "premeno", "figuresKH/effectKH_premeno.png")
# save_effects_plot(final_model, "momfrac", "figuresKH/effectKH_momfrac.png")
# save_effects_plot(final_model, "armassist", "figuresKH/effectKH_armassist.png")
# save_effects_plot(final_model, "smoke", "figuresKH/effectKH_smoke.png")
# save_effects_plot(final_model, "raterisk", "figuresKH/effectKH_raterisk.png")
# save_effects_plot(final_model, "fracscore", "figuresKH/effectKH_fracscore.png")
# save_effects_plot(final_model, "bonemed", "figuresKH/effectKH_bonemed.png")
# save_effects_plot(final_model, "bonemed_fu", "figuresKH/effectKH_bonemed_fu.png")
# save_effects_plot(final_model, c("fracscore", "momfrac"), "figuresKH/dualeffectsKH_fracscore_momfrac.png")

Objective 1 Findings

Model fitting approach:
For intuition on the most important predictors, GLMNET, LASSO and stepwise feature selection methods were used with 10-fold cross-validation on the training split (70/30). Both the logloss error and the AUROC metrics were used for final model selection. The relative frequency of each of the predictors among the feature selection models helped inform the next steps. A manual process was used to compare models formulated with the most frequent predictors from feature selection, but mainly with intuition from EDA. Candidate models were selected by adjusting each model’s prediction threshold based on a constant PPV and comparing metrics including AUROC, sensitivity and specificity. Only models with reasonable VIF values and insignificant Hosmer-Lemeshow p-values were selected.

Selected variables:
fracscore, and priorfrac were included because visual evidence from EDA suggested that those three variables seemed most likely to influence the probability of fracture. priorfrac was present in all three feature selection models and fracscore in two of three. age was included in one feature selection model and did seem important in EDA but age and fracscore were highly correlated and resulted in high VIFs, so age was excluded.
armassist, momfrac, raterisk, height, bonemed, and bonemed_fu also seemed informative based on EDA. momfrac and height were in two and armassist and bonemed_fu were in one of three feature selection models. Multicollinearity was a concern among the medication variables. I opted to exclude bonetreat because it was an interaction term of the other two. bmi was excluded because it is an interaction of weight and height. weight, premeno and smoke seemed less important from EDA and in feature selection but manually adding them to candidate models improved their performance metrics.

The formula of the final Objective 1 model is: fracture ~ priorfrac + weight + height + premeno + momfrac + armassist + smoke + raterisk + fracscore + bonemed + bonemed_fu.

The logistic model is:
\[ \log \Big( \frac{\pi}{1-\pi} \Big) = 3.73 + 0.59\text{priorfracYes} + 0.01\text{weight} - 0.04\text{height} + 0.28\text{premenoYes} + 0.87\text{momfracYes} - 0.06\text{armassistYes} - \\ 0.40\text{smokeYes} + 0.10\text{rateriskSame} + 0.13\text{rateriskGreater} + 0.19\text{fracscore} - 0.30\text{bonemedYes} + 0.59\text{bonemed_fuYes} \]

Interpretation of coefficients:

There is sufficient evidence at the \(\alpha = 0.05\) level of significance to suggest that for women with a maternal history of hip fracture, the odds of a fracture within one year of starting the ostoeporosis study are 137.9% higher than women whose mom’s did not have a fracture, receiving the same medication treatments, of the same height weight, relative age of menopause, personal history of fracture, smoking status, fracture score, use of assistance standing up and self accessed risk (an increase of a factor of \(e^{2.3791}\), p-value = 0.0250), 95% CI: [10.6%, 407.2%].

There is also sufficient evidence at the \(\alpha = 0.05\) level of significance to suggest that for every one unit increase in fracscore, the odds of fracture increase by 21.4% for women receiving the same medication treatments, of the same height, weight, relative age of menopause, history of fracture and maternal fracture, smoking status, use of assistance standing up and self accessed risk, (a factor of \(e^{1.214}\), p-value = 0.0276), 95% CI: [2.3%, 44.5%].

Practical versus statistical significance:
When accounting for all of the variables in the model, only two were statistically significant at the \(\alpha = 0.05\) level in explaining the relationship with fracture, momfrac and fracscore. The significance level of those predictors and two others height and priorfrac increased when the other seven variables were not included in the model. However including the others decreased the error and increased the ability to predict on the validation set, highlighting their practical significance in explaining the relationship with fracture.

With a threshold of 0.19 to prioritize treating high risk individuals while maintain reasonable avoidance of unnecessary intervention for low risk individuals, the model achieves the following metrics:
- AUROC: 0.74
- Sensitivity: 0.84
- Specificity: 0.59
- PPV: 0.40
- NPV: 0.92
- Prevalence: 0.25
- Logloss Validation: 0.51
- Logloss CV: 0.54
- MaxVIF: 3.25
- Hoslem p-value: 0.90

Objective 2: Models for Prediction

The next goal is to improve prediction performance metrics by producing a logistic regression model with added complexity, a discriminant analysis model, and a non-parametric model.

Complex Logistic Regression Model

In developing a complex logistic regression model for Objective 2, I referred back to EDA for possible interactions, add those to the final model from Objective 1 and then performed feature selection. Again, the stepwise feature selection code is commented out to keep the output shorter.

# Objective 1 model with added interactions based on EDA
formula = fracture ~ priorfrac + weight + height + premeno + momfrac + armassist + smoke + raterisk + fracscore + bonemed + bonemed_fu + priorfrac:premeno + priorfrac:momfrac + priorfrac:bonemed + priorfrac:bonemed_fu + priorfrac:fracscore + priorfrac:weight + momfrac:armassist + momfrac:raterisk + momfrac:bonemed + momfrac:bonemed_fu + momfrac:smoke + smoke:height + smoke:raterisk + smoke:bonemed + smoke:bonemed_fu + raterisk:premeno + raterisk:bonemed + raterisk:bonemed_fu + raterisk:weight + weight:bonemed + weight:bonemed_fu

# Here, I use penalized regression for feature selection and stepwise feature selection after adding interaction terms. I'd like to get a sense of the important predictors.    

# logLoss error metric
fitControl<-trainControl(method="repeatedcv",number=10,repeats=1,classProbs=TRUE, summaryFunction=mnLogLoss)

# GLMNET
set.seed(1234)
glmnet.fit<-train(formula,
                    data=training,
                    method="glmnet",
                    trControl=fitControl,
                    metric="logLoss"
)
coef(glmnet.fit$finalModel,glmnet.fit$finalModel$lambdaOpt)

40 x 1 sparse Matrix of class "dgCMatrix"
                              s=0.02283267
(Intercept)                    1.476476766
priorfracYes                   .          
weight                         .          
height                        -0.021304276
premenoYes                     .          
momfracYes                     0.659095437
armassistYes                   .          
smokeYes                      -0.110653661
rateriskSame                   .          
rateriskGreater                .          
fracscore                      0.134976766
bonemedYes                     .          
bonemed_fuYes                  .          
priorfracYes:premenoYes        .          
priorfracYes:momfracYes        .          
priorfracYes:bonemedYes        .          
priorfracYes:bonemed_fuYes     .          
priorfracYes:fracscore         .          
priorfracYes:weight            0.006283359
momfracYes:armassistYes        .          
momfracYes:rateriskSame        .          
momfracYes:rateriskGreater     .          
momfracYes:bonemedYes         -0.414588020
momfracYes:bonemed_fuYes       .          
momfracYes:smokeYes            .          
height:smokeYes                .          
smokeYes:rateriskSame          .          
smokeYes:rateriskGreater       .          
smokeYes:bonemedYes            1.738869292
smokeYes:bonemed_fuYes         .          
premenoYes:rateriskSame        .          
premenoYes:rateriskGreater     .          
rateriskSame:bonemedYes        .          
rateriskGreater:bonemedYes     .          
rateriskSame:bonemed_fuYes     0.332737131
rateriskGreater:bonemed_fuYes  .          
weight:rateriskSame            .          
weight:rateriskGreater         0.001438714
weight:bonemedYes              .          
weight:bonemed_fuYes           0.001807696

# includes height, momfrac, smoke, fracscore, priorfrac:weight, momfrac:bonemed, smoke:bonemed, raterisk:bonemed_fu, weight:raterisk, weight_bonemed_fu


# LASSO
set.seed(1234)
lasso.fit<-train(formula,
                    data=training,
                    method="glmnet",
                    trControl=fitControl,
                    metric="logLoss",
                    tuneGrid = expand.grid(alpha = 1,  # Lasso penalty
                                          lambda = seq(0.01, 10, length = 200))  # Range of lambda values to test
)
coef(lasso.fit$finalModel,lasso.fit$finalModel$lambdaOpt)

40 x 1 sparse Matrix of class "dgCMatrix"
                                    s=0.01
(Intercept)                    3.498543544
priorfracYes                   .          
weight                         .          
height                        -0.035913319
premenoYes                     0.343514036
momfracYes                     1.058698672
armassistYes                   .          
smokeYes                      -0.614588877
rateriskSame                   .          
rateriskGreater                .          
fracscore                      0.164349002
bonemedYes                     .          
bonemed_fuYes                  .          
priorfracYes:premenoYes       -0.718306487
priorfracYes:momfracYes        0.005806747
priorfracYes:bonemedYes       -0.313078405
priorfracYes:bonemed_fuYes     .          
priorfracYes:fracscore         .          
priorfracYes:weight            0.009992061
momfracYes:armassistYes        .          
momfracYes:rateriskSame        0.049235185
momfracYes:rateriskGreater     .          
momfracYes:bonemedYes         -1.280908089
momfracYes:bonemed_fuYes       .          
momfracYes:smokeYes           -0.390256510
height:smokeYes                .          
smokeYes:rateriskSame          .          
smokeYes:rateriskGreater       .          
smokeYes:bonemedYes            3.325398975
smokeYes:bonemed_fuYes         .          
premenoYes:rateriskSame        .          
premenoYes:rateriskGreater     .          
rateriskSame:bonemedYes        .          
rateriskGreater:bonemedYes    -0.127412189
rateriskSame:bonemed_fuYes     0.607607673
rateriskGreater:bonemed_fuYes  .          
weight:rateriskSame            .          
weight:rateriskGreater         0.004563678
weight:bonemedYes              .          
weight:bonemed_fuYes           0.003436180

# height + premeno + momfrac + smoke + fracscore + priorfrac:premeno + priorfrac:momfrac + priorfrac:bonemed + priorfrac:weight + momfrac:raterisk + momfrac:bonemed + momfrac:smoke + smoke:bonemed + raterisk:bonemed + raterisk:bonemed_fu + weight:raterisk + weight:bonemed_fu


#STEPWISE
set.seed(1234)
# Step.fit<-train(formula,
#                     data=training,
#                     method="glmStepAIC",
#                     trControl=fitControl,
#                     metric="logLoss")
# coef(Step.fit$finalModel)
# priorfrac, height, premeno, momfrac, smoke, fracscore, bonemed_fu, priorfrac:premeno, priorfrac:bonemed, momfrac:bonemed, height:smoke, smoke:raterisk, smoke:bonemed, raterisk:bonemed, raterisk:bonemed_fu, weight:bonemed

Here are the variables in frequency of appearance in feature selection models:
3: height, momfrac, smoke, fracscore, momfrac:bonemed, smoke:bonemed, raterisk:bonemed_fu
2: premeno, priorfrac:premeno, priorfrac:bonemed, priorfrac:weight, weight:raterisk, weight:bonemed_fu, raterisk:bonemed
1: priorfrac, bonemed_fu, priorfrac:momfrac, momfrac:raterisk, momfrac:smoke, height:smoke, smoke:raterisk, weight:bonemed

Here, I do the same thing as above including polynomial terms also. I refer back to EDA for possible polynomial trends (in weight), add those to the final model from Objective 1, and then perform feature selection. Again, the stepwise feature selection code is commented out to keep the output shorter.

# Objective 1 model with added interactions and polynomial terms based on EDA
formula2 = fracture ~ priorfrac + weight + poly(weight, 2) + height + premeno + momfrac + armassist + smoke + raterisk + fracscore + bonemed + bonemed_fu + priorfrac:premeno + priorfrac:momfrac + priorfrac:bonemed + priorfrac:bonemed_fu + priorfrac:fracscore + priorfrac:weight + priorfrac:poly(weight, 2) + momfrac:armassist + momfrac:raterisk + momfrac:bonemed + momfrac:bonemed_fu + momfrac:smoke + smoke:height + smoke:raterisk + smoke:bonemed + smoke:bonemed_fu + raterisk:premeno + raterisk:bonemed + raterisk:bonemed_fu + raterisk:weight + raterisk:poly(weight, 2) + weight:bonemed + poly(weight, 2):bonemed + weight:bonemed_fu + poly(weight, 2):bonemed_fu

# Here, I use penalized regression for feature selection and stepwise feature selection after including the added interactions above and adding polynomial terms for weight. I'd like to get a sense of the important predictors.    

# logLoss error metric
fitControl<-trainControl(method="repeatedcv",number=10,repeats=1,classProbs=TRUE, summaryFunction=mnLogLoss)

# GLMNET
set.seed(1234)
glmnet.fit<-train(formula2,
                    data=training,
                    method="glmnet",
                    trControl=fitControl,
                    metric="logLoss"
)
coef(glmnet.fit$finalModel,glmnet.fit$finalModel$lambdaOpt)

52 x 1 sparse Matrix of class "dgCMatrix"
                                  s=0.02283267
(Intercept)                       2.0731617473
priorfracYes                      .           
weight                            .           
poly(weight, 2)1                  .           
poly(weight, 2)2                 -0.0009882277
height                           -0.0249281452
premenoYes                        .           
momfracYes                        0.6755778494
armassistYes                      .           
smokeYes                         -0.0962459861
rateriskSame                      .           
rateriskGreater                   .           
fracscore                         0.1385583904
bonemedYes                        .           
bonemed_fuYes                     .           
priorfracYes:premenoYes           .           
priorfracYes:momfracYes           .           
priorfracYes:bonemedYes           .           
priorfracYes:bonemed_fuYes        .           
priorfracYes:fracscore            .           
priorfracYes:weight               0.0056124645
priorfracYes:poly(weight, 2)1     .           
priorfracYes:poly(weight, 2)2     .           
momfracYes:armassistYes           .           
momfracYes:rateriskSame           .           
momfracYes:rateriskGreater        .           
momfracYes:bonemedYes            -0.3756747469
momfracYes:bonemed_fuYes          .           
momfracYes:smokeYes               .           
height:smokeYes                   .           
smokeYes:rateriskSame             .           
smokeYes:rateriskGreater          .           
smokeYes:bonemedYes               1.8299195816
smokeYes:bonemed_fuYes            .           
premenoYes:rateriskSame           .           
premenoYes:rateriskGreater        .           
rateriskSame:bonemedYes           .           
rateriskGreater:bonemedYes        .           
rateriskSame:bonemed_fuYes        0.2622511787
rateriskGreater:bonemed_fuYes     .           
weight:rateriskSame               .           
weight:rateriskGreater            0.0008896317
poly(weight, 2)1:rateriskSame     .           
poly(weight, 2)2:rateriskSame     .           
poly(weight, 2)1:rateriskGreater  3.1471920371
poly(weight, 2)2:rateriskGreater  .           
weight:bonemedYes                 .           
poly(weight, 2)1:bonemedYes       .           
poly(weight, 2)2:bonemedYes       0.0169743280
weight:bonemed_fuYes              0.0023961222
poly(weight, 2)1:bonemed_fuYes    .           
poly(weight, 2)2:bonemed_fuYes    .

# includes poly(weight, 2), height, momfrac, smoke, fracscore, priorfrac:weight, momfrac:bonemed, smoke:bonemed, raterisk:bonemed_fu, weight:raterisk, poly(weight, 2):raterisk, poly(weight, 2):bonemed, weight_bonemed_fu


# LASSO
set.seed(1234)
lasso.fit<-train(formula2,
                    data=training,
                    method="glmnet",
                    trControl=fitControl,
                    metric="logLoss",
                    tuneGrid = expand.grid(alpha = 1,  # Lasso penalty
                                          lambda = seq(0.01, 10, length = 200))  # Range of lambda values to test
)
coef(lasso.fit$finalModel,lasso.fit$finalModel$lambdaOpt)

52 x 1 sparse Matrix of class "dgCMatrix"
                                 s=0.06020101
(Intercept)                      -1.525625420
priorfracYes                      .          
weight                            .          
poly(weight, 2)1                  .          
poly(weight, 2)2                  .          
height                            .          
premenoYes                        .          
momfracYes                        .          
armassistYes                      .          
smokeYes                          .          
rateriskSame                      .          
rateriskGreater                   .          
fracscore                         0.096590670
bonemedYes                        .          
bonemed_fuYes                     .          
priorfracYes:premenoYes           .          
priorfracYes:momfracYes           .          
priorfracYes:bonemedYes           .          
priorfracYes:bonemed_fuYes        .          
priorfracYes:fracscore            .          
priorfracYes:weight               0.002845022
priorfracYes:poly(weight, 2)1     .          
priorfracYes:poly(weight, 2)2     .          
momfracYes:armassistYes           .          
momfracYes:rateriskSame           .          
momfracYes:rateriskGreater        .          
momfracYes:bonemedYes             .          
momfracYes:bonemed_fuYes          .          
momfracYes:smokeYes               .          
height:smokeYes                   .          
smokeYes:rateriskSame             .          
smokeYes:rateriskGreater          .          
smokeYes:bonemedYes               0.100878082
smokeYes:bonemed_fuYes            .          
premenoYes:rateriskSame           .          
premenoYes:rateriskGreater        .          
rateriskSame:bonemedYes           .          
rateriskGreater:bonemedYes        .          
rateriskSame:bonemed_fuYes        .          
rateriskGreater:bonemed_fuYes     .          
weight:rateriskSame               .          
weight:rateriskGreater            .          
poly(weight, 2)1:rateriskSame     .          
poly(weight, 2)2:rateriskSame     .          
poly(weight, 2)1:rateriskGreater  .          
poly(weight, 2)2:rateriskGreater  .          
weight:bonemedYes                 .          
poly(weight, 2)1:bonemedYes       .          
poly(weight, 2)2:bonemedYes       .          
weight:bonemed_fuYes              .          
poly(weight, 2)1:bonemed_fuYes    .          
poly(weight, 2)2:bonemed_fuYes    .

# fracscore, priorfrac:weight, smoke:bonemed


#STEPWISE
# set.seed(1234)
# #note CV and error metric are not really used here, but logLoss is reported for the final model.
# Step.fit<-train(formula2,
#                     data=training,
#                     method="glmStepAIC",
#                     trControl=fitControl,
#                     metric="logLoss")
# coef(Step.fit$finalModel)
# poly(weight, 2), height, premeno, smoke, fracscore, bonemed_fu, priorfrac:premeno, priorfrac:bonemed, priorfrac:weight, priorfrac:poly(weight, 2), momfrac:raterisk, momfrac:bonemed, height:smoke, smoke:raterisk, smoke:bonemed, raterisk:bonemed, raterisk:bonemed_fu, weight:bonemed, poly(weight, 2):bonemed

Here are the variables in frequency of appearance in feature selection models:
- 3: fracscore, priorfrac:weight, smoke:bonemed
- 2: poly(weight, 2), height, smoke, momfrac:bonemed, raterisk:bonemed_fu, poly(weight, 2):bonemed
- 1: momfrac, premeno, bonemed_fu, weight:raterisk, poly(weight, 2):raterisk, weight:bonemed_fu, priorfrac:premeno, priorfrac:bonemed, priorfrac:poly(weight, 2), momfrac:raterisk, smoke:raterisk, raterisk:bonemed, weight:bonemed, poly(weight, 2):bonemed

Here, I fit a bunch of models based on the feature selection steps above. I compare the number of predictors, logloss error from both the k-fold cross-validation and validation set, AUROC, sensitivity, specificity, PPV, NPV, and prevalence based on a threshold calculated to achieve a set PPV to make comparison of the other metrics easier.

# List of model formulas
models <- list(
  obj1fit = fracture ~ priorfrac + weight + height + premeno + momfrac + armassist + smoke + raterisk + fracscore + bonemed + bonemed_fu,
  # intfit1 = fracture ~ height + momfrac + smoke + fracscore + priorfrac:weight + momfrac:bonemed + smoke:bonemed + raterisk:bonemed_fu + weight:raterisk + weight:bonemed_fu,
  # intfit2_orig = fracture ~ height + premeno + momfrac + smoke + fracscore + priorfrac:premeno + priorfrac:momfrac + priorfrac:bonemed + priorfrac:weight + momfrac:raterisk + momfrac:bonemed + momfrac:smoke + smoke:bonemed + raterisk:bonemed + raterisk:bonemed_fu + weight:raterisk + weight:bonemed_fu,
  # intfit2_modA = fracture ~ height + premeno + momfrac + smoke + fracscore + priorfrac:premeno + priorfrac:momfrac + priorfrac:bonemed + priorfrac:weight + momfrac:raterisk + momfrac:bonemed + momfrac:smoke + smoke:bonemed + raterisk:bonemed,
  # intfit2_modB = fracture ~ height + premeno + momfrac + smoke + fracscore + priorfrac:premeno + priorfrac:bonemed + priorfrac:weight + momfrac:raterisk + momfrac:bonemed + momfrac:smoke + smoke:bonemed + raterisk:bonemed,
  # intfit2b = fracture ~ height + premeno + momfrac + smoke + fracscore + priorfrac:bonemed + priorfrac:weight + momfrac:raterisk + momfrac:bonemed + momfrac:smoke + smoke:bonemed + raterisk:bonemed_fu,
  intfit2 = fracture ~ priorfrac + bonemed + bonemed_fu + height + weight + raterisk + premeno + momfrac + smoke + fracscore + priorfrac:bonemed + priorfrac:weight + momfrac:raterisk + momfrac:bonemed + momfrac:smoke + smoke:bonemed + raterisk:bonemed_fu,
  # intfit3 = fracture ~ priorfrac + height + premeno + momfrac + smoke + fracscore + bonemed_fu + priorfrac:premeno + priorfrac:bonemed + momfrac:bonemed + height:smoke + smoke:raterisk + smoke:bonemed + raterisk:bonemed + raterisk:bonemed_fu + weight:bonemed, # borderline hoslem
  # intfit4 = fracture ~ height + momfrac + smoke + fracscore + premeno + priorfrac + bonemed_fu + momfrac:bonemed +  priorfrac:premeno + priorfrac:bonemed + priorfrac:momfrac + momfrac:smoke + height:smoke, # low hoslem
  # polyfit1 = fracture ~ poly(weight, 2) + height + momfrac + smoke + fracscore + priorfrac:weight + momfrac:bonemed + smoke:bonemed + raterisk:bonemed_fu + weight:raterisk + poly(weight, 2):raterisk + poly(weight, 2):bonemed + weight:bonemed_fu, # low AUROC
  # polyfit2 = fracture ~ fracscore + priorfrac:weight + smoke:bonemed, # low AUROC
  # polyfit3 = fracture ~  poly(weight, 2) + height + premeno + smoke + fracscore + bonemed_fu + priorfrac:premeno + priorfrac:bonemed + priorfrac:weight + priorfrac:poly(weight, 2) + momfrac:raterisk + momfrac:bonemed + height:smoke + smoke:raterisk + smoke:bonemed + raterisk:bonemed + raterisk:bonemed_fu + weight:bonemed + poly(weight, 2):bonemed, # high logloss
  polyfit4 = fracture ~ fracscore + weight + poly(weight, 2) + height + smoke + momfrac + premeno + bonemed + priorfrac +  priorfrac:premeno + priorfrac:bonemed + priorfrac:weight + priorfrac:poly(weight, 2) + smoke:bonemed + momfrac:bonemed + weight:bonemed + poly(weight, 2):bonemed
)

# Set seed for reproducibility
set.seed(1234)

# Set up trainControl for repeated CV
fitControl <- trainControl(
  method = "repeatedcv",
  number = 10,
  repeats = 1,
  classProbs = TRUE,
  summaryFunction = mnLogLoss
)

# Create a list to store CV results for each model
cv_results <- lapply(names(models), function(model_name) {
  model_formula <- models[[model_name]]
  train_result <- train(
    model_formula,
    data = training,
    method = "glm",
    family = "binomial",
    trControl = fitControl,
    metric = "logLoss"
  )
  # Return the train object and minimum log loss
  list(model = train_result, min_logloss_cv = min(train_result$results$logLoss, na.rm = TRUE))
})

# Safely calculate confusion matrix metrics
calc_confusion_metrics <- function(confusion) {
  sensitivity <- ifelse("Yes" %in% colnames(confusion), 
                        confusion["Yes", "Yes"] / sum(confusion[, "Yes"]), NA)
  specificity <- ifelse("No" %in% colnames(confusion), 
                        confusion["No", "No"] / sum(confusion[, "No"]), NA)
  ppv <- ifelse("Yes" %in% rownames(confusion), 
                confusion["Yes", "Yes"] / sum(confusion["Yes", ]), NA)
  npv <- ifelse("No" %in% rownames(confusion), 
                confusion["No", "No"] / sum(confusion["No", ]), NA)
  accuracy <- sum(diag(confusion)) / sum(confusion)
  misclassification_rate <- 1 - accuracy
  f1_score <- ifelse(!is.na(ppv) & !is.na(sensitivity) & (ppv + sensitivity) > 0,
                     2 * (ppv * sensitivity) / (ppv + sensitivity), NA)
  list(
    sensitivity = sensitivity, 
    specificity = specificity, 
    ppv = ppv, 
    npv = npv, 
    f1_score = f1_score, 
    misclassification_rate = misclassification_rate
  )
}

# Fit models and calculate metrics
results <- lapply(seq_along(models), function(i) {
  model_name <- names(models)[i]
  model_formula <- models[[model_name]]

  # Fit the model using glm
  model <- glm(model_formula, data = training, family = "binomial")

  # Predict probabilities on the validation set
  pred_probs <- predict(model, newdata = validate, type = "response")

  # Calculate ROC and AUC
  roc_obj <- pROC::roc(response = validate$fracture, predictor = pred_probs, levels = c("No", "Yes"))
  auc_value <- as.numeric(pROC::auc(roc_obj))

  # Calculate Log Loss on the validation set
  logloss_validation <- Metrics::logLoss(actual = ifelse(validate$fracture == "Yes", 1, 0), predicted = pred_probs)

  # Retrieve Log Loss from CV
  cv_logloss <- cv_results[[i]]$min_logloss_cv

  # Determine the best threshold for a desired PPV
  desired_ppv <- 0.50
  all_thresholds <- seq(0, 1, by = 0.01)
  threshold_metrics <- sapply(all_thresholds, function(threshold) {
    predicted_class <- ifelse(pred_probs >= threshold, "Yes", "No")
    confusion <- table(Predicted = predicted_class, Actual = validate$fracture)
    ppv <- ifelse("Yes" %in% rownames(confusion), 
                  confusion["Yes", "Yes"] / sum(confusion["Yes", ]), NA)
    abs(ppv - desired_ppv)
  })
  best_threshold <- all_thresholds[which.min(threshold_metrics)]

  # Metrics at the best threshold
  predicted_class <- ifelse(pred_probs >= best_threshold, "Yes", "No")
  confusion <- table(Predicted = predicted_class, Actual = validate$fracture)
  metrics <- calc_confusion_metrics(confusion)

  # Perform the Hosmer-Lemeshow test
  n_groups <- min(10, floor(nrow(training) / 5))
  hoslem_test <- tryCatch(hoslem.test(model$y, fitted(model), g = n_groups), error = function(e) NULL)
  hoslem_pval <- if (!is.null(hoslem_test)) hoslem_test$p.value else NA

  # Return all metrics
  list(
    Model = model_name,
    AUROC = auc_value,
    LogLoss_Validation = logloss_validation,
    LogLoss_CV = cv_logloss,
    HoslemPValue = hoslem_pval,
    Sensitivity = metrics$sensitivity,
    Specificity = metrics$specificity,
    PPV = metrics$ppv,
    NPV = metrics$npv,
    F1_Score = metrics$f1_score,
    MisclassificationRate = metrics$misclassification_rate,
    Threshold = best_threshold
  )
})

# Combine results into a data frame
results_df <- do.call(rbind, lapply(results, as.data.frame))

# Sort and display results
results_df_sorted <- results_df[order(-results_df$AUROC, results_df$LogLoss_Validation), ]
print(results_df_sorted)

     Model     AUROC LogLoss_Validation LogLoss_CV HoslemPValue Sensitivity
2  intfit2 0.7625483          0.5000117  0.6159578    0.3592725   0.6756757
1  obj1fit 0.7434846          0.5111676  0.5471671    0.9027294   0.5135135
3 polyfit4 0.7379344          0.5339562  0.5336331    0.4554668   0.7567568
  Specificity PPV       NPV  F1_Score MisclassificationRate Threshold
2   0.7767857 0.5 0.8787879 0.5747126             0.2483221      0.29
1   0.8303571 0.5 0.8378378 0.5066667             0.2483221      0.32
3   0.7500000 0.5 0.9032258 0.6021505             0.2483221      0.27

After adjusting the model’s prediction thresholds based on a constant PPV, two complex models seems to perform pretty well, with a comparatively low logloss, high AUROC, high sensitivity. Additionally, the Hosmer-Lemeshow test doesn’t suggest evidence against goodness of fit.

Objective 1 Model: fracture ~ priorfrac + weight + height + premeno + momfrac + armassist + smoke + raterisk + fracscore + bonemed + bonemed_fu

Obj 2 Model with Interactions (intfit2): fracture ~ priorfrac + weight + height + premeno + momfrac + smoke + raterisk + fracscore + bonemed + bonemed_fu + priorfrac:bonemed + priorfrac:weight + momfrac:raterisk + momfrac:bonemed + momfrac:smoke + smoke:bonemed + raterisk:bonemed_fu

Obj 2 Model with Interactions and Polynomials (polyfit4): fracture ~ fracscore + weight + poly(weight, 2) + height + smoke + momfrac + premeno + bonemed + priorfrac + priorfrac:premeno + priorfrac:bonemed + priorfrac:weight + priorfrac:poly(weight, 2) + smoke:bonemed + momfrac:bonemed + weight:bonemed + poly(weight, 2):bonemed

Here, I plot ROC curves of the Objective 1 model and the Objective 2 candidate models.

# Fit the models
# Tue's
tue <- glm(
  formula = fracture ~ priorfrac + age + weight + height + momfrac + armassist +  raterisk + bonemed,
  data = training,
  family = "binomial"
)

# Tom's
tom <- glm(
  formula = fracture ~ priorfrac + fracscore + weight + height + momfrac + armassist + raterisk + bonemed_fu,
  data = training,
  family = "binomial"
)

obj1fit = glm(fracture ~ priorfrac + weight + height + premeno + momfrac + armassist + smoke + raterisk + fracscore + bonemed + bonemed_fu, data = training, family = "binomial")

intfit2 = glm(fracture ~ priorfrac + weight + height + premeno + momfrac + smoke + raterisk + fracscore + bonemed + bonemed_fu + priorfrac:bonemed + priorfrac:weight + momfrac:raterisk + momfrac:bonemed + momfrac:smoke + smoke:bonemed + raterisk:bonemed_fu, data = training, family = "binomial")

polyfit4 = glm(fracture ~ fracscore + weight + poly(weight, 2) + height + smoke + momfrac + premeno + bonemed + priorfrac +  priorfrac:premeno + priorfrac:bonemed + priorfrac:weight + priorfrac:poly(weight, 2) + smoke:bonemed + momfrac:bonemed + weight:bonemed + poly(weight, 2):bonemed, data = training, family = "binomial")


# Predicting probabilities on the validation data
tue.predprobs <- predict(tue, validate, type = "response")
tom.predprobs <- predict(tom, validate, type = "response")
obj1fit.predprobs <- predict(obj1fit, validate, type = "response")
intfit2.predprobs <- predict(intfit2, validate, type = "response")
polyfit4.predprobs <- predict(polyfit4, validate, type = "response")

# Calculate ROC and AUROC
tue.roc <- roc(response = validate$fracture, predictor = tom.predprobs, levels = c("No", "Yes"))
tom.roc <- roc(response = validate$fracture, predictor = tue.predprobs, levels = c("No", "Yes"))
obj1fit.roc <- roc(response = validate$fracture, predictor = obj1fit.predprobs, levels = c("No", "Yes"))
intfit2.roc <- roc(response = validate$fracture, predictor = intfit2.predprobs, levels = c("No", "Yes"))
polyfit4.roc <- roc(response = validate$fracture, predictor = polyfit4.predprobs, levels = c("No", "Yes"))

# Extract AUROC values
tue.auc <- round(auc(obj1fit.roc), 3)
tom.auc <- round(auc(obj1fit.roc), 3)
obj1fit.auc <- round(auc(obj1fit.roc), 3)
intfit2.auc <- round(auc(intfit2.roc), 3)
polyfit4.auc <- round(auc(polyfit4.roc), 3)

# Save plot to a PNG file
# png("figures/ROC_Curves_with_AUROC.png", width = 800, height = 600)

# Plot ROC curves
plot(obj1fit.roc, col = "gray40", lwd = 2, lty = 1, main = "ROC Curves for Candidate Models") #print.thres="best"
plot(tue.roc, col = "purple3", lwd = 2, lty = 1, add = TRUE)
plot(tom.roc, col = "green4", lwd = 2, lty = 1, add = TRUE)
plot(intfit2.roc, col = "red3", lwd = 2, lty = 1, add = TRUE)
plot(polyfit4.roc, col = "blue2", lwd = 2, lty = 1, add = TRUE) 

# Add legend with AUROC values
legend("bottomright",
       legend = c(
         paste("Obj1 (AUROC =", obj1fit.auc, ")"),
         paste("Tue (AUROC =", tue.auc, ")"),
         paste("Tom (AUROC =", tom.auc, ")"),
         paste("Int (AUROC =", intfit2.auc, ")"),
         paste("Poly (AUROC =", polyfit4.auc, ")")
       ),
       col = c("gray40", "purple3","green4", "red3", "blue2"),
       lwd = 2, lty = c(1, 1, 1, 1, 1), cex = 1, xpd = TRUE, horiz = FALSE)

# Close the graphics device
# dev.off()

Here as before, I want to optimize the threshold of the chosen model(s) to achieve the highest PPV possible with a specificity of at least 0.80. I want to ensure that at least 80% of the patients at high risk of fractures receive treatment and minimize the number of women at low risk that are treated unnecessarily.

# Fit the model on the training data
final_model <- glm(
  formula = fracture ~ priorfrac + weight + height + premeno + momfrac + smoke + raterisk + fracscore + bonemed + bonemed_fu + priorfrac:bonemed + priorfrac:weight + momfrac:raterisk + momfrac:bonemed + momfrac:smoke + smoke:bonemed + raterisk:bonemed_fu,
  data = training,
  family = "binomial"
)

# Predict probabilities on the validation data
pred_probs <- predict(final_model, newdata = validate, type = "response")

# Define a target Sensitivity threshold
target_sensitivity <- 0.8  # Adjust to 0.75, 0.85, or desired value

# Function to calculate performance metrics for a given threshold
calculate_metrics <- function(threshold, pred_probs, actual) {
  # Predicted classes based on threshold
  pred_class <- ifelse(pred_probs > threshold, "Yes", "No")
  
  # Ensure factors have the same levels
  actual <- factor(actual, levels = c("No", "Yes"))
  pred_class <- factor(pred_class, levels = c("No", "Yes"))
  
  # Generate the confusion matrix
  cm <- table(Predicted = pred_class, Actual = actual)
  
  # Extract counts from the confusion matrix
  tp <- cm["Yes", "Yes"]  # True Positives
  fp <- cm["Yes", "No"]   # False Positives
  tn <- cm["No", "No"]    # True Negatives
  fn <- cm["No", "Yes"]   # False Negatives
  
  # Safely calculate metrics
  ppv <- ifelse((tp + fp) > 0, tp / (tp + fp), NA)  # Positive Predictive Value (Precision)
  sensitivity <- ifelse((tp + fn) > 0, tp / (tp + fn), NA)  # Sensitivity (Recall)
  specificity <- ifelse((tn + fp) > 0, tn / (tn + fp), NA)  # Specificity
  accuracy <- ifelse((tp + tn + fp + fn) > 0, (tp + tn) / (tp + tn + fp + fn), NA)  # Accuracy

  # Return metrics as a data frame
  metrics <- data.frame(
    Threshold = threshold,
    Sensitivity = sensitivity,
    Specificity = specificity,
    PPV = ppv,
    Accuracy = accuracy
  )
  return(metrics)
}

# Generate metrics for a refined range of thresholds
thresholds <- seq(0.1, 0.4, by = 0.01)  # Narrowed range
metrics_list <- lapply(thresholds, calculate_metrics, pred_probs = pred_probs, actual = validate$fracture)
metrics_df <- do.call(rbind, metrics_list)

# Filter metrics for Sensitivity >= target_sensitivity
valid_metrics <- metrics_df[metrics_df$Sensitivity >= target_sensitivity, ]

# Ensure there are valid thresholds
if (nrow(valid_metrics) == 0) {
  stop("No thresholds meet the target Sensitivity. Try lowering target_sensitivity.")
}

# Find the best threshold (maximizing PPV)
best_threshold <- valid_metrics$Threshold[which.max(valid_metrics$PPV)]
best_metrics <- valid_metrics[valid_metrics$Threshold == best_threshold, ]

# Output the best threshold and its metrics
cat("Best threshold with Sensitivity >=", target_sensitivity, ":\n")

Best threshold with Sensitivity >= 0.8 :

print(best_metrics)

   Threshold Sensitivity Specificity       PPV  Accuracy
11       0.2   0.8108108   0.6428571 0.4285714 0.6845638

# Plot metrics to visualize trade-offs
metrics_df_long <- melt(metrics_df, id.vars = "Threshold", variable.name = "Metric", value.name = "Value")

plot_metrics <- ggplot(metrics_df_long, aes(x = Threshold, y = Value, color = Metric)) +
  geom_line() +
  geom_vline(xintercept = best_threshold, linetype = "dashed", color = "gray50") +
  geom_text(aes(x = best_threshold, y = 0.5, 
                label = paste("Best Threshold =", round(best_threshold, 2))),
            color = "gray50", angle = 90, hjust = -0.2, size = 2.5) +
  labs(title = "Model Performance Metrics by Threshold", 
       x = "Threshold", y = "Metric Value") +
  theme_bw() +
  scale_color_brewer(palette = "Set1")

print(plot_metrics)

# Save the plots and metrics if needed
# ggsave("figures/Obj2model_metrics_plotV2.png", plot_metrics)
# write.csv(metrics_df, "figures/Obj2model_metrics_by_thresholdV2.csv", row.names = FALSE)

I found a threshold (0.20), which allowed the model to achieve the largest PPV once a minimum Sensitivity of 0.80 was reached.

There is a trade-off between over-treating or under-treating. We have to consider the cost of someone at high risk having a fracture because they were not diagnosed and treated. Additionally we need to weigh that against the cost of treating unnecessarily, e.g. inconvenience to the patient, cost of the treatment, and treatment side effects. If we optimize sensitivity with a lower threshold, then we err on the side of over-treating. It seems best to choose a minimum proportion of those at risk that we want to be certain receive treatment and after reaching that minimum try to minimize the cost of treating unnecessarily by making our PPV as high as possible to ensure our positive predictions are correct.

This looks at the bias-variance trade-off.

# Define models of increasing complexity
models <- list(
  "Simple" = fracture ~ priorfrac + momfrac + bonemed_fu,
  "Additive" = fracture ~ priorfrac + weight + height + premeno + momfrac + armassist + smoke + raterisk + fracscore + bonemed + bonemed_fu,
  "Tue" = fracture ~ priorfrac + age + weight + height + momfrac + armassist +  raterisk + bonemed,
  "Tom" = fracture ~ priorfrac + fracscore + weight + height + momfrac + armassist + raterisk + bonemed_fu,
  "Multiplicative" = fracture ~ priorfrac + weight + height + premeno + momfrac + smoke + raterisk + fracscore + bonemed + bonemed_fu + priorfrac:bonemed + priorfrac:weight + momfrac:raterisk + momfrac:bonemed + momfrac:smoke + smoke:bonemed + raterisk:bonemed_fu, 
  "Quadratic" = fracture ~ fracscore + weight + poly(weight, 2) + height + smoke + momfrac + premeno + bonemed + priorfrac +  priorfrac:premeno + priorfrac:bonemed + priorfrac:weight + priorfrac:poly(weight, 2) + smoke:bonemed + momfrac:bonemed + weight:bonemed + poly(weight, 2):bonemed
)

# Initialize an empty data frame to store errors and metrics
errors <- data.frame()

# Train models and calculate errors
set.seed(1234)
for (i in seq_along(models)) {
  model_name <- names(models)[i]
  model_formula <- models[[i]]

  # Fit model
  model <- glm(model_formula, data = training, family = "binomial")
  
  # Predict on training and validation sets
  train_preds <- predict(model, newdata = training, type = "response")
  val_preds <- predict(model, newdata = validate, type = "response")

  # Compute log-loss (or classification error)
  train_error <- Metrics::logLoss(actual = ifelse(training$fracture == "Yes", 1, 0), predicted = train_preds)
  val_error <- Metrics::logLoss(actual = ifelse(validate$fracture == "Yes", 1, 0), predicted = val_preds)
  
  # Calculate ROC metric for the validation set
  roc_obj <- pROC::roc(response = validate$fracture, predictor = val_preds, levels = c("No", "Yes"))
  roc_auc <- as.numeric(pROC::auc(roc_obj))
  
  # Store results
  errors <- rbind(errors, data.frame(
    Model = model_name,
    Complexity = i,  # Higher index -> more complex
    TrainingError = train_error,
    ValidationError = val_error,
    ROC = roc_auc
  ))
}

# Plot training and validation errors vs complexity
ggplot(errors, aes(x = Complexity)) +
  geom_line(aes(y = TrainingError, color = "Training Logloss Error"), linewidth = 1) +
  geom_line(aes(y = ValidationError, color = "Validation Logloss Error"), linewidth = 1) +
  geom_line(aes(y = ROC, color = "ROC"), linewidth = 1, linetype = "dotted") +
  geom_point(aes(y = TrainingError, color = "Training Logloss Error"), size = 1.5) +
  geom_point(aes(y = ValidationError, color = "Validation Logloss Error"), size = 1.5) +
  geom_point(aes(y = ROC, color = "ROC"), size = 1) +
  scale_x_continuous(breaks = 1:length(models), labels = names(models)) +
  labs(
    title = "Bias-Variance Tradeoff",
    x = "Model Complexity",
    y = "Metric",
    color = "Metric Type"
  ) +
  theme_bw() + 
  scale_color_manual(values = c("gray30", "turquoise4", "darkorange3"))

Here, I will get the coefficients for the final Objective 2 model for prediction and set a threshold balancing the trade-offs between misclassifications, over-treating, and undertreating to get a confusion matrix. Optimizing the threshold is very subjective, and it is something we may want to revisit. I have choosen to achieve at least 80% sensitivity to avoid missing high risk patients, and then get as high a PPV as possible to unnecessarily treat as few as possible.

# Fit the model on the training data
final_model <- glm(
  fracture ~ priorfrac + weight + height + premeno + momfrac + smoke + raterisk + fracscore + bonemed + bonemed_fu + priorfrac:bonemed + priorfrac:weight + momfrac:raterisk + momfrac:bonemed + momfrac:smoke + smoke:bonemed + raterisk:bonemed_fu,
  data = training,
  family = "binomial"
)

# Predict probabilities on the validation data
pred_probs <- predict(final_model, newdata = validate, type = "response")

# Get the model summary
# summary(final_model)

# Calculate coefficients and confidence intervals via odds ratios
# Align the confidence intervals with the coefficients
conf_intervals <- confint(final_model)
coefficients <- coef(summary(final_model))

# Find common row names to align them
common_rows <- intersect(rownames(coefficients), rownames(conf_intervals))

# Subset both to only include rows present in both
aligned_coefficients <- coefficients[common_rows, , drop = FALSE]
aligned_conf_intervals <- conf_intervals[common_rows, , drop = FALSE]

# Construct the summary table
summary_table <- cbind(
  OR = exp(aligned_coefficients[, 1]), # Odds Ratios
  `Lower CI` = exp(aligned_conf_intervals[, 1]), # Lower 95% CI
  `Upper CI` = exp(aligned_conf_intervals[, 2]), # Upper 95% CI
  `p-value` = aligned_coefficients[, 4] # p-values
)

# Convert to data frame for readability
summary_df <- as.data.frame(summary_table)
summary_df <- cbind(Variable = rownames(summary_table), summary_df)
rownames(summary_df) <- NULL

# Print the summary table
Obj2Model_table = gt(summary_df)
Obj2Model_table

Variable	OR	Lower CI	Upper CI	p-value
(Intercept)	1.429096e+02	8.796145e-02	2.786799e+05	0.19211779
priorfracYes	8.735511e-01	5.094300e-02	1.477655e+01	0.92520717
weight	1.003792e+00	9.786629e-01	1.028368e+00	0.76319893
height	9.541155e-01	9.097152e-01	9.990308e-01	0.04843128
premenoYes	1.492206e+00	7.157343e-01	3.050440e+00	0.27664087
momfracYes	2.327037e+00	3.043107e-01	1.248026e+01	0.34962474
smokeYes	2.989384e-01	4.276989e-02	1.227005e+00	0.14115928
rateriskSame	6.613486e-01	2.779474e-01	1.526743e+00	0.33810101
rateriskGreater	8.848926e-01	3.243645e-01	2.298111e+00	0.80527645
fracscore	1.226205e+00	1.081536e+00	1.395238e+00	0.00163925
bonemedYes	9.499833e-01	2.583419e-01	3.410782e+00	0.93747251
bonemed_fuYes	6.828775e-01	8.545905e-02	3.648911e+00	0.67934261
priorfracYes:bonemedYes	4.832539e-01	1.307311e-01	1.780721e+00	0.27339477
priorfracYes:weight	1.013018e+00	9.767862e-01	1.051070e+00	0.48671485
momfracYes:rateriskSame	2.337376e+00	2.909083e-01	2.415223e+01	0.43910281
momfracYes:rateriskGreater	2.331703e+00	2.746159e-01	2.543676e+01	0.45259513
momfracYes:bonemedYes	1.005446e-01	1.488388e-02	5.672243e-01	0.01201033
momfracYes:smokeYes	7.081387e-07	NA	9.234488e+121	0.99223692
smokeYes:bonemedYes	4.694597e+07	1.511664e-37	NA	0.98225338
rateriskSame:bonemed_fuYes	6.155011e+00	1.021168e+00	5.348160e+01	0.06332783
rateriskGreater:bonemed_fuYes	2.612131e+00	4.356860e-01	2.262182e+01	0.32547658

# Save the table as a png for ppt
# gtsave(Obj2Model_table, "figures/Obj2Model_tableV2.png")

# Change the threshold for at least 0.8 sensitivity and highest PPV and create the confusion matrix
threshold <- 0.20 # overtreat 40, undertreat 7 -> incorrect 47, correct 102

# Ensure consistent factor levels
obj2.preds <- factor(ifelse(pred_probs > threshold, "Yes", "No"), levels = c("No", "Yes"))
validate$fracture <- factor(validate$fracture, levels = c("No", "Yes"))

# Confusion matrix with explicit positive class
conf_matrix <- confusionMatrix(data = obj2.preds, reference = validate$fracture, positive = "Yes")
print(conf_matrix)

Confusion Matrix and Statistics

          Reference
Prediction No Yes
       No  72   7
       Yes 40  30
                                          
               Accuracy : 0.6846          
                 95% CI : (0.6035, 0.7582)
    No Information Rate : 0.7517          
    P-Value [Acc > NIR] : 0.9745          
                                          
                  Kappa : 0.3493          
                                          
 Mcnemar's Test P-Value : 3.046e-06       
                                          
            Sensitivity : 0.8108          
            Specificity : 0.6429          
         Pos Pred Value : 0.4286          
         Neg Pred Value : 0.9114          
             Prevalence : 0.2483          
         Detection Rate : 0.2013          
   Detection Prevalence : 0.4698          
      Balanced Accuracy : 0.7268          
                                          
       'Positive' Class : Yes

The formula of the final Objective 2 logistic regression model is:
fracture ~ priorfrac + weight + height + premeno + momfrac + smoke + raterisk + fracscore + bonemed + bonemed_fu + priorfrac:bonemed + priorfrac:weight + momfrac:raterisk + momfrac:bonemed + momfrac:smoke + smoke:bonemed + raterisk:bonemed_fu.

With a threshold of 0.20 to prioritize treating high risk individuals while maintain reasonable avoidance of unnecessary intervention for low risk individuals, the model achieves the following metrics: - AUROC: 0.76 - Sensitivity: 0.81
- Specificity: 0.64 - PPV: 0.43
- NPV: 0.91
- Prevalence: 0.25
- Logloss Validation: 0.50 - Logloss CV: 0.62
- Hoslem p-value: 0.36

Discriminant Analysis Model

Here, I try different predictors in both LDA and QDA. The caret package essentially dummetizes the categorical predictors, which may weaken the assumption of MVN.

set.seed(1234)

# Define trainControl
fitControl <- trainControl(method = "repeatedcv", 
                           number = 10, 
                           repeats = 1, 
                           classProbs = TRUE, 
                           summaryFunction = mnLogLoss)

# Ensure levels of 'fracture' are consistent in both datasets
training$fracture <- factor(training$fracture, levels = c("No", "Yes"))
validate$fracture <- factor(validate$fracture, levels = c("No", "Yes"))

# Define predictor combinations
model_formulas <- list(
  obj1fit = fracture ~ priorfrac + weight + height + premeno + momfrac + armassist + smoke + raterisk + fracscore + bonemed + bonemed_fu,
  fit_num1 = fracture ~ weight + height + fracscore,
  # fit_numInt1 = fracture ~ weight + height + weight:height + fracscore,
  # fit_numPoly1 = fracture ~ weight + poly(weight, 2) + height + fracscore,
  # fit_num2 = fracture ~ weight + height + fracscore + age,
  fit_num3 = fracture ~ bmi + fracscore + age + weight + height,
  # fit_numPoly3 = fracture ~ bmi + fracscore + age + weight + poly(weight, 2) + height,
  # fit3 = fracture ~ priorfrac + weight + height + premeno + momfrac + armassist + raterisk + fracscore + bonemed_fu,
  # fit5 = fracture ~ priorfrac + weight + height + premeno + momfrac + armassist + smoke + raterisk + fracscore + bonemed_fu,
  fit2b = fracture ~ priorfrac + weight + height + momfrac + smoke + raterisk + fracscore + bonemed_fu,
  fit4b = fracture ~ priorfrac + weight + height + momfrac + raterisk + fracscore + bonemed_fu
  # Multiplicative = fracture ~ height + premeno + momfrac + smoke + fracscore + priorfrac:bonemed + priorfrac:weight + momfrac:raterisk + momfrac:bonemed + momfrac:smoke + smoke:bonemed + raterisk:bonemed_fu
  # Quadratic = fracture ~ fracscore + weight + poly(weight, 2) + height + smoke + momfrac + premeno + bonemed + priorfrac + priorfrac:premeno + priorfrac:bonemed + priorfrac:weight + priorfrac:poly(weight, 2) + smoke:bonemed + momfrac:bonemed + weight:bonemed + poly(weight, 2):bonemed
)

# Define thresholds to test
thresholds <- seq(0.1, 0.9, by = 0.01)

# Function to calculate log loss manually
calculate_logloss <- function(obs, pred_probs) {
  epsilon <- 1e-15 # Avoid log(0)
  pred_probs <- pmax(pmin(pred_probs, 1 - epsilon), epsilon)
  -mean(ifelse(obs == "Yes", log(pred_probs), log(1 - pred_probs)))
}

# Function to evaluate a single model/formula
evaluate_model <- function(model, formula_name, formula, training, validate, thresholds) {
  # Try-catch to handle model fitting errors
  fit <- tryCatch({
    train(
      formula,
      data = training,
      method = model,
      trControl = fitControl,
      metric = "logLoss"
    )
  }, error = function(e) return(NULL))
  
  # Skip evaluation if model training failed
  if (is.null(fit)) return(NULL)
  
  # Extract CV log loss
  cv_logloss <- min(fit$results$logLoss, na.rm = TRUE)
  
  # Compute predicted probabilities on validation set
  predictions <- predict(fit, validate, type = "prob")[, "Yes"]
  
  # Compute validation log loss
  validation_logloss <- calculate_logloss(validate$fracture, predictions)
  
  # Initialize result storage
  best_result <- NULL
  
  # Loop through thresholds
  for (threshold in thresholds) {
    preds <- factor(ifelse(predictions > threshold, "Yes", "No"), levels = c("No", "Yes"))
    cm <- confusionMatrix(data = preds, reference = validate$fracture, positive = "Yes")
    
    # Calculate additional metrics
    auroc <- auc(validate$fracture == "Yes", predictions)
    ppv <- cm$byClass["Pos Pred Value"]
    sensitivity <- cm$byClass["Sensitivity"]
    specificity <- cm$byClass["Specificity"]
    
    # Filter for models meeting Sensitivity >= 0.8
    if (sensitivity >= 0.8) {
      result <- data.frame(
        Model = model,
        FormulaName = formula_name,  # Use formula name instead of the formula itself
        Threshold = threshold,
        AUROC = auroc,
        CV_LogLoss = cv_logloss,
        Validation_LogLoss = validation_logloss,
        Sensitivity = sensitivity,
        Specificity = specificity,
        PPV = ppv,
        NPV = cm$byClass["Neg Pred Value"],
        Prevalence = cm$byClass["Prevalence"]
      )
      
      # Keep only the best (highest PPV) model for this formula
      if (is.null(best_result) || result$PPV > best_result$PPV) {
        best_result <- result
      }
    }
  }
  
  return(best_result)
}

# Initialize results storage
lda_results <- data.frame()

# Test models with LDA
for (formula_name in names(model_formulas)) {
  formula <- model_formulas[[formula_name]]
  result <- evaluate_model("lda", formula_name, formula, training, validate, thresholds)
  if (!is.null(result)) {
    lda_results <- rbind(lda_results, result)
  }
}

# Ensure AUROC is numeric before sorting
lda_results$AUROC <- as.numeric(lda_results$AUROC)

# Organize results by AUROC (desc), then Sensitivity (desc), then PPV (desc)
lda_results <- lda_results[order(-lda_results$AUROC, -lda_results$Sensitivity, -lda_results$PPV), ]
top_models <- head(lda_results, n = 3)  # Select top 3 models
print(lda_results)

             Model FormulaName Threshold     AUROC CV_LogLoss
Sensitivity3   lda       fit2b      0.20 0.7471042  0.5389635
Sensitivity4   lda       fit4b      0.20 0.7451737  0.5366474
Sensitivity    lda     obj1fit      0.17 0.7415541  0.5515056
Sensitivity2   lda    fit_num3      0.21 0.7094595  0.5407943
Sensitivity1   lda    fit_num1      0.18 0.6982384  0.5329027
             Validation_LogLoss Sensitivity Specificity       PPV       NPV
Sensitivity3          0.5144550   0.8108108   0.6517857 0.4347826 0.9125000
Sensitivity4          0.5151788   0.8378378   0.6428571 0.4366197 0.9230769
Sensitivity           0.5157334   0.8108108   0.5535714 0.3750000 0.8985507
Sensitivity2          0.5123398   0.8108108   0.5982143 0.4000000 0.9054054
Sensitivity1          0.5170010   0.8108108   0.4642857 0.3333333 0.8813559
             Prevalence
Sensitivity3  0.2483221
Sensitivity4  0.2483221
Sensitivity   0.2483221
Sensitivity2  0.2483221
Sensitivity1  0.2483221

# # Test top models with QDA
# qda_results <- data.frame()
# for (i in 1:nrow(top_models)) {
#   # Retrieve the formula using the formula name
#   formula_name <- top_models$FormulaName[i]
#   formula <- model_formulas[[formula_name]]
#   
#   # Evaluate the model with QDA
#   result <- evaluate_model("qda", formula_name, formula, training, validate, thresholds)
#   if (!is.null(result)) {
#     qda_results <- rbind(qda_results, result)
#   }
# }
# 
# # Organize QDA results by AUROC (desc), Sensitivity (desc), and PPV (desc)
# qda_results$AUROC <- as.numeric(qda_results$AUROC)
# qda_results <- qda_results[order(-qda_results$AUROC, -qda_results$Sensitivity, -qda_results$PPV), ]
# print(qda_results)

# Additional considerations
# Interaction terms: Consider them if domain knowledge suggests potential interactions.
# Composite variables: Create if logical combinations or transformations improve interpretability or predictive power.

The LDA models performed better than the QDA models.

Here, I explored how LDA would perform with the PCs rather than original numerical predictors.

# Explore LDA with the PCs from PCA
set.seed(1234)

# Split pc.scores into training and validation sets
trainIndex <- createDataPartition(pc.scores$fracture, p = 0.7, list = FALSE) # 70% training, 30% validation
pca.training <- pc.scores[trainIndex, ]
pca.validation <- pc.scores[-trainIndex, ]

# Check structure of training and validation datasets
str(pca.training)

'data.frame':   351 obs. of  6 variables:
 $ fracture: Factor w/ 2 levels "No","Yes": 1 1 1 1 1 1 1 1 1 1 ...
 $ PC1     : num  -1.452 1.595 0.209 2.748 -0.786 ...
 $ PC2     : num  0.457 0.438 -0.234 0.422 -2.034 ...
 $ PC3     : num  0.5601 -0.1976 0.0262 1.1689 -0.3956 ...
 $ PC4     : num  0.27 0.657 -0.229 0.472 0.146 ...
 $ PC5     : num  0.01149 -0.0098 0.00583 -0.08762 0.01833 ...

str(pca.validation)

'data.frame':   149 obs. of  6 variables:
 $ fracture: Factor w/ 2 levels "No","Yes": 1 1 1 1 1 1 1 1 1 1 ...
 $ PC1     : num  -0.717 3.756 -0.577 3.435 2.187 ...
 $ PC2     : num  -0.872 1.451 -0.802 -0.233 1.271 ...
 $ PC3     : num  0.823 -0.351 1.791 0.535 0.247 ...
 $ PC4     : num  0.2202 -0.5973 0.1291 -0.0211 0.421 ...
 $ PC5     : num  -0.00188 -0.02011 0.01323 -0.11383 -0.01166 ...

# Define trainControl for LDA
fitControl <- trainControl(method = "repeatedcv", 
                           number = 10, 
                           repeats = 1, 
                           classProbs = TRUE, 
                           summaryFunction = mnLogLoss)

# Fit the LDA model using PC1 and PC2 (adjust as needed)
fit.pca <- train(
  fracture ~ PC1 + PC2, # Use relevant PCs
  data = pca.training,
  method = "lda",
  trControl = fitControl,
  metric = "logLoss"
)

# Predict probabilities on validation set
pca.predictions <- predict(fit.pca, pca.validation, type = "prob")[, "Yes"]

# Apply a threshold to classify
threshold <- 0.5 # Adjust threshold as needed
pca.preds <- factor(ifelse(pca.predictions > threshold, "Yes", "No"), levels = c("No", "Yes"))

# Compute confusion matrix
pca.cm <- confusionMatrix(data = pca.preds, reference = pca.validation$fracture, positive = "Yes")
print(pca.cm)

Confusion Matrix and Statistics

          Reference
Prediction  No Yes
       No  109  35
       Yes   3   2
                                          
               Accuracy : 0.745           
                 95% CI : (0.6672, 0.8128)
    No Information Rate : 0.7517          
    P-Value [Acc > NIR] : 0.6175          
                                          
                  Kappa : 0.0384          
                                          
 Mcnemar's Test P-Value : 4.934e-07       
                                          
            Sensitivity : 0.05405         
            Specificity : 0.97321         
         Pos Pred Value : 0.40000         
         Neg Pred Value : 0.75694         
             Prevalence : 0.24832         
         Detection Rate : 0.01342         
   Detection Prevalence : 0.03356         
      Balanced Accuracy : 0.51363         
                                          
       'Positive' Class : Yes

# Print the model and important metrics
print(fit.pca)

Linear Discriminant Analysis 

351 samples
  2 predictor
  2 classes: 'No', 'Yes' 

No pre-processing
Resampling: Cross-Validated (10 fold, repeated 1 times) 
Summary of sample sizes: 315, 316, 316, 316, 316, 316, ... 
Resampling results:

  logLoss  
  0.5371827

# Predict probabilities on validation set
pca.predictions <- predict(fit.pca, pca.validation, type = "prob")[, "Yes"]

# Compute AUROC
pca.roc <- roc(pca.validation$fracture, pca.predictions, levels = c("No", "Yes"))
auroc <- auc(pca.roc)
print(paste("AUROC:", round(auroc, 3)))

[1] "AUROC: 0.677"

# Initialize a data frame to store metrics at different thresholds
thresholds <- seq(0.01, 0.99, by = 0.01)
results <- data.frame(
  Threshold = thresholds,
  Sensitivity = NA,
  Specificity = NA,
  PPV = NA,
  NPV = NA
)

# Loop through thresholds to calculate metrics
for (i in seq_along(thresholds)) {
  threshold <- thresholds[i]
  pca.preds <- factor(ifelse(pca.predictions > threshold, "Yes", "No"), levels = c("No", "Yes"))
  cm <- confusionMatrix(data = pca.preds, reference = pca.validation$fracture, positive = "Yes")
  
  # Store metrics
  results$Sensitivity[i] <- cm$byClass["Sensitivity"]
  results$Specificity[i] <- cm$byClass["Specificity"]
  results$PPV[i] <- cm$byClass["Pos Pred Value"]
  results$NPV[i] <- cm$byClass["Neg Pred Value"]
}

# Find threshold with at least 0.8 sensitivity and maximum PPV
optimal_threshold <- results %>%
  filter(Sensitivity >= 0.8) %>%
  arrange(desc(PPV)) %>%
  slice(1)

print(optimal_threshold)

  Threshold Sensitivity Specificity       PPV      NPV
1       0.2   0.8108108   0.5089286 0.3529412 0.890625

# Apply the optimal threshold to classify
final_threshold <- optimal_threshold$Threshold
pca.final.preds <- factor(ifelse(pca.predictions > final_threshold, "Yes", "No"), levels = c("No", "Yes"))
final_cm <- confusionMatrix(data = pca.final.preds, reference = pca.validation$fracture, positive = "Yes")

# Print final confusion matrix and metrics
print(final_cm)

Confusion Matrix and Statistics

          Reference
Prediction No Yes
       No  57   7
       Yes 55  30
                                         
               Accuracy : 0.5839         
                 95% CI : (0.5004, 0.664)
    No Information Rate : 0.7517         
    P-Value [Acc > NIR] : 1              
                                         
                  Kappa : 0.2229         
                                         
 Mcnemar's Test P-Value : 2.387e-09      
                                         
            Sensitivity : 0.8108         
            Specificity : 0.5089         
         Pos Pred Value : 0.3529         
         Neg Pred Value : 0.8906         
             Prevalence : 0.2483         
         Detection Rate : 0.2013         
   Detection Prevalence : 0.5705         
      Balanced Accuracy : 0.6599         
                                         
       'Positive' Class : Yes

print(paste("Final Threshold:", final_threshold))

[1] "Final Threshold: 0.2"

There didn’t seem to be an improvement.

Here, I will fit the final LDA model, adjust the threshold, and get the confusion matrix and metrics.

fitControl<-trainControl(method="repeatedcv",number=10,repeats=1,classProbs=TRUE, summaryFunction=mnLogLoss)
set.seed(1234)

lda.fit<-train(fracture ~ priorfrac + weight + height + momfrac + raterisk + fracscore + bonemed_fu,
               data=training,
               method="lda",
               trControl=fitControl,
               metric="logLoss")

#Computing predicted probabilities on the validation set
predictions <- predict(lda.fit, validate, type = "prob")[,"Yes"]

# Change the threshold to achieve high PPV with sensitivity over 0.8
threshold=0.20
lda.preds<-factor(ifelse(predictions>threshold,"Yes","No"),levels=c("Yes","No"))
confusionMatrix(data = lda.preds, reference = validate$fracture, positive="Yes")

Confusion Matrix and Statistics

          Reference
Prediction No Yes
       No  72   6
       Yes 40  31
                                          
               Accuracy : 0.6913          
                 95% CI : (0.6105, 0.7643)
    No Information Rate : 0.7517          
    P-Value [Acc > NIR] : 0.9617          
                                          
                  Kappa : 0.3676          
                                          
 Mcnemar's Test P-Value : 1.141e-06       
                                          
            Sensitivity : 0.8378          
            Specificity : 0.6429          
         Pos Pred Value : 0.4366          
         Neg Pred Value : 0.9231          
             Prevalence : 0.2483          
         Detection Rate : 0.2081          
   Detection Prevalence : 0.4765          
      Balanced Accuracy : 0.7403          
                                          
       'Positive' Class : Yes

LDA is likely not the best tool for this dataset with only 5 numerical predictors. Dummetizing the categorical predictors improves the performance. However, it weakens the assumption of multivariate normality.

The formula of the final Objective 2 LDA model is:
fracture ~ priorfrac + weight + height + momfrac + raterisk + fracscore + bonemed_fu (fit4b).

With a threshold of 0.20 to prioritize treating high risk individuals while maintain reasonable avoidance of unnecessary intervention for low risk individuals, the model achieves the following metrics:
- AUROC: 0.75 - Sensitivity: 0.84
- Specificity: 0.64 - PPV: 0.44
- NPV: 0.92
- Prevalence: 0.25
- Logloss Validation: 0.52 - Logloss CV: 0.54

Sticking only with numerical predictors, using all numerical predictors from the full dataset results in a model that performs nearly as well as dummetizing variables. That model is: fracture ~ bmi + fracscore + age + weight + height.

With a threshold of 0.21, the model achieves the following metrics:
- AUROC: 0.71 - Sensitivity: 0.81
- Specificity: 0.60 - PPV: 0.40
- NPV: 0.91
- Prevalence: 0.25
- Logloss Validation: 0.51 - Logloss CV: 0.54

Here I experimented with plotting the decision boundary for two of the more important predictors just out of curiousity.

# Set up control and validation grid
fitControl <- trainControl(method = "repeatedcv", number = 10, repeats = 1, classProbs = TRUE, summaryFunction = mnLogLoss)
set.seed(1234)

# Model with dummetized variables
threshold = 0.2
lda.fit<-train(fracture ~ priorfrac + weight + height + momfrac + raterisk + fracscore + bonemed_fu,
               data=training,
               method="lda",
               trControl=fitControl,
               metric="logLoss")

# Model with only numerical variables
# threshold = 0.21
# lda.fit<-train(fracture ~ bmi + fracscore + age + weight + height,
#                data=training,
#                method="lda",
#                trControl=fitControl,
#                metric="logLoss")

#Computing predicted probabilities on the validation set
predictions <- predict(lda.fit, validate, type = "prob")[,"Yes"]
class_pred <- as.factor(ifelse(predictions > threshold, "1", "0"))
color_array <- c("red3", "blue3")[as.numeric(class_pred)]

# Plot data and decision boundary
plot(training$height, training$fracscore, col = training$fracture, asp = 1, pch = 16, cex = 0.7, xlab = "height", ylab = "fracscore")

Non-Parametric Model

Here, I fit a random forest model with all of my original variables to get the variable importance scores. I will use them with previous model information to test for the best performance.

# Build a random forest model to look at variable importance

set.seed(1234)

train_control = trainControl(method = "repeatedcv", number = 10, repeats = 1)

# Define the grid for tuning 'mtry'
tune_grid = expand.grid(
  .mtry = c(2:6)  # Number of variables randomly sampled as candidates at each split (good start p/3)
)

# str(bone_subset %>% dplyr::select(-site_id, -phy_id, -bmi, -bonetreat))
df = training %>% dplyr::select(-site_id, -phy_id, -bmi, -bonetreat)

# Fit a random forest model (bagging is inherent in random forest)
model_rf = train(
  fracture ~ .,
  data = df,
  method = "rf",                    # Use 'randomForest' method
  trControl = train_control,
  tuneGrid = tune_grid,
  # tunelength = 3,
  ntree = 100,                      # Specify the number of trees here
  importance = TRUE                # Compute variable importance
)

# Print the best model parameters
print(model_rf)

Random Forest 

351 samples
 12 predictor
  2 classes: 'No', 'Yes' 

No pre-processing
Resampling: Cross-Validated (10 fold, repeated 1 times) 
Summary of sample sizes: 316, 317, 315, 315, 316, 316, ... 
Resampling results across tuning parameters:

  mtry  Accuracy   Kappa     
  2     0.7266293  0.01657723
  3     0.7238515  0.07137503
  4     0.7181373  0.06325066
  5     0.7211578  0.07931540
  6     0.7211531  0.07738378

Accuracy was used to select the optimal model using the largest value.
The final value used for the model was mtry = 2.

# Plot the performance across different parameter combinations
plot(model_rf)

# Get the importance of variables
var_importance = varImp(model_rf)

# Print the importance scores
print(var_importance)

rf variable importance

                Importance
fracscore           100.00
priorfracYes         92.72
momfracYes           88.13
bonemed_fuYes        87.63
age                  83.75
height               77.43
bonemedYes           72.63
rateriskSame         59.29
smokeYes             52.33
weight               44.35
rateriskGreater      30.62
premenoYes           11.88
armassistYes          0.00

# Plot the variable importance
plot(var_importance)

Here, I try different predictors in a random forest model to test for the best performance.

set.seed(1234)

# Define training control for cross-validation
train_control <- trainControl(
  method = "repeatedcv",
  number = 10,
  repeats = 1,
  classProbs = TRUE,
  summaryFunction = mnLogLoss
)

# Define the grid for tuning 'mtry'
tune_grid <- expand.grid(
  .mtry = 2:6  # Number of variables randomly sampled as candidates at each split
)

# Define predictor combinations
predictor_sets <- list(
  obj1fit = fracture ~ priorfrac + weight + height + premeno + momfrac + armassist + smoke + raterisk + fracscore + bonemed + bonemed_fu,
  fit2b = fracture ~ priorfrac + weight + height + momfrac + smoke + raterisk + fracscore + bonemed_fu,
  fit4b = fracture ~ priorfrac + weight + height + momfrac + raterisk + fracscore + bonemed_fu, #ldafit
  Multiplicative = fracture ~ height + premeno + momfrac + smoke + fracscore + priorfrac:bonemed + priorfrac:weight + momfrac:raterisk + momfrac:bonemed + momfrac:smoke + smoke:bonemed + raterisk:bonemed_fu,
  obj2fit = fracture ~ priorfrac + weight + height + premeno + momfrac + smoke + raterisk + fracscore + bonemed + bonemed_fu + priorfrac:bonemed + priorfrac:weight + momfrac:raterisk + momfrac:bonemed + momfrac:smoke + smoke:bonemed + raterisk:bonemed_fu,
  Quadratic = fracture ~ fracscore + weight + poly(weight, 2) + height + smoke + momfrac + premeno + bonemed + priorfrac + priorfrac:premeno + priorfrac:bonemed + priorfrac:weight + priorfrac:poly(weight, 2) + smoke:bonemed + momfrac:bonemed + weight:bonemed + poly(weight, 2):bonemed,
  fit_imp1 = fracture ~ priorfrac + fracscore + bonemed + height + bonemed_fu + weight,
  fit_imp2 = fracture ~ priorfrac + fracscore + bonemed + height + bonemed_fu + age + weight,
  fit_imp3 = fracture ~ priorfrac + fracscore + bonemed + height + bonemed_fu + weight + smoke + raterisk, # chosen
  fit_imp3b = fracture ~ priorfrac + fracscore + bonemed + height + bonemed_fu + weight + smoke + raterisk + priorfrac:bonemed + priorfrac:weight + smoke:bonemed + raterisk:bonemed_fu
  # fit_imp4 = fracture ~ priorfrac + fracscore + bonemed + height + bonemed_fu + weight + smoke + raterisk + momfrac
)

# Initialize results storage
results <- data.frame(
  Predictor_Set = character(),
  Threshold = numeric(),
  Sensitivity = numeric(),
  Specificity = numeric(),
  PPV = numeric(),
  NPV = numeric(),
  AUROC = numeric(),
  LogLoss_Validation = numeric(),
  LogLoss_CV = numeric(),
  stringsAsFactors = FALSE
)

# Loop through predictor combinations
for (predictor_name in names(predictor_sets)) {
  formula <- predictor_sets[[predictor_name]]
  
  # Train the Random Forest model
  model_rf <- train(
    formula,
    data = training,
    method = "rf",
    trControl = train_control,
    tuneGrid = tune_grid,
    metric = "logLoss",
    ntree = 100,
    importance = TRUE
  )
  
  # Compute cross-validation log loss
  cv_logloss <- min(model_rf$results$logLoss)
  
  # Predict probabilities on validation set
  pred_probs <- predict(model_rf, validate, type = "prob")[, "Yes"]
  
  # Compute validation log loss
  calculate_logloss <- function(obs, pred_probs) {
    epsilon <- 1e-15
    pred_probs <- pmax(pmin(pred_probs, 1 - epsilon), epsilon)
    -mean(ifelse(obs == "Yes", log(pred_probs), log(1 - pred_probs)))
  }
  validation_logloss <- calculate_logloss(validate$fracture, pred_probs)
  
  # Compute AUROC
  roc_curve <- roc(validate$fracture, pred_probs, levels = c("No", "Yes"))
  auroc <- auc(roc_curve)
  
  # Optimize threshold for Sensitivity ≥ 0.8 and highest PPV
  thresholds <- seq(0, 1, by = 0.01)
  stats <- data.frame(
    Threshold = numeric(),
    Sensitivity = numeric(),
    Specificity = numeric(),
    PPV = numeric(),
    NPV = numeric()
  )
  
  for (threshold in thresholds) {
    class_pred <- factor(ifelse(pred_probs > threshold, "Yes", "No"), levels = c("No", "Yes"))
    cm <- confusionMatrix(class_pred, validate$fracture, positive = "Yes")
    
    stats <- rbind(stats, data.frame(
      Threshold = threshold,
      Sensitivity = cm$byClass["Sensitivity"],
      Specificity = cm$byClass["Specificity"],
      PPV = cm$byClass["Pos Pred Value"],
      NPV = cm$byClass["Neg Pred Value"]
    ))
  }
  
  # Filter for Sensitivity ≥ 0.8 and find the best threshold
  best_stats <- stats %>% filter(Sensitivity >= 0.8) %>% arrange(desc(PPV))
  
  if (nrow(best_stats) > 0) {
    best_threshold <- best_stats$Threshold[1]
    best_row <- best_stats[1, ]
    
    # Store the results for this predictor set
    results <- rbind(results, data.frame(
      Predictor_Set = predictor_name,
      Threshold = best_threshold,
      Sensitivity = best_row$Sensitivity,
      Specificity = best_row$Specificity,
      PPV = best_row$PPV,
      NPV = best_row$NPV,
      AUROC = auroc,
      LogLoss_Validation = validation_logloss,
      LogLoss_CV = cv_logloss
    ))
  } else {
    # If no threshold meets the sensitivity requirement, store NA for this predictor set
    results <- rbind(results, data.frame(
      Predictor_Set = predictor_name,
      Threshold = NA,
      Sensitivity = NA,
      Specificity = NA,
      PPV = NA,
      NPV = NA,
      AUROC = auroc,
      LogLoss_Validation = validation_logloss,
      LogLoss_CV = cv_logloss
    ))
  }
}

# Print the results
results <- results[order(-as.numeric(results$AUROC), -results$Sensitivity, -results$PPV), ]
print(results)

    Predictor_Set Threshold Sensitivity Specificity       PPV       NPV
5         obj2fit      0.13   0.8648649   0.4642857 0.3478261 0.9122807
4  Multiplicative      0.11   0.8108108   0.4196429 0.3157895 0.8703704
10      fit_imp3b      0.12   0.8378378   0.4196429 0.3229167 0.8867925
3           fit4b      0.11   0.9189189   0.3303571 0.3119266 0.9250000
2           fit2b      0.12   0.8648649   0.4017857 0.3232323 0.9000000
9        fit_imp3      0.16   0.8108108   0.4285714 0.3191489 0.8727273
1         obj1fit      0.18   0.8648649   0.4285714 0.3333333 0.9056604
6       Quadratic      0.10   0.8378378   0.3839286 0.3100000 0.8775510
8        fit_imp2      0.16   0.8108108   0.4553571 0.3296703 0.8793103
7        fit_imp1      0.15   0.8108108   0.4107143 0.3125000 0.8679245
       AUROC LogLoss_Validation LogLoss_CV
5  0.7098214          0.7234368  0.6670635
4  0.6951014          0.5368528  0.6637025
10 0.6794160          0.5489315  0.5581826
3  0.6755550          0.5437283  0.5860024
2  0.6651786          0.5517397  0.6682615
9  0.6637307          0.5445528  0.5513502
1  0.6619208          0.7488199  0.5657350
6  0.6403234          0.7856626  0.7681333
8  0.6323600          0.5712087  0.6611917
7  0.6164334          0.5766410  0.6884587

finalAUROC = results$AUROC

I chose the predictors for my random forest model using the predictors with an importance over 10% (with the exception of age which was highly correlated with fracscore). The chosen model results in a comparatively low validation logloss that is also similar to its cv logloss. The AUROC and PPV are also comparatively high with a sensitivity of over .80 with a threshold of 0.16.

The formula of the final Objective 2 non-parametric, Random Forest model is:
fracture ~ priorfrac + fracscore + bonemed + height + bonemed_fu + weight + smoke + raterisk (fit_imp3).

With a threshold of 0.16 to prioritize treating high risk individuals while maintain reasonable avoidance of unnecessary intervention for low risk individuals, the model achieves the following metrics:
- AUROC: 0.68 (This value might change even with a random seed, because RF makes different decisions trees each time.) - Sensitivity: 0.81
- Specificity: 0.50 - PPV: 0.35
- NPV: 0.89
- Prevalence: 0.25
- Logloss Validation: 0.54 - Logloss CV: 0.58

This plots an ROC curve for all the models.

# Fit the models first

set.seed(1234)

logistic_obj1 <- glm(
  formula = fracture ~ priorfrac + weight + height + premeno + momfrac + armassist + smoke + raterisk + fracscore + bonemed + bonemed_fu,
  data = training,
  family = "binomial"
)

logistic_obj2 <- glm(
  formula = fracture ~ priorfrac + weight + height + premeno + momfrac + smoke + raterisk + fracscore + bonemed + bonemed_fu + priorfrac:bonemed + priorfrac:weight + momfrac:raterisk + momfrac:bonemed + momfrac:smoke + smoke:bonemed + raterisk:bonemed_fu,
  data = training,
  family = "binomial"
)

lda_model <- train(
  fracture ~ priorfrac + weight + height + momfrac + raterisk + fracscore + bonemed_fu,
  data = training,
  method = "lda",
  trControl = trainControl(method = "cv", number = 10, classProbs = TRUE, summaryFunction = mnLogLoss),
  metric = "logLoss"
)
rf_model <- train(
  fracture ~ priorfrac + fracscore + bonemed + height + bonemed_fu + weight + smoke + raterisk,
  data = training,
  method = "rf",
  trControl = trainControl(method = "cv", number = 10, classProbs = TRUE, summaryFunction = mnLogLoss),
  tuneGrid = expand.grid(.mtry = 2:6),
  ntree = 100,
  metric = "logLoss"
)

# Predicting probabilities on the validation data
obj1.predprobs <- predict(logistic_obj1, validate, type = "response")
obj2_log.predprobs <- predict(logistic_obj2, validate, type = "response")
lda.predprobs <- predict(lda_model, validate, type = "prob")[, "Yes"]
rf.predprobs <- predict(rf_model, validate, type = "prob")[, "Yes"]

# Calculate ROC and AUROC
obj1.roc <- roc(response = validate$fracture, predictor = obj1.predprobs, levels = c("No", "Yes"))
obj2_log.roc <- roc(response = validate$fracture, predictor = obj2_log.predprobs, levels = c("No", "Yes"))
lda.roc <- roc(response = validate$fracture, predictor = lda.predprobs, levels = c("No", "Yes"))
rf.roc <- roc(response = validate$fracture, predictor = rf.predprobs, levels = c("No", "Yes"))

# Extract AUROC values
obj1.auc <- round(auc(obj1fit.roc), 3)
obj2_log.auc <- round(auc(obj2_log.roc), 3)
lda.auc <- round(auc(lda.roc), 3)
rf.auc <- round(auc(rf.roc), 3)

finalAUROC = c(obj1.auc, obj2_log.auc, lda.auc, rf.auc)

# Save plot to a PNG file
# png("figuresKH/ROC_Curves_with_AUROC.png", width = 800, height = 600)

# Plot ROC curves
plot(obj1.roc, col = "purple3", lwd = 2, lty = 1, main = "ROC Curves for Candidate Models") #print.thres="best"
plot(obj2_log.roc, col = "green4", lwd = 2, lty = 1, add = TRUE)
plot(lda.roc, col = "red3", lwd = 2, lty = 1, add = TRUE)
plot(rf.roc, col = "blue2", lwd = 2, lty = 1, add = TRUE) 

# Add legend with AUROC values
legend("bottomright",
       legend = c(
         paste("Log Obj1 (AUROC =", finalAUROC[1], ")"),
         paste("Log Obj2 (AUROC =", finalAUROC[2], ")"),
         paste("LDA Obj2 (AUROC =", finalAUROC[3], ")"),
         paste("RF Obj2 (AUROC =", finalAUROC[4], ")")
       ),
       col = c("purple3","green4", "red3", "blue2"),
       lwd = 2, lty = c(1, 1, 1, 1), cex = 1, xpd = TRUE, horiz = FALSE)

# Close the graphics device
# dev.off()

Here, I use the above fitted models to plot a bias-variance curve.

# Bias-Variance Tradeoff

# Define pre-fitted models
models <- list(
  "Logistic (additive)" = logistic_obj1,
  "Logistic (with interactions)" = logistic_obj2,
  "LDA" = lda_model,
  "Random Forest" = rf_model
  )

# Initialize an empty data frame to store errors and metrics
errors <- data.frame()

# Compute errors and metrics
errors <- data.frame()

# Iterate over model indices
for (i in seq_along(models)) {
  model_name <- names(models)[i]
  model <- models[[model_name]]
  
  # Predict on training and validation sets
  if (inherits(model, "train")) {
    train_preds <- predict(model, training, type = "prob")[, "Yes"]
    val_preds <- predict(model, validate, type = "prob")[, "Yes"]
  } else {
    train_preds <- predict(model, training, type = "response")
    val_preds <- predict(model, validate, type = "response")
  }
  
  # Compute metrics
  train_error <- logLoss(actual = ifelse(training$fracture == "Yes", 1, 0), predicted = train_preds)
  val_error <- logLoss(actual = ifelse(validate$fracture == "Yes", 1, 0), predicted = val_preds)
  roc_obj <- roc(response = validate$fracture, predictor = val_preds, levels = c("No", "Yes"))
  roc_auc <- as.numeric(auc(roc_obj))
  
  # Store results
  errors <- rbind(errors, data.frame(
    Model = model_name,
    TrainingError = train_error,
    ValidationError = val_error,
    ROC = finalAUROC[i] # Use the corresponding value from finalAUROC
  ))
}

# Set factor levels for the models to match the order in the list
errors$Model <- factor(errors$Model, levels = names(models))

# Plot Bias-Variance Tradeoff
bv_plot <- ggplot(errors, aes(x = Model)) +
  geom_line(aes(y = TrainingError, color = "Training Log Loss"), group = 1, linewidth = 1) +
  geom_line(aes(y = ValidationError, color = "Validation Log Loss"), group = 1, linewidth = 1) +
  geom_point(aes(y = TrainingError, color = "Training Log Loss"), size = 2) +
  geom_point(aes(y = ValidationError, color = "Validation Log Loss"), size = 2) +
  geom_line(aes(y = ROC, color = "AUROC"), group = 1, linewidth = 1, linetype = "dotted") +
  geom_point(aes(y = ROC, color = "AUROC"), size = 2) +
  labs(
    title = "Bias-Variance Tradeoff Across Models",
    x = "Model",
    y = "Metric",
    color = "Metric Type"
  ) +
  scale_x_discrete(labels = function(x) gsub(" \\(", "\n(", x)) +
  scale_color_manual(values = c("gray30", "turquoise4", "darkorange3")) +
  theme_bw()

print(bv_plot)

ggsave(plot = bv_plot + theme(text = element_text(size = 14)),
         filename = "~/Desktop/bv_v2_3by2.png",
         width = 9, height = 6, dpi = 200)

This produces a table with the metrics for each model for comparison.

# Function to calculate metrics
calculate_metrics <- function(model_name, model_fit, data_train, data_validate, threshold) {
  # Predict probabilities on the validation data
  pred_probs <- predict(model_fit, newdata = data_validate, type = "prob")[, "Yes"]
  
  # Generate class predictions based on the threshold
  class_preds <- factor(ifelse(pred_probs > threshold, "Yes", "No"), levels = c("No", "Yes"))
  
  # Compute confusion matrix
  cm <- confusionMatrix(data = class_preds, reference = data_validate$fracture, positive = "Yes")
  
  # Calculate AUROC
  roc_curve <- roc(data_validate$fracture, pred_probs, levels = c("No", "Yes"))
  auroc <- auc(roc_curve)
  
  # Compute validation log loss
  calculate_logloss <- function(obs, pred_probs) {
    epsilon <- 1e-15
    pred_probs <- pmax(pmin(pred_probs, 1 - epsilon), epsilon)
    -mean(ifelse(obs == "Yes", log(pred_probs), log(1 - pred_probs)))
  }
  val_logloss <- calculate_logloss(data_validate$fracture, pred_probs)
  
  # Compute CV log loss if available
  cv_logloss <- if (!is.null(model_fit$resample$logLoss)) {
    mean(model_fit$resample$logLoss, na.rm = TRUE)
  } else {
    NA
  }
  
  # Compute additional metrics
  accuracy <- cm$overall["Accuracy"]
  sensitivity <- cm$byClass["Sensitivity"]
  specificity <- cm$byClass["Specificity"]
  ppv <- cm$byClass["Pos Pred Value"]
  npv <- cm$byClass["Neg Pred Value"]
  f1 <- ifelse(sensitivity + ppv > 0, 2 * (sensitivity * ppv) / (sensitivity + ppv), NA)
  prevalence <- cm$byClass["Prevalence"]
  
  # Return metrics as a data frame row
  data.frame(
    Model = model_name,
    AUROC = round(auroc, 3),
    Sensitivity = round(sensitivity, 3),
    Specificity = round(specificity, 3),
    PPV = round(ppv, 3),
    NPV = round(npv, 3),
    Accuracy = round(accuracy, 3),
    F1_Score = round(f1, 3),
    Threshold = threshold,
    Prevalence = round(prevalence, 3),
    Validation_LogLoss = round(val_logloss, 3),
    CV_LogLoss = round(cv_logloss, 3)
  )
}

# Define models manually
set.seed(1234)

logistic_obj1 <- train(
  fracture ~ priorfrac + age + weight + height + momfrac + armassist + raterisk + bonemed,
  data = training,
  method = "glm",
  family = "binomial",
  trControl = trainControl(
    method = "cv",
    number = 10,
    classProbs = TRUE,
    summaryFunction = mnLogLoss
  ),
  metric = "logLoss"
)

logistic_obj2 <- train(
  fracture ~ priorfrac + weight + height + premeno + momfrac + smoke + raterisk + fracscore + bonemed + bonemed_fu + 
    priorfrac:bonemed + priorfrac:weight + momfrac:raterisk + momfrac:bonemed + momfrac:smoke + smoke:bonemed + raterisk:bonemed_fu,
  data = training,
  method = "glm",
  family = "binomial",
  trControl = trainControl(
    method = "cv",
    number = 10,
    classProbs = TRUE,
    summaryFunction = mnLogLoss
  ),
  metric = "logLoss"
)

lda_model <- train(
  fracture ~ priorfrac + weight + height + momfrac + raterisk + fracscore + bonemed_fu,
  data = training,
  method = "lda",
  trControl = trainControl(
    method = "cv",
    number = 10,
    classProbs = TRUE,
    summaryFunction = mnLogLoss
  ),
  metric = "logLoss"
)

rf_model <- train(
  fracture ~ priorfrac + fracscore + bonemed + height + bonemed_fu + weight + smoke + raterisk,
  data = training,
  method = "rf",
  trControl = trainControl(
    method = "cv",
    number = 10,
    classProbs = TRUE,
    summaryFunction = mnLogLoss
  ),
  tuneGrid = expand.grid(.mtry = 2:6),
  ntree = 100,
  metric = "logLoss"
)

# Define thresholds for each model
thresholds <- list(
  "Logistic Obj1" = 0.19,
  "Logistic Obj2" = 0.20,
  "LDA" = 0.20,
  "Random Forest" = 0.16
)

# Collect metrics for all models
results <- rbind(
  calculate_metrics("Logistic Obj1", logistic_obj1, training, validate, thresholds[["Logistic Obj1"]]),
  calculate_metrics("Logistic Obj2", logistic_obj2, training, validate, thresholds[["Logistic Obj2"]]),
  calculate_metrics("LDA", lda_model, training, validate, thresholds[["LDA"]]),
  calculate_metrics("Random Forest", rf_model, training, validate, thresholds[["Random Forest"]])
)

results

                     Model AUROC Sensitivity Specificity   PPV   NPV Accuracy
Sensitivity  Logistic Obj1 0.735       0.865       0.580 0.405 0.929    0.651
Sensitivity1 Logistic Obj2 0.763       0.811       0.643 0.429 0.911    0.685
Sensitivity2           LDA 0.745       0.838       0.643 0.437 0.923    0.691
Sensitivity3 Random Forest 0.679       0.784       0.446 0.319 0.862    0.530
             F1_Score Threshold Prevalence Validation_LogLoss CV_LogLoss
Sensitivity     0.552      0.19      0.248              0.517      0.533
Sensitivity1    0.561      0.20      0.248              0.500      0.616
Sensitivity2    0.574      0.20      0.248              0.515      0.540
Sensitivity3    0.453      0.16      0.248              0.549      0.576

# # Format the results into a gt table
# results_table <- results %>%
#   gt() %>%
#   tab_header(
#     title = "Comparison of Model Performance"
#   ) %>%
#   cols_label(
#     Model = "Model Name",
#     AUROC = "AUROC",
#     Sensitivity = "Sensitivity",
#     Specificity = "Specificity",
#     PPV = "PPV",
#     NPV = "NPV",
#     Accuracy = "Accuracy",
#     F1_Score = "F1 Score",
#     Threshold = "Threshold",
#     Prevalence = "Prevalence",
#     Validation_LogLoss = "Validation Log Loss",
#     CV_LogLoss = "CV Log Loss"
#   )
# 
# # Display the table
# print(results_table)

Conclusion:

Summary of the findings from Objective 1
Fracture score and maternal history of fracture are statistically and practically significant predictors of fracture risk. Prior fracture history and height are also practically significant.

Summary of the findings and recommendations in Objective 2
Complex logistic regression is the best model, with the highest AUROC (0.76) and lowest validation logloss (0.50). It the best choice for prediction.
Additive logistic regression offers slightly lower performance but better interpretability. LDA and Random Forest underperform and are not recommended without further tuning.

Additional discussion points:

Scope of inference?
This study includes 500 women aged 55 and older, who were attended by 325 physicians across six study sites. It appears that this dataset is a subset of patients from a larger study, the Global Longitudinal Study of Osteoporosis in Women (GLOW), which included over 16,000 women at sites worldwide. Based on the study design, it is likely an observational study rather than a randomized controlled experiment or a retrospective study with a controlled distribution of response classes. Due to ethical and practical considerations, we cannot randomly assign participants to specific treatments or medical histories. Therefore, we can infer associations between features and the odds of experiencing a fracture, but we cannot establish causation or quantify changes in the probability of fracture.

Additionally, it is unclear whether this sample of women was randomly selected or if participants volunteered. This limitation restricts the generalizability of the findings. The results should be interpreted within the context of the study population, women at the participating study sites within the specified age range. Despite these limitations, the findings offer valuable insights and provide a foundation for further research.

Future Work
Validation: Test models on new, independent data to assess performance and revalidate at future time points.
Model Updates: Investigate potential modifications if model performance declines over time.
Leverage GLOW: Explore findings from the Global Longitudinal Study of Osteoporosis in Women (GLOW) for insights into data and statistical approaches.
Enhance Random Forest: Further optimize tuning parameters to improve performance.

Why one method performs better than the other for objective 2?
LDA is primarily useful for numeric data as it assumes the predictors are multivariate normal. Though categorical variables can be converted to dummy variables, the metrics may not be reliable when the assumptions aren’t met. Though Random Forest and other non-parametric methods may be ultimately useful, I may need to increase both my breadth and depth of knowledge to both choose the best tool and to tune it optimally. Random Forest for this small dataset may not be the best choice also because it may tend to overfit to the limited training data. Logistic regression seems like our best choice.

Why are one or two metrics more important/informative compared to the others?
Prioritizing sensitivity ensures identifying most women at risk, while PPV minimizes unnecessary treatments by ensuring positive classifications are accurate.
AUROC and logloss are valuable for comparing model performance across all thresholds, providing a threshold-independent assessment of model quality.