10  Logistic Regression

10.1 Read in the Data

diab_pop <- readRDS(here::here('DATA/diab_pop.RDS'))

10.1.1 Reminders about the Data

tibble::tribble(
  ~"Variable in Data", ~"Definition", ~"Data Type",
  'seqn', 'Respondent sequence number', 'Identifier',
  'riagendr', 'Gender', 'Categorical',
  'ridageyr', 'Age in years at screening', 'Continuous / Numerical',
  'ridreth1', 'Race/Hispanic origin', 'Categorical',
  'dmdeduc2', 'Education level', 'Adults 20+  - Categorical',
  'dmdmartl', 'Marital status', 'Categorical',
  'indhhin2', 'Annual household income', 'Categorical',
  'bmxbmi', 'Body Mass Index (kg/m**2)', 'Continuous / Numerical',
  'diq010', 'Doctor diagnosed diabetes', 'Categorical',
  'lbxglu', 'Fasting Glucose (mg/dL)', 'Continuous / Numerical'
  ) |>
  knitr::kable()
Variable in Data Definition Data Type
seqn Respondent sequence number Identifier
riagendr Gender Categorical
ridageyr Age in years at screening Continuous / Numerical
ridreth1 Race/Hispanic origin Categorical
dmdeduc2 Education level Adults 20+ - Categorical
dmdmartl Marital status Categorical
indhhin2 Annual household income Categorical
bmxbmi Body Mass Index (kg/m**2) Continuous / Numerical
diq010 Doctor diagnosed diabetes Categorical
lbxglu Fasting Glucose (mg/dL) Continuous / Numerical

10.1.2 Install if not Function

install_if_not <- function( list.of.packages ) {
  new.packages <- list.of.packages[!(list.of.packages %in% installed.packages()[,"Package"])]
  if(length(new.packages)) { install.packages(new.packages) } else { print(paste0("the package '", list.of.packages , "' is already installed")) }
}

10.2 Split Data into Training and Test Sets.

── Attaching core tidyverse packages ──────────────────────── tidyverse 2.0.0 ──
✔ dplyr     1.1.4     ✔ readr     2.1.5
✔ forcats   1.0.0     ✔ stringr   1.5.1
✔ ggplot2   3.5.1     ✔ tibble    3.2.1
✔ lubridate 1.9.3     ✔ tidyr     1.3.1
✔ purrr     1.0.2     
── Conflicts ────────────────────────────────────────── tidyverse_conflicts() ──
✖ dplyr::filter() masks stats::filter()
✖ dplyr::lag()    masks stats::lag()
ℹ Use the conflicted package (<http://conflicted.r-lib.org/>) to force all conflicts to become errors
diab_pop.no_na_vals <- diab_pop %>% na.omit()

library('caret')
Loading required package: lattice

Attaching package: 'caret'

The following object is masked from 'package:purrr':

    lift

this will ensure our results are the same every run to randomize you may use: set.seed(Sys.time())

set.seed(8675309) 

The createDataPartition function is used to create training and test sets

trainIndex <- createDataPartition(diab_pop.no_na_vals$diq010, 
                                  p = .6, 
                                  list = FALSE, 
                                  times = 1)
diab_pop.no_na_vals.train <- diab_pop.no_na_vals[trainIndex, ]
diab_pop.no_na_vals.test <- diab_pop.no_na_vals[-trainIndex, ]

10.3 Train Logistic Regression Model

logit_model <- glm(diq010 ~ riagendr + ridageyr + ridreth1 + dmdeduc2 + dmdmartl + indhhin2 + bmxbmi + lbxglu,
                   family="binomial",
                   data = diab_pop.no_na_vals.train)
                                             
logit_model

Call:  glm(formula = diq010 ~ riagendr + ridageyr + ridreth1 + dmdeduc2 + 
    dmdmartl + indhhin2 + bmxbmi + lbxglu, family = "binomial", 
    data = diab_pop.no_na_vals.train)

Coefficients:
                       (Intercept)                      riagendrFemale  
                           9.80789                            -0.11464  
                          ridageyr              ridreth1Other Hispanic  
                          -0.05503                            -0.29723  
        ridreth1Non-Hispanic White          ridreth1Non-Hispanic Black  
                          -0.38270                            -0.92228  
                     ridreth1Other               dmdeduc2Grades 9-11th  
                          -0.71991                             0.44575  
  dmdeduc2High school graduate/GED  dmdeduc2Some college or AA degrees  
                           1.26584                             0.90043  
     dmdeduc2College grad or above                     dmdmartlWidowed  
                           0.94730                             0.47483  
                  dmdmartlDivorced                   dmdmartlSeparated  
                           0.36843                            -0.04411  
             dmdmartlNever married         dmdmartlLiving with partner  
                          -0.03577                             0.17759  
             indhhin2$5,000-$9,999             indhhin2$10,000-$14,999  
                           0.34427                             1.52998  
           indhhin2$15,000-$19,999             indhhin2$20,000-$24,999  
                           1.78220                             1.08653  
           indhhin2$25,000-$34,999             indhhin2$45,000-$54,999  
                           1.04877                             0.75253  
           indhhin2$65,000-$74,999                     indhhin220,000+  
                           0.90369                             1.03068  
         indhhin2less than $20,000             indhhin2$75,000-$99,999  
                           2.72902                             1.41228  
                 indhhin2$100,000+                              bmxbmi  
                           0.77823                            -0.05353  
                            lbxglu  
                          -0.03826  

Degrees of Freedom: 1125 Total (i.e. Null);  1097 Residual
Null Deviance:      952.3 
Residual Deviance: 551.3    AIC: 609.3
summary(logit_model)

Call:
glm(formula = diq010 ~ riagendr + ridageyr + ridreth1 + dmdeduc2 + 
    dmdmartl + indhhin2 + bmxbmi + lbxglu, family = "binomial", 
    data = diab_pop.no_na_vals.train)

Coefficients:
                                    Estimate Std. Error z value Pr(>|z|)    
(Intercept)                         9.807888   1.143258   8.579  < 2e-16 ***
riagendrFemale                     -0.114638   0.238933  -0.480  0.63138    
ridageyr                           -0.055027   0.009326  -5.900 3.63e-09 ***
ridreth1Other Hispanic             -0.297226   0.428658  -0.693  0.48807    
ridreth1Non-Hispanic White         -0.382705   0.393014  -0.974  0.33017    
ridreth1Non-Hispanic Black         -0.922281   0.418804  -2.202  0.02765 *  
ridreth1Other                      -0.719906   0.487388  -1.477  0.13966    
dmdeduc2Grades 9-11th               0.445749   0.429922   1.037  0.29982    
dmdeduc2High school graduate/GED    1.265841   0.400733   3.159  0.00158 ** 
dmdeduc2Some college or AA degrees  0.900433   0.389046   2.314  0.02064 *  
dmdeduc2College grad or above       0.947302   0.426745   2.220  0.02643 *  
dmdmartlWidowed                     0.474834   0.399959   1.187  0.23515    
dmdmartlDivorced                    0.368433   0.378205   0.974  0.32998    
dmdmartlSeparated                  -0.044107   0.558009  -0.079  0.93700    
dmdmartlNever married              -0.035769   0.437124  -0.082  0.93478    
dmdmartlLiving with partner         0.177591   0.547438   0.324  0.74563    
indhhin2$5,000-$9,999               0.344274   0.627713   0.548  0.58338    
indhhin2$10,000-$14,999             1.529979   0.657454   2.327  0.01996 *  
indhhin2$15,000-$19,999             1.782201   0.672002   2.652  0.00800 ** 
indhhin2$20,000-$24,999             1.086530   0.639608   1.699  0.08937 .  
indhhin2$25,000-$34,999             1.048766   0.590010   1.778  0.07548 .  
indhhin2$45,000-$54,999             0.752527   0.632012   1.191  0.23378    
indhhin2$65,000-$74,999             0.903686   0.656972   1.376  0.16897    
indhhin220,000+                     1.030682   0.902598   1.142  0.25349    
indhhin2less than $20,000           2.729024   1.140925   2.392  0.01676 *  
indhhin2$75,000-$99,999             1.412279   0.657736   2.147  0.03178 *  
indhhin2$100,000+                   0.778230   0.615139   1.265  0.20582    
bmxbmi                             -0.053526   0.017057  -3.138  0.00170 ** 
lbxglu                             -0.038258   0.003688 -10.373  < 2e-16 ***
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

(Dispersion parameter for binomial family taken to be 1)

    Null deviance: 952.29  on 1125  degrees of freedom
Residual deviance: 551.29  on 1097  degrees of freedom
AIC: 609.29

Number of Fisher Scoring iterations: 6
str(logit_model,1)
List of 30
 $ coefficients     : Named num [1:29] 9.808 -0.115 -0.055 -0.297 -0.383 ...
  ..- attr(*, "names")= chr [1:29] "(Intercept)" "riagendrFemale" "ridageyr" "ridreth1Other Hispanic" ...
 $ residuals        : Named num [1:1126] 1.01 1.03 1.68 -1.01 1.13 ...
  ..- attr(*, "names")= chr [1:1126] "2" "11" "13" "14" ...
 $ fitted.values    : Named num [1:1126] 0.98802 0.96838 0.59693 0.00902 0.88719 ...
  ..- attr(*, "names")= chr [1:1126] "2" "11" "13" "14" ...
 $ effects          : Named num [1:1126] -12.493 0.103 6.2 -0.463 2.082 ...
  ..- attr(*, "names")= chr [1:1126] "(Intercept)" "riagendrFemale" "ridageyr" "ridreth1Other Hispanic" ...
 $ R                : num [1:29, 1:29] -8.93 0 0 0 0 ...
  ..- attr(*, "dimnames")=List of 2
 $ rank             : int 29
 $ qr               :List of 5
  ..- attr(*, "class")= chr "qr"
 $ family           :List of 13
  ..- attr(*, "class")= chr "family"
 $ linear.predictors: Named num [1:1126] 4.413 3.422 0.393 -4.699 2.062 ...
  ..- attr(*, "names")= chr [1:1126] "2" "11" "13" "14" ...
 $ deviance         : num 551
 $ aic              : num 609
 $ null.deviance    : num 952
 $ iter             : int 6
 $ weights          : Named num [1:1126] 0.01184 0.03062 0.2406 0.00894 0.10009 ...
  ..- attr(*, "names")= chr [1:1126] "2" "11" "13" "14" ...
 $ prior.weights    : Named num [1:1126] 1 1 1 1 1 1 1 1 1 1 ...
  ..- attr(*, "names")= chr [1:1126] "2" "11" "13" "14" ...
 $ df.residual      : int 1097
 $ df.null          : int 1125
 $ y                : Named num [1:1126] 1 1 1 0 1 1 1 1 1 1 ...
  ..- attr(*, "names")= chr [1:1126] "2" "11" "13" "14" ...
 $ converged        : logi TRUE
 $ boundary         : logi FALSE
 $ model            :'data.frame':  1126 obs. of  9 variables:
  ..- attr(*, "terms")=Classes 'terms', 'formula'  language diq010 ~ riagendr + ridageyr + ridreth1 + dmdeduc2 + dmdmartl + indhhin2 +      bmxbmi + lbxglu
  .. .. ..- attr(*, "variables")= language list(diq010, riagendr, ridageyr, ridreth1, dmdeduc2, dmdmartl, indhhin2,      bmxbmi, lbxglu)
  .. .. ..- attr(*, "factors")= int [1:9, 1:8] 0 1 0 0 0 0 0 0 0 0 ...
  .. .. .. ..- attr(*, "dimnames")=List of 2
  .. .. ..- attr(*, "term.labels")= chr [1:8] "riagendr" "ridageyr" "ridreth1" "dmdeduc2" ...
  .. .. ..- attr(*, "order")= int [1:8] 1 1 1 1 1 1 1 1
  .. .. ..- attr(*, "intercept")= int 1
  .. .. ..- attr(*, "response")= int 1
  .. .. ..- attr(*, ".Environment")=<environment: R_GlobalEnv> 
  .. .. ..- attr(*, "predvars")= language list(diq010, riagendr, ridageyr, ridreth1, dmdeduc2, dmdmartl, indhhin2,      bmxbmi, lbxglu)
  .. .. ..- attr(*, "dataClasses")= Named chr [1:9] "factor" "factor" "numeric" "factor" ...
  .. .. .. ..- attr(*, "names")= chr [1:9] "diq010" "riagendr" "ridageyr" "ridreth1" ...
 $ call             : language glm(formula = diq010 ~ riagendr + ridageyr + ridreth1 + dmdeduc2 + dmdmartl +      indhhin2 + bmxbmi + lbxglu, fa| __truncated__
 $ formula          :Class 'formula'  language diq010 ~ riagendr + ridageyr + ridreth1 + dmdeduc2 + dmdmartl + indhhin2 +      bmxbmi + lbxglu
  .. ..- attr(*, ".Environment")=<environment: R_GlobalEnv> 
 $ terms            :Classes 'terms', 'formula'  language diq010 ~ riagendr + ridageyr + ridreth1 + dmdeduc2 + dmdmartl + indhhin2 +      bmxbmi + lbxglu
  .. ..- attr(*, "variables")= language list(diq010, riagendr, ridageyr, ridreth1, dmdeduc2, dmdmartl, indhhin2,      bmxbmi, lbxglu)
  .. ..- attr(*, "factors")= int [1:9, 1:8] 0 1 0 0 0 0 0 0 0 0 ...
  .. .. ..- attr(*, "dimnames")=List of 2
  .. ..- attr(*, "term.labels")= chr [1:8] "riagendr" "ridageyr" "ridreth1" "dmdeduc2" ...
  .. ..- attr(*, "order")= int [1:8] 1 1 1 1 1 1 1 1
  .. ..- attr(*, "intercept")= int 1
  .. ..- attr(*, "response")= int 1
  .. ..- attr(*, ".Environment")=<environment: R_GlobalEnv> 
  .. ..- attr(*, "predvars")= language list(diq010, riagendr, ridageyr, ridreth1, dmdeduc2, dmdmartl, indhhin2,      bmxbmi, lbxglu)
  .. ..- attr(*, "dataClasses")= Named chr [1:9] "factor" "factor" "numeric" "factor" ...
  .. .. ..- attr(*, "names")= chr [1:9] "diq010" "riagendr" "ridageyr" "ridreth1" ...
 $ data             :'data.frame':  1126 obs. of  10 variables:
  ..- attr(*, "na.action")= 'omit' Named int [1:3843] 1 4 5 7 8 9 10 12 16 18 ...
  .. ..- attr(*, "names")= chr [1:3843] "1" "4" "5" "7" ...
 $ offset           : NULL
 $ control          :List of 3
 $ method           : chr "glm.fit"
 $ contrasts        :List of 5
 $ xlevels          :List of 5
 - attr(*, "class")= chr [1:2] "glm" "lm"
logit_model$coefficients     
                       (Intercept)                     riagendrFemale 
                        9.80788800                        -0.11463826 
                          ridageyr             ridreth1Other Hispanic 
                       -0.05502652                        -0.29722567 
        ridreth1Non-Hispanic White         ridreth1Non-Hispanic Black 
                       -0.38270466                        -0.92228099 
                     ridreth1Other              dmdeduc2Grades 9-11th 
                       -0.71990619                         0.44574943 
  dmdeduc2High school graduate/GED dmdeduc2Some college or AA degrees 
                        1.26584123                         0.90043329 
     dmdeduc2College grad or above                    dmdmartlWidowed 
                        0.94730185                         0.47483393 
                  dmdmartlDivorced                  dmdmartlSeparated 
                        0.36843309                        -0.04410658 
             dmdmartlNever married        dmdmartlLiving with partner 
                       -0.03576914                         0.17759058 
             indhhin2$5,000-$9,999            indhhin2$10,000-$14,999 
                        0.34427417                         1.52997906 
           indhhin2$15,000-$19,999            indhhin2$20,000-$24,999 
                        1.78220109                         1.08652981 
           indhhin2$25,000-$34,999            indhhin2$45,000-$54,999 
                        1.04876567                         0.75252721 
           indhhin2$65,000-$74,999                    indhhin220,000+ 
                        0.90368602                         1.03068174 
         indhhin2less than $20,000            indhhin2$75,000-$99,999 
                        2.72902414                         1.41227935 
                 indhhin2$100,000+                             bmxbmi 
                        0.77823046                        -0.05352607 
                            lbxglu 
                       -0.03825800 
logit_model$aic
[1] 609.2867
coeff_logit_model <- broom::tidy(logit_model)
coeff_logit_model
# A tibble: 29 × 5
   term                               estimate std.error statistic  p.value
   <chr>                                 <dbl>     <dbl>     <dbl>    <dbl>
 1 (Intercept)                          9.81     1.14        8.58  9.58e-18
 2 riagendrFemale                      -0.115    0.239      -0.480 6.31e- 1
 3 ridageyr                            -0.0550   0.00933    -5.90  3.63e- 9
 4 ridreth1Other Hispanic              -0.297    0.429      -0.693 4.88e- 1
 5 ridreth1Non-Hispanic White          -0.383    0.393      -0.974 3.30e- 1
 6 ridreth1Non-Hispanic Black          -0.922    0.419      -2.20  2.77e- 2
 7 ridreth1Other                       -0.720    0.487      -1.48  1.40e- 1
 8 dmdeduc2Grades 9-11th                0.446    0.430       1.04  3.00e- 1
 9 dmdeduc2High school graduate/GED     1.27     0.401       3.16  1.58e- 3
10 dmdeduc2Some college or AA degrees   0.900    0.389       2.31  2.06e- 2
# ℹ 19 more rows
coeff_logit_model <- coeff_logit_model %>%
  mutate(odds_ratio = exp(estimate)) %>%
  arrange( -abs(estimate) )

coeff_logit_model
# A tibble: 29 × 6
   term                         estimate std.error statistic  p.value odds_ratio
   <chr>                           <dbl>     <dbl>     <dbl>    <dbl>      <dbl>
 1 (Intercept)                     9.81      1.14       8.58 9.58e-18   18177.  
 2 indhhin2less than $20,000       2.73      1.14       2.39 1.68e- 2      15.3 
 3 indhhin2$15,000-$19,999         1.78      0.672      2.65 8.00e- 3       5.94
 4 indhhin2$10,000-$14,999         1.53      0.657      2.33 2.00e- 2       4.62
 5 indhhin2$75,000-$99,999         1.41      0.658      2.15 3.18e- 2       4.11
 6 dmdeduc2High school graduat…    1.27      0.401      3.16 1.58e- 3       3.55
 7 indhhin2$20,000-$24,999         1.09      0.640      1.70 8.94e- 2       2.96
 8 indhhin2$25,000-$34,999         1.05      0.590      1.78 7.55e- 2       2.85
 9 indhhin220,000+                 1.03      0.903      1.14 2.53e- 1       2.80
10 dmdeduc2College grad or abo…    0.947     0.427      2.22 2.64e- 2       2.58
# ℹ 19 more rows

10.4 Score Testing Data

Since training data was used to train the model, we use the hold-out data or testing data to evaluate the model performance:

probs <- predict(logit_model, diab_pop.no_na_vals.test, "response")
diab_pop.no_na_vals.test.SCORED <- cbind(diab_pop.no_na_vals.test, probs)

glimpse(diab_pop.no_na_vals.test.SCORED)
Rows: 750
Columns: 11
$ seqn     <dbl> 83734, 83737, 83757, 83761, 83789, 83820, 83822, 83823, 83834…
$ riagendr <fct> Male, Female, Female, Female, Male, Male, Female, Female, Mal…
$ ridageyr <dbl> 78, 72, 57, 24, 66, 70, 20, 29, 69, 71, 37, 49, 41, 54, 80, 3…
$ ridreth1 <fct> Non-Hispanic White, MexicanAmerican, Other Hispanic, Other, N…
$ dmdeduc2 <fct> High school graduate/GED, Grades 9-11th, Less than 9th grade,…
$ dmdmartl <fct> Married, Separated, Separated, Never married, Living with par…
$ indhhin2 <fct> "$20,000-$24,999", "$75,000-$99,999", "$20,000-$24,999", "$0-…
$ bmxbmi   <dbl> 28.8, 28.6, 35.4, 25.3, 34.0, 27.0, 22.2, 29.7, 28.2, 27.6, 3…
$ diq010   <fct> Diabetes, No Diabetes, Diabetes, No Diabetes, No Diabetes, No…
$ lbxglu   <dbl> 84, 107, 398, 95, 113, 94, 80, 102, 105, 76, 79, 126, 110, 99…
$ probs    <dbl> 9.387889e-01, 8.722288e-01, 5.437387e-05, 9.727618e-01, 8.427…

10.5 Use yardstick for Model Metrics

install_if_not('yardstick')
[1] "the package 'yardstick' is already installed"

Attaching package: 'yardstick'
The following objects are masked from 'package:caret':

    precision, recall, sensitivity, specificity
The following object is masked from 'package:readr':

    spec

10.6 Confusion Matrix

A Confusion Matrix is a table of counts of classses for predicted classes versus actual classes. Therefore, in order to create a confusion matrix we will need to specify a threshold value for the probability; let’s use .5 and the conf_mat function:

diab_pop.no_na_vals.test.SCORED <- diab_pop.no_na_vals.test.SCORED %>%
  mutate(Pred_DM2 = ifelse(probs <= 0.5, 1, 2)) %>%
  mutate(Pred_DM2 = as.factor(Pred_DM2))

levels(diab_pop.no_na_vals.test.SCORED$Pred_DM2) <- c('Diabetes' , 'No Diabetes')

glimpse(diab_pop.no_na_vals.test.SCORED)
Rows: 750
Columns: 12
$ seqn     <dbl> 83734, 83737, 83757, 83761, 83789, 83820, 83822, 83823, 83834…
$ riagendr <fct> Male, Female, Female, Female, Male, Male, Female, Female, Mal…
$ ridageyr <dbl> 78, 72, 57, 24, 66, 70, 20, 29, 69, 71, 37, 49, 41, 54, 80, 3…
$ ridreth1 <fct> Non-Hispanic White, MexicanAmerican, Other Hispanic, Other, N…
$ dmdeduc2 <fct> High school graduate/GED, Grades 9-11th, Less than 9th grade,…
$ dmdmartl <fct> Married, Separated, Separated, Never married, Living with par…
$ indhhin2 <fct> "$20,000-$24,999", "$75,000-$99,999", "$20,000-$24,999", "$0-…
$ bmxbmi   <dbl> 28.8, 28.6, 35.4, 25.3, 34.0, 27.0, 22.2, 29.7, 28.2, 27.6, 3…
$ diq010   <fct> Diabetes, No Diabetes, Diabetes, No Diabetes, No Diabetes, No…
$ lbxglu   <dbl> 84, 107, 398, 95, 113, 94, 80, 102, 105, 76, 79, 126, 110, 99…
$ probs    <dbl> 9.387889e-01, 8.722288e-01, 5.437387e-05, 9.727618e-01, 8.427…
$ Pred_DM2 <fct> No Diabetes, No Diabetes, Diabetes, No Diabetes, No Diabetes,…
conf_mat.logit <- diab_pop.no_na_vals.test.SCORED %>% 
  conf_mat(truth = diq010 , Pred_DM2 )
  
conf_mat.logit
             Truth
Prediction    Diabetes No Diabetes
  Diabetes          47          12
  No Diabetes       65         626
summary(conf_mat.logit)
# A tibble: 13 × 3
   .metric              .estimator .estimate
   <chr>                <chr>          <dbl>
 1 accuracy             binary        0.897 
 2 kap                  binary        0.498 
 3 sens                 binary        0.420 
 4 spec                 binary        0.981 
 5 ppv                  binary        0.797 
 6 npv                  binary        0.906 
 7 mcc                  binary        0.531 
 8 j_index              binary        0.401 
 9 bal_accuracy         binary        0.700 
10 detection_prevalence binary        0.0787
11 precision            binary        0.797 
12 recall               binary        0.420 
13 f_meas               binary        0.550 

10.7 Receiver Operating Characteristic (ROC) Curve

The Receiver Operating Characteristic Curve or ROC Curve is created by plotting the true positive rate (TPR) against the false positive rate (FPR) at various threshold settings. The function roc_auc can be used to obtain the estimated area under the ROC curve and the function roc_curve can be used to plot the curve:

roc_auc.logit <- diab_pop.no_na_vals.test.SCORED %>%
  roc_auc(truth=diq010 , probs, event_level = 'second')

roc_auc.logit
# A tibble: 1 × 3
  .metric .estimator .estimate
  <chr>   <chr>          <dbl>
1 roc_auc binary         0.854
diab_pop.no_na_vals.test.SCORED %>%
  roc_curve(truth=diq010 , probs, event_level = 'second') %>%
  autoplot() + 
  labs( title = "ROC Curve - Logistic Regression",
        caption = paste0("Area Under ROC Curve : ",
                         round(roc_auc.logit$.estimate,3) )  ) +
  theme( plot.title = element_text(size = 18) , 
         plot.caption = element_text(size = 12))

diab_pop.no_na_vals.test.SCORED %>%
  roc_curve(truth=diq010 , probs, event_level = 'second') %>%
  mutate(one_minus_spec = 1- specificity ) %>%
  ggplot(aes(x=one_minus_spec, y = sensitivity)) + 
  geom_point() +
  labs( title = "ROC Curve - Logistic Regression",
        caption = paste0("Area Under ROC Curve : ",
                         round(roc_auc.logit$.estimate,3) )  ) +
  theme( plot.title = element_text(size = 18) , 
         plot.caption = element_text(size = 12))

10.8 Precision-Recall Curve

The Precision-Recall Curve is created by plotting the precision or PPV against the recall or TPR at various threshold settings. The function pr_auc can be used to obtain the estimated area under the Precision-Recall Curve and the function pr_curve can be used to plot the curve:

pr_auc.logit <- diab_pop.no_na_vals.test.SCORED %>%
  pr_auc(truth=diq010 , probs, event_level = 'second')

pr_auc.logit
# A tibble: 1 × 3
  .metric .estimator .estimate
  <chr>   <chr>          <dbl>
1 pr_auc  binary         0.965
diab_pop.no_na_vals.test.SCORED %>%
  pr_curve(truth=diq010 , probs, event_level = 'second') %>%
  autoplot() + 
  labs( title = "Precision-Recall Curve - Logistic Regression",
        caption = paste0("Area Under Precision-Recall Curve : ", round(pr_auc.logit$.estimate,3) )  ) +
  theme( plot.title = element_text(size = 18) , 
         plot.caption = element_text(size = 12))

10.9 Lift Curve

lift_curve.logit <- diab_pop.no_na_vals.test.SCORED %>% 
  lift_curve(truth=diq010, probs, event_level = 'second') 
 
autoplot(lift_curve.logit) + 
  labs( title = "Lift Curve - Logistic Regression") +
  theme(plot.title = element_text(size = 18))

10.10 Gain Curve

gain_curve.logit <- diab_pop.no_na_vals.test.SCORED %>% 
  gain_curve(truth=diq010, probs, event_level = 'second') 
 
plot_gain_curve.logit <- autoplot(gain_curve.logit) + 
  labs( title = "Gain Curve - Logistic Regression") +
  theme(plot.title = element_text(size = 18))

plot_gain_curve.logit

11 Train Logistic Regression Model No Intercept

Let’s remove the intercept and the variable lbxglu :

features <- colnames(diab_pop.no_na_vals.train)[!colnames(diab_pop.no_na_vals.train) %in% c("seqn","diq010", "lbxglu")]

features_added <- paste0(features, collapse = " + ")

formula_2 <- paste0( "diq010 ~ ", features_added, " - 1"  )

formula_2
[1] "diq010 ~ riagendr + ridageyr + ridreth1 + dmdeduc2 + dmdmartl + indhhin2 + bmxbmi - 1"
logit_no_int_model <- glm(formula_2,
                          family="binomial",
                          data = diab_pop.no_na_vals.train)
                                             
logit_no_int_model

Call:  glm(formula = formula_2, family = "binomial", data = diab_pop.no_na_vals.train)

Coefficients:
                      riagendrMale                      riagendrFemale  
                           5.48435                             5.55488  
                          ridageyr              ridreth1Other Hispanic  
                          -0.05793                            -0.10096  
        ridreth1Non-Hispanic White          ridreth1Non-Hispanic Black  
                           0.23968                            -0.50310  
                     ridreth1Other               dmdeduc2Grades 9-11th  
                          -0.18278                             0.52285  
  dmdeduc2High school graduate/GED  dmdeduc2Some college or AA degrees  
                           0.84767                             0.82258  
     dmdeduc2College grad or above                     dmdmartlWidowed  
                           0.70671                             0.50172  
                  dmdmartlDivorced                   dmdmartlSeparated  
                           0.07506                            -0.10449  
             dmdmartlNever married         dmdmartlLiving with partner  
                           0.35830                             0.14843  
             indhhin2$5,000-$9,999             indhhin2$10,000-$14,999  
                           0.23534                             1.07971  
           indhhin2$15,000-$19,999             indhhin2$20,000-$24,999  
                           0.81281                             0.39890  
           indhhin2$25,000-$34,999             indhhin2$45,000-$54,999  
                           0.51096                             0.18778  
           indhhin2$65,000-$74,999                     indhhin220,000+  
                           0.99847                             0.51638  
         indhhin2less than $20,000             indhhin2$75,000-$99,999  
                           1.40105                             0.78987  
                 indhhin2$100,000+                              bmxbmi  
                           0.64519                            -0.06104  

Degrees of Freedom: 1126 Total (i.e. Null);  1098 Residual
Null Deviance:      1561 
Residual Deviance: 796.5    AIC: 852.5

11.1 Compare Models

Finally, we will compare the two models. First, score the test data with the model:

probs_no_int <- predict(logit_no_int_model, diab_pop.no_na_vals.test, "response")

Now we combine with our previous results and use a gather to stack the results into the column value:

diab_pop.no_na_vals.test.SCORED_Compare <- cbind(diab_pop.no_na_vals.test.SCORED, probs_no_int)

diab_pop.no_na_vals.test.SCORED_Compare <- diab_pop.no_na_vals.test.SCORED_Compare %>%
  tidyr::gather(probs, probs_no_int, key="model_type", value = "value")

glimpse(diab_pop.no_na_vals.test.SCORED_Compare)
Rows: 1,500
Columns: 13
$ seqn       <dbl> 83734, 83737, 83757, 83761, 83789, 83820, 83822, 83823, 838…
$ riagendr   <fct> Male, Female, Female, Female, Male, Male, Female, Female, M…
$ ridageyr   <dbl> 78, 72, 57, 24, 66, 70, 20, 29, 69, 71, 37, 49, 41, 54, 80,…
$ ridreth1   <fct> Non-Hispanic White, MexicanAmerican, Other Hispanic, Other,…
$ dmdeduc2   <fct> High school graduate/GED, Grades 9-11th, Less than 9th grad…
$ dmdmartl   <fct> Married, Separated, Separated, Never married, Living with p…
$ indhhin2   <fct> "$20,000-$24,999", "$75,000-$99,999", "$20,000-$24,999", "$…
$ bmxbmi     <dbl> 28.8, 28.6, 35.4, 25.3, 34.0, 27.0, 22.2, 29.7, 28.2, 27.6,…
$ diq010     <fct> Diabetes, No Diabetes, Diabetes, No Diabetes, No Diabetes, …
$ lbxglu     <dbl> 84, 107, 398, 95, 113, 94, 80, 102, 105, 76, 79, 126, 110, …
$ Pred_DM2   <fct> No Diabetes, No Diabetes, Diabetes, No Diabetes, No Diabete…
$ model_type <chr> "probs", "probs", "probs", "probs", "probs", "probs", "prob…
$ value      <dbl> 9.387889e-01, 8.722288e-01, 5.437387e-05, 9.727618e-01, 8.4…

Now we can use roc_auc and roc_curve with a group_by on the key we added in the last step:

roc_auc.logit <- diab_pop.no_na_vals.test.SCORED_Compare %>%
  group_by(model_type) %>%
  roc_auc(truth=diq010 , value, event_level = 'second')

roc_auc.logit
# A tibble: 2 × 4
  model_type   .metric .estimator .estimate
  <chr>        <chr>   <chr>          <dbl>
1 probs        roc_auc binary         0.854
2 probs_no_int roc_auc binary         0.753
diab_pop.no_na_vals.test.SCORED_Compare %>%
  group_by(model_type) %>%
  roc_curve(truth=diq010 , value, event_level = 'second') %>%
  autoplot() + 
  labs( title = "ROC Curves - Logistic Regression Compare ") +
  theme( plot.title = element_text(size = 18) , 
         plot.caption = element_text(size = 12))

We can see that the AUC of the first model is larger, additionally, the ROC curve of the second model is below the first, therefore the first model is the better choice.

12 Strengths & Weaknesses

Strengths Weakness
very common and standard approach strong assumptions about data
easy to understand and interpret model form must be specified in advance
flexibility, can be adopted for many situations does not handle missing data
provides estimates of both size and strength of the relationships among features and outcome only numeric features, categorical and strings need encodings
. understanding of statistics to fully understand model