7  Two Factor Classification with Categorical and Continuous Interactions

In this Chapter we will use the same dataset and will only consider non-missing values of Age, Gender, and Diabetic status.

We will follow much the same process of EDA with categorical features, making connections between the p-value of the chi-square test and feature importance in logistic regression.

In addition, we will have a more in-depth discussion around re-leveling factors in a logistic regression analysis and the impact this has on model interpretability.

This time we will have 3 primary model examples to compare:

  1. Logistic regression with rand_class feature
  2. Logistic regression with binned age and Gender features
  3. Logistic regression with Age interacted with Gender
Code
A_DATA <- readRDS(here::here('DATA/Part_2/A_DATA.RDS'))

library('tidyverse')
── Attaching core tidyverse packages ──────────────────────── tidyverse 2.0.0 ──
✔ dplyr     1.1.4     ✔ readr     2.1.5
✔ forcats   1.0.0     ✔ stringr   1.5.1
✔ ggplot2   3.5.1     ✔ tibble    3.2.1
✔ lubridate 1.9.3     ✔ tidyr     1.3.1
✔ purrr     1.0.2     
── Conflicts ────────────────────────────────────────── tidyverse_conflicts() ──
✖ dplyr::filter() masks stats::filter()
✖ dplyr::lag()    masks stats::lag()
ℹ Use the conflicted package (<http://conflicted.r-lib.org/>) to force all conflicts to become errors
Code
A_DATA <- A_DATA %>%
  select(SEQN, Age, Gender, DIABETES) %>%
  mutate(DIABETES_factor = as.factor(DIABETES)) %>%
  na.omit() 

A_DATA %>%
  glimpse()
Rows: 95,547
Columns: 5
$ SEQN            <dbl> 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15, 16,…
$ Age             <dbl> 2, 77, 10, 1, 49, 19, 59, 13, 11, 43, 15, 37, 70, 81, …
$ Gender          <chr> "Female", "Male", "Female", "Male", "Male", "Female", …
$ DIABETES        <dbl> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 0, 0, 0, 0, 0, …
$ DIABETES_factor <fct> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 0, 0, 0, 0, 0, …

7.1 Bin Age

Additionally, we will create deciles for the Age range:

Code
A_DATA <- A_DATA %>%
  mutate(age_bucket = as.factor(ntile(Age,10)))

A_DATA %>%
  head()
# A tibble: 6 × 6
   SEQN   Age Gender DIABETES DIABETES_factor age_bucket
  <dbl> <dbl> <chr>     <dbl> <fct>           <fct>     
1     1     2 Female        0 0               1         
2     2    77 Male          0 0               10        
3     3    10 Female        0 0               3         
4     4     1 Male          0 0               1         
5     5    49 Male          0 0               8         
6     6    19 Female        0 0               5         

Here’s a nice summary table of the various Age buckets we created:

Code
Age_bucket_Table <- A_DATA %>%
  select(Age, age_bucket) %>%
  group_by(age_bucket) %>%
  summarise(min_age = min(Age),
            mean_age = round(mean(Age),2),
            median_age = median(Age),
            max_age = max(Age),
            .groups = 'keep') %>%
  ungroup() %>%
  mutate(age_range = paste0(min_age, ' <= Age <= ', max_age )) %>%
  select(-min_age, -max_age) %>%
  arrange(age_bucket)

Age_bucket_Table %>%
  print(n=10)
# A tibble: 10 × 4
   age_bucket mean_age median_age age_range      
   <fct>         <dbl>      <dbl> <chr>          
 1 1              2.22          2 1 <= Age <= 4  
 2 2              6.52          7 4 <= Age <= 9  
 3 3             11.2          11 9 <= Age <= 13 
 4 4             15.6          16 13 <= Age <= 18
 5 5             21.2          21 18 <= Age <= 26
 6 6             31.0          31 26 <= Age <= 36
 7 7             41.4          41 36 <= Age <= 47
 8 8             52.4          52 47 <= Age <= 59
 9 9             63.9          64 59 <= Age <= 70
10 10            77.3          78 70 <= Age <= 85

7.2 Add noise feature

Additionally, we add a variable rand_class to the dataset. rand_class will be sampled from the options Rand_A or Rand_B. We will then randomly sort the data, and for every 5th row we will assign Rand_C to rand_class. Our reason for adding this random feature is to compare and contrast purely random information with information that is genuinely linked to the outcome at hand.

Code
set.seed(8675309)

A_DATA$rand_class <- sample(c('Rand_A','Rand_B'), 
                            size = nrow(A_DATA),  
                            replace = TRUE)

A_DATA$rand_sort <- runif(nrow(A_DATA))

A_DATA <- A_DATA %>%
  arrange(rand_sort) %>%
  mutate(rn = row_number()) %>%
  mutate(rn_mod_5 = rn %% 5 ) %>% 
  mutate(rand_class = if_else( rn_mod_5 == 0, 
                              "Rand_C", 
                              rand_class)) %>%
  select(-rn, -rn_mod_5, -rand_sort) %>%
  mutate(rand_class = as.factor(rand_class))
Code
A_DATA %>%
  head()
# A tibble: 6 × 7
   SEQN   Age Gender DIABETES DIABETES_factor age_bucket rand_class
  <dbl> <dbl> <chr>     <dbl> <fct>           <fct>      <fct>     
1 83229    10 Male          0 0               3          Rand_A    
2  6917    11 Male          0 0               3          Rand_A    
3  9902    23 Female        0 0               5          Rand_A    
4 81122    37 Female        0 0               7          Rand_A    
5 43687     1 Female        0 0               1          Rand_C    
6 95673    80 Male          0 0               10         Rand_B    

7.3 dplyr statistics

For categorical data we are typically reviewing counts between various groups:

Code
A_DATA %>%
  group_by(DIABETES, Gender) %>%
  tally() %>%
  pivot_wider(names_from = Gender, values_from = n)
# A tibble: 2 × 3
# Groups:   DIABETES [2]
  DIABETES Female  Male
     <dbl>  <int> <int>
1        0  45188 43552
2        1   3371  3436
Code
A_DATA %>%
  group_by(DIABETES, rand_class) %>%
  tally() %>%
  pivot_wider(names_from = rand_class, values_from = n)
# A tibble: 2 × 4
# Groups:   DIABETES [2]
  DIABETES Rand_A Rand_B Rand_C
     <dbl>  <int>  <int>  <int>
1        0  35434  35580  17726
2        1   2682   2742   1383

7.4 chisq.test

Suppose that \(n\) observations in a random sample from a population are classified into \(k\) mutually exclusive classes with respective observed numbers \(O_i\) (for \(i = 1, 2, \ldots, k\)).

The null hypothesis of the chi-square test specifies \(p_i\), the probability that an observation falls into the \(i\)-th class.

Therefore, for each \(i\), the expected frequency is: \[E_{i} = n \cdot p_{i}\]

Note since the \(p_i\) are probabilities we have that:

\[\sum_{i=1}^k p_i = 1 \]

Since there are \(n\) observations we have:

\[ n = n \sum_{i=1}^k p_i = \sum_{i=1}^k n \cdot p_i = \sum_{i=1}^k E_{i} \]

However, we have observed values:

\[ \sum_{i=1}^k O_i = n \]

Therefore, under the null hypothesis of the chi-square test we have that:

\[ \sum_{i=1}^k E_{i} = n \sum_{i=1}^k p_i = \sum_{i=1}^k O_i \]

We define the chi-squared test statistic as:

\[ \chi^2 = \sum_{i=1}^k\frac{(O_i - E_i)^2}{E_i} = n \cdot \sum_{i=1}^k \frac{ \left( \frac{O_i}{n} - p_i \right)^2 }{p_i}\]

where, under the null hypothesis, this statistic approximately follows a chi-square distribution with:

  • \(k-1\) degrees of freedom
  • \(O_{i}\) = the number of observations of type \(i\).
  • \(E_{i} = n \cdot p_{i}\) = the expected (theoretical) frequency of type \(i\), asserted by the null hypothesis that the fraction of type \(i\) in the population is \(p_{i}\)
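
To make the formula concrete, here is a minimal sketch of computing the goodness-of-fit statistic by hand on a small hypothetical vector of counts (the numbers are made up purely for illustration):

Code
# Hypothetical observed counts and null-hypothesis class probabilities
O <- c(30, 50, 20)
p <- c(0.25, 0.50, 0.25)

n <- sum(O)
E <- n * p                               # expected counts E_i = n * p_i

chi_sq <- sum((O - E)^2 / E)             # chi-square test statistic
chi_sq

# p-value from the chi-square distribution with k - 1 degrees of freedom
pchisq(chi_sq, df = length(O) - 1, lower.tail = FALSE)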

7.5 Gender

In this section we will run a chi-square test on Gender between the groups of DIABETES.

First we can create a table of Gender and DIABETES:

Code
tbl_DM2_Gender <- table(A_DATA$DIABETES, A_DATA$Gender)

tbl_DM2_Gender
   
    Female  Male
  0  45188 43552
  1   3371  3436

Notice this gives the same result as our dplyr query.

Next, let's run the chi-square test on the Gender by DIABETES table; here's a look at the output:

Code
chisq.tbl_DM2_Gender <- table(A_DATA$DIABETES, A_DATA$Gender) |>
  chisq.test()
Code
chisq.tbl_DM2_Gender

    Pearson's Chi-squared test with Yates' continuity correction

data:  table(A_DATA$DIABETES, A_DATA$Gender)
X-squared = 4.8966, df = 1, p-value = 0.02691

Let's review the structure of the test result:

Code
str(chisq.tbl_DM2_Gender,1)
List of 9
 $ statistic: Named num 4.9
  ..- attr(*, "names")= chr "X-squared"
 $ parameter: Named int 1
  ..- attr(*, "names")= chr "df"
 $ p.value  : num 0.0269
 $ method   : chr "Pearson's Chi-squared test with Yates' continuity correction"
 $ data.name: chr "table(A_DATA$DIABETES, A_DATA$Gender)"
 $ observed : 'table' int [1:2, 1:2] 45188 3371 43552 3436
  ..- attr(*, "dimnames")=List of 2
 $ expected : num [1:2, 1:2] 45100 3459 43640 3348
  ..- attr(*, "dimnames")=List of 2
 $ residuals: 'table' num [1:2, 1:2] 0.417 -1.504 -0.423 1.529
  ..- attr(*, "dimnames")=List of 2
 $ stdres   : 'table' num [1:2, 1:2] 2.23 -2.23 -2.23 2.23
  ..- attr(*, "dimnames")=List of 2
 - attr(*, "class")= chr "htest"

For instance, here are the observed counts:

Code
chisq.tbl_DM2_Gender$observed
   
    Female  Male
  0  45188 43552
  1   3371  3436

Note this is the same as tbl_DM2_Gender:

Code
tbl_DM2_Gender == chisq.tbl_DM2_Gender$observed
   
    Female Male
  0   TRUE TRUE
  1   TRUE TRUE

The expected counts under the null hypothesis are:

Code
chisq.tbl_DM2_Gender$expected
   
       Female      Male
  0 45099.539 43640.461
  1  3459.461  3347.539
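
For a two-way table, each expected count is (row total \(\times\) column total) / \(n\). Here is a quick sketch verifying that against the values above:

Code
# Expected counts under independence, built from the table margins
row_totals <- rowSums(tbl_DM2_Gender)
col_totals <- colSums(tbl_DM2_Gender)
n <- sum(tbl_DM2_Gender)

outer(row_totals, col_totals) / n   # should match chisq.tbl_DM2_Gender$expected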

The residuals are Pearson residuals and are defined as (observed - expected) / sqrt(expected):

Code
chisq.tbl_DM2_Gender$residuals 
   
        Female       Male
  0  0.4165484 -0.4234546
  1 -1.5039980  1.5289337
Code
(chisq.tbl_DM2_Gender$observed - chisq.tbl_DM2_Gender$expected) / sqrt(chisq.tbl_DM2_Gender$expected ) == chisq.tbl_DM2_Gender$residuals  
   
    Female Male
  0   TRUE TRUE
  1   TRUE TRUE

7.5.1 p-value review

Review the p-value of the chi-square test between two groups.

The null hypothesis is that each person’s gender is independent of the person’s diabetic status.

If the p-value is

  1. greater than .05, then we fail to reject the null hypothesis
  2. less than .05, then we reject the null hypothesis
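
A minimal sketch of this decision rule in code, using the chisq.tbl_DM2_Gender object from above and the conventional .05 cutoff:

Code
# Decision rule at the .05 significance level
alpha <- 0.05

if (chisq.tbl_DM2_Gender$p.value < alpha) {
  print("Reject the null hypothesis: Gender and DIABETES appear dependent")
} else {
  print("Fail to reject the null hypothesis")
}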

7.5.2 broom

We can tidy-up the results if we utilize the broom package. Recall that broom::tidy tells R to look in the broom package for the tidy function. The tidy function in the broom library converts many base R outputs to tibbles.

Code
tibble_chisq.tbl_DM2_Gender <- broom::tidy(chisq.tbl_DM2_Gender) %>%
  mutate(dist_1 = "Non_DM2.Gender") %>%
  mutate(dist_2 = "DM2.Gender")

The broom::tidy call transforms the list output of chisq.test into a tibble; we can also add variables to help us remember where the data came from.

Code
tibble_chisq.tbl_DM2_Gender %>%
  glimpse()
Rows: 1
Columns: 6
$ statistic <dbl> 4.896648
$ p.value   <dbl> 0.02690888
$ parameter <int> 1
$ method    <chr> "Pearson's Chi-squared test with Yates' continuity correctio…
$ dist_1    <chr> "Non_DM2.Gender"
$ dist_2    <chr> "DM2.Gender"

Our current p-value is tibble_chisq.tbl_DM2_Gender$p.value = 0.0269089. Since the p-value is less than .05 we reject the null hypothesis: Gender is not independent of the person's diabetic status.

7.6 Associated \(\chi^2\) Plots

7.6.1 Balloon Plot

This balloon plot shows the frequency table; the area of each circle is proportional to the frequency count.

Code
gplots::balloonplot(tbl_DM2_Gender,
                    main ="Balloon Plot for Gender by Diabetes \n Area is proportional to Freq.") 
Figure 7.1: Balloon Plot for Gender by Diabetes where area is proportional to Freq.

7.6.2 vcd::mosaic

With the vcd::mosaic function you get another “area graph” this time with the groups highlighted by color:

Code
library(vcd)
Loading required package: grid
Code
mosaic(A_DATA$DIABETES ~ A_DATA$Gender,
       gp = shading_max,
       main = "Mosaic plot of Gender by Diabetic Status")
Figure 7.2: Mosaic plot of Gender by Diabetic Status.

In the vcd::mosaic plot:

  • Blue indicates that the observed value is higher than the expected value if the data were random.
  • Red indicates that the observed value is lower than the expected value if the data were random.

For example, from this mosaic plot, it can be seen that the number of Diabetic males is higher than expected.

7.6.3 corrplot::corrplot

Code
corrplot::corrplot(chisq.tbl_DM2_Gender$residuals,
                   is.cor = FALSE)
Figure 7.3: Graphical display of the Pearson residuals; we set is.cor = FALSE to tell R this is a general matrix rather than a correlation matrix.

In the corrplot::corrplot plot:

  • Positive residuals are in blue. Positive values specify an attraction (positive association) between the corresponding row and column variables. In the image above, we can see there is an association between DIABETES and being Male.

  • Negative residuals are in red. This implies a repulsion (negative association) between the corresponding row and column variables. For example, DIABETES is negatively associated (or "not associated") with being Female.

7.7 rand_class chisq.test

As we did with Gender above, we can tabulate rand_class for each level of DIABETES:

Code
tbl_DM2_rand_class <- table(A_DATA$DIABETES, A_DATA$rand_class)
tbl_DM2_rand_class
   
    Rand_A Rand_B Rand_C
  0  35434  35580  17726
  1   2682   2742   1383

Let’s compute chisq.test for rand_class and the Diabetic and non-Diabetic populations:

Code
chisq.test.tbl_DM2_rand_class <- chisq.test(tbl_DM2_rand_class)
chisq.test.tbl_DM2_rand_class

    Pearson's Chi-squared test

data:  tbl_DM2_rand_class
X-squared = 0.86969, df = 2, p-value = 0.6474
Code
tibble_chisq.test.tbl_DM2_rand_class <- broom::tidy(chisq.test.tbl_DM2_rand_class) %>%
  mutate(dist_1 = "Non_DM2.rand_class") %>%
  mutate(dist_2 = "DM2.rand_class")

tibble_chisq.test.tbl_DM2_rand_class %>%
  glimpse()
Rows: 1
Columns: 6
$ statistic <dbl> 0.8696855
$ p.value   <dbl> 0.6473665
$ parameter <int> 2
$ method    <chr> "Pearson's Chi-squared test"
$ dist_1    <chr> "Non_DM2.rand_class"
$ dist_2    <chr> "DM2.rand_class"

7.7.1 plots

Code
gplots::balloonplot(tbl_DM2_rand_class, 
                    main ="Balloon Plot for rand_class by Diabetes \n Area is proportional to Freq.")

Code
mosaic(A_DATA$DIABETES ~ A_DATA$rand_class, 
       gp = shading_max,
       main = "Mosaic plot of rand_class by Diabetic status")

Again, notice the lack of color and the p-value of 0.6473665.

Code
corrplot::corrplot(chisq.test.tbl_DM2_rand_class$residuals, 
                   is.cor = FALSE)

We can see there is a slight association between DIABETES and being in Rand_C; however, it is not strong enough to give us a statistically significant chi-square p-value.

Code
bind_rows(tibble_chisq.tbl_DM2_Gender,
          tibble_chisq.test.tbl_DM2_rand_class) %>%
  select(statistic, p.value, dist_1, dist_2) 
# A tibble: 2 × 4
  statistic p.value dist_1             dist_2        
      <dbl>   <dbl> <chr>              <chr>         
1     4.90   0.0269 Non_DM2.Gender     DM2.Gender    
2     0.870  0.647  Non_DM2.rand_class DM2.rand_class

We see that the p.value is well below .05 for Gender but well above .05 for rand_class; our interpretation is that Gender has a statistically significant association with Diabetic status whereas rand_class does not.

7.8 Non-Missing Data

Code
A_DATA.no_miss <- A_DATA %>%
  select(SEQN, DIABETES, DIABETES_factor, Age, age_bucket, Gender, rand_class) %>%
  mutate(DIABETES_factor = fct_relevel(DIABETES_factor, c('0','1') ) ) %>%
  na.omit()

7.8.1 ANOVA

Code
aov(DIABETES ~ age_bucket, 
    data = A_DATA.no_miss) %>%
  summary()
               Df Sum Sq Mean Sq F value Pr(>F)    
age_bucket      9    786   87.35    1507 <2e-16 ***
Residuals   95537   5536    0.06                   
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
Code
aov(DIABETES ~ Gender, 
    data = A_DATA.no_miss) %>%
  summary()
               Df Sum Sq Mean Sq F value Pr(>F)  
Gender          1      0  0.3277   4.953 0.0261 *
Residuals   95545   6322  0.0662                 
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
Code
aov(DIABETES ~ rand_class, 
    data = A_DATA.no_miss) %>%
  summary()
               Df Sum Sq Mean Sq F value Pr(>F)
rand_class      2      0 0.02877   0.435  0.647
Residuals   95544   6322 0.06617               
Code
aov(DIABETES ~ rand_class + age_bucket + Gender, 
    data = A_DATA.no_miss) %>%
  summary()
               Df Sum Sq Mean Sq  F value  Pr(>F)    
rand_class      2      0    0.03    0.497 0.60861    
age_bucket      9    786   87.35 1507.565 < 2e-16 ***
Gender          1      1    0.54    9.284 0.00231 ** 
Residuals   95534   5535    0.06                     
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
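
If we would like these ANOVA results as a tibble (convenient for filtering or plotting later), broom::tidy also works on aov objects; a small sketch:

Code
# Tidy the combined ANOVA table into a tibble
aov(DIABETES ~ rand_class + age_bucket + Gender, 
    data = A_DATA.no_miss) %>%
  broom::tidy()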

7.8.2 split data

Code
sample.SEQN <- sample(A_DATA.no_miss$SEQN,
                      size = nrow(A_DATA.no_miss)*.6,
                      replace = FALSE)
Code
A_DATA.train <- A_DATA.no_miss %>%
  filter(SEQN %in% sample.SEQN)

A_DATA.test <- A_DATA.no_miss %>%
  filter(!(SEQN %in% sample.SEQN))
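
A quick sanity check on the split is worthwhile; this sketch simply confirms that the train and test sets partition the non-missing data with no overlap:

Code
# The two pieces should cover all rows exactly once
nrow(A_DATA.train) + nrow(A_DATA.test) == nrow(A_DATA.no_miss)

# No SEQN should appear in both sets
length(intersect(A_DATA.train$SEQN, A_DATA.test$SEQN)) == 0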

7.9 Train Logistic Regression Models

7.9.1 RAND

Code
RAND <- glm(DIABETES_factor ~ rand_class,
                     data= A_DATA.train,
                     family = "binomial")

7.9.1.1 RAND Model Output

Code
RAND

Call:  glm(formula = DIABETES_factor ~ rand_class, family = "binomial", 
    data = A_DATA.train)

Coefficients:
     (Intercept)  rand_classRand_B  rand_classRand_C  
       -2.590846          0.009805          0.049951  

Degrees of Freedom: 57327 Total (i.e. Null);  57325 Residual
Null Deviance:      29270 
Residual Deviance: 29270    AIC: 29280
Code
summary(RAND)

Call:
glm(formula = DIABETES_factor ~ rand_class, family = "binomial", 
    data = A_DATA.train)

Coefficients:
                  Estimate Std. Error z value Pr(>|z|)    
(Intercept)      -2.590846   0.025936 -99.893   <2e-16 ***
rand_classRand_B  0.009805   0.036578   0.268    0.789    
rand_classRand_C  0.049951   0.044324   1.127    0.260    
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

(Dispersion parameter for binomial family taken to be 1)

    Null deviance: 29272  on 57327  degrees of freedom
Residual deviance: 29271  on 57325  degrees of freedom
AIC: 29277

Number of Fisher Scoring iterations: 5
7.9.1.1.1 RAND training errors table
Code
RAND %>%
  broom::glance()
# A tibble: 1 × 8
  null.deviance df.null  logLik    AIC    BIC deviance df.residual  nobs
          <dbl>   <int>   <dbl>  <dbl>  <dbl>    <dbl>       <int> <int>
1        29272.   57327 -14636. 29277. 29304.   29271.       57325 57328
7.9.1.1.2 RAND coefficient table
Code
RAND.CoEff <- RAND %>%
 broom::tidy() %>%
    mutate(odds_ratio = if_else(term == '(Intercept)', 
                              as.numeric(NA),
                              round(exp(estimate),4))) %>%
  mutate(model = "RAND")

RAND.CoEff %>%
  knitr::kable()
|term             |   estimate| std.error|   statistic|   p.value| odds_ratio|model |
|:----------------|----------:|---------:|-----------:|---------:|----------:|:-----|
|(Intercept)      | -2.5908458| 0.0259362| -99.8930631| 0.0000000|         NA|RAND  |
|rand_classRand_B |  0.0098048| 0.0365777|   0.2680539| 0.7886578|     1.0099|RAND  |
|rand_classRand_C |  0.0499514| 0.0443244|   1.1269487| 0.2597642|     1.0512|RAND  |

Notice that when we filter for significant features we find only the intercept:

Code
RAND %>%
 broom::tidy() %>%
  filter(p.value <= .05) %>%
  mutate(odds_ratio = if_else(term == '(Intercept)', 
                              as.numeric(NA),
                              round(exp(estimate),4))) 
# A tibble: 1 × 6
  term        estimate std.error statistic p.value odds_ratio
  <chr>          <dbl>     <dbl>     <dbl>   <dbl>      <dbl>
1 (Intercept)    -2.59    0.0259     -99.9       0         NA

7.9.2 DISCREATE

Code
A_DATA.train %>%
  group_by(DIABETES_factor, Gender,  rand_class, age_bucket) %>%
  summarise(Pop_N = n_distinct(SEQN)) %>%
  ggplot(aes(x = age_bucket , y = rand_class , size = Pop_N,   color=DIABETES_factor)) +
  geom_point() + 
  facet_grid(DIABETES_factor ~ rand_class + Gender)
`summarise()` has grouped output by 'DIABETES_factor', 'Gender', 'rand_class'.
You can override using the `.groups` argument.

Code
Interactive_DISCREATE <- A_DATA.train %>%
  group_by(DIABETES_factor, Gender,  rand_class, age_bucket) %>%
  summarise(Pop_N = n_distinct(SEQN)) %>%
  ggplot(aes(x = age_bucket , y = rand_class , size = Pop_N,   color=DIABETES_factor)) +
  geom_point() + 
  facet_wrap(~Gender)
`summarise()` has grouped output by 'DIABETES_factor', 'Gender', 'rand_class'.
You can override using the `.groups` argument.
Code
Interactive_DISCREATE

Most evident from the above graphs is the effect age_bucket has on the number of diabetics: as the age buckets increase from 1 to 10, the size of the diabetic population bubble also increases.

We can run an ANOVA test on these features:

Code
aov(DIABETES ~ rand_class + age_bucket + Gender, 
    data = A_DATA.no_miss) %>%
  summary()
               Df Sum Sq Mean Sq  F value  Pr(>F)    
rand_class      2      0    0.03    0.497 0.60861    
age_bucket      9    786   87.35 1507.565 < 2e-16 ***
Gender          1      1    0.54    9.284 0.00231 ** 
Residuals   95534   5535    0.06                     
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Again, our findings here indicate that Gender and age_bucket have an effect on Diabetes but that rand_class does not.

Code
DISCREATE <- glm( DIABETES_factor ~ age_bucket + Gender,
                     data= A_DATA.train,
                     family = "binomial")

7.9.2.1 DISCREATE Model Output

Let's take a moment to review the coefficient table for this DISCREATE model:

Code
DISCREATE.coeff <- DISCREATE %>%
  broom::tidy() %>%
  mutate(odds_ratio = if_else(term == '(Intercept)', 
                              as.numeric(NA),
                              round(exp(estimate),4))) %>%
  mutate(model = "DISCREATE")

DISCREATE.coeff %>%
  knitr::kable()
|term         |   estimate| std.error|  statistic|   p.value| odds_ratio|model     |
|:------------|----------:|---------:|----------:|---------:|----------:|:---------|
|(Intercept)  | -8.0285860| 0.7073791| -11.349764| 0.0000000|         NA|DISCREATE |
|age_bucket2  |  1.4060198| 0.7907009|   1.778194| 0.0753720|     4.0797|DISCREATE |
|age_bucket3  |  2.5648976| 0.7349907|   3.489701| 0.0004836|    12.9993|DISCREATE |
|age_bucket4  |  2.7260585| 0.7304389|   3.732083| 0.0001899|    15.2726|DISCREATE |
|age_bucket5  |  3.0445348| 0.7242929|   4.203458| 0.0000263|    21.0003|DISCREATE |
|age_bucket6  |  4.2239857| 0.7125924|   5.927632| 0.0000000|    68.3052|DISCREATE |
|age_bucket7  |  5.3082911| 0.7091607|   7.485315| 0.0000000|   202.0047|DISCREATE |
|age_bucket8  |  6.1521485| 0.7081562|   8.687558| 0.0000000|   469.7255|DISCREATE |
|age_bucket9  |  6.7720159| 0.7078288|   9.567308| 0.0000000|   873.0702|DISCREATE |
|age_bucket10 |  6.7696177| 0.7078238|   9.563987| 0.0000000|   870.9788|DISCREATE |
|GenderMale   |  0.1038955| 0.0348196|   2.983822| 0.0028467|     1.1095|DISCREATE |

As we expected, as you get older your risk of diabetes increases.

However, these odds ratios may not be as helpful as we would like for interpretation. For instance, right now we can say that when a patient is in the highest age bucket (70 <= Age <= 85) the odds of having diabetes are 870.9788 times those of patients in the lowest age bucket; having the lowest bucket as the basis of comparison is not as helpful as it could be.
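
As a quick check on that number, the odds ratio is just the exponentiated coefficient from the table above:

Code
# Odds ratio for the highest age bucket relative to the default baseline (age_bucket1)
round(exp(6.7696177), 4)   # ~ 870.9788, matching the odds_ratio column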

7.9.2.2 re-leveling factors in the DISCREATE model

Sometimes it is helpful to re-level factors in R; within the context of logistic regression this lets us reset our comparison group so that we have a more even basis of comparison - let's re-level right on the 36 <= Age <= 47 age bucket:

Code
discrete_releveled <- A_DATA.train %>%
  mutate(age_bucket = fct_relevel(age_bucket,'7'))

Now we can fit a new logistic regression:

Code
DISCREATE_r <- glm( DIABETES_factor ~ age_bucket + Gender,
                     data= discrete_releveled,
                     family = "binomial")
Code
DISCREATE_r.coeff <- DISCREATE_r %>%
 broom::tidy() %>%
    mutate(odds_ratio = if_else(term == '(Intercept)', 
                              as.numeric(NA),
                              round(exp(estimate),4))) %>%
  mutate(model = "Refactored DISCREATE")
Code
DISCREATE_r.coeff <- Age_bucket_Table %>%
  mutate(term = paste0('age_bucket',age_bucket)) %>%
  left_join(DISCREATE_r.coeff) %>%
  mutate(model = "Refactored DISCREATE")
Joining with `by = join_by(term)`
Code
DISCREATE_r.coeff %>%  
  knitr::kable()
|age_bucket | mean_age| median_age|age_range       |term         |   estimate| std.error|  statistic| p.value| odds_ratio|model                |
|:----------|--------:|----------:|:---------------|:------------|----------:|---------:|----------:|-------:|----------:|:--------------------|
|1          |     2.22|          2|1 <= Age <= 4   |age_bucket1  | -5.3082911| 0.7091607|  -7.485315|       0|     0.0050|Refactored DISCREATE |
|2          |     6.52|          7|4 <= Age <= 9   |age_bucket2  | -3.9022713| 0.3578434| -10.904970|       0|     0.0202|Refactored DISCREATE |
|3          |    11.18|         11|9 <= Age <= 13  |age_bucket3  | -2.7433936| 0.2074982| -13.221286|       0|     0.0644|Refactored DISCREATE |
|4          |    15.64|         16|13 <= Age <= 18 |age_bucket4  | -2.5822326| 0.1907514| -13.537160|       0|     0.0756|Refactored DISCREATE |
|5          |    21.24|         21|18 <= Age <= 26 |age_bucket5  | -2.2637563| 0.1656540| -13.665573|       0|     0.1040|Refactored DISCREATE |
|6          |    31.02|         31|26 <= Age <= 36 |age_bucket6  | -1.0843054| 0.1030945| -10.517590|       0|     0.3381|Refactored DISCREATE |
|7          |    41.43|         41|36 <= Age <= 47 |age_bucket7  |         NA|        NA|         NA|      NA|         NA|Refactored DISCREATE |
|8          |    52.36|         52|47 <= Age <= 59 |age_bucket8  |  0.8438574| 0.0657799|  12.828506|       0|     2.3253|Refactored DISCREATE |
|9          |    63.91|         64|59 <= Age <= 70 |age_bucket9  |  1.4637248| 0.0621667|  23.545157|       0|     4.3220|Refactored DISCREATE |
|10         |    77.32|         78|70 <= Age <= 85 |age_bucket10 |  1.4613266| 0.0620938|  23.534191|       0|     4.3117|Refactored DISCREATE |

Notice that age_bucket7 now has no coefficient in the above table (it is the reference level) while age_bucket1 appears; this is because we re-leveled to level 7, so all of the odds ratios are now computed relative to the age group 36 <= Age <= 47.

When the odds ratio is 1, the coefficient in the logistic regression model is ln(1) = 0, meaning changes in the feature have no impact on the outcome.

If a patient is in the highest age bucket, then the odds of having diabetes are 4.3117 times those of patients aged 36 <= Age <= 47.

In addition to looking for large coefficients or odds ratios, we can also look for odds ratios between 0 and 1 - these features have the opposite effect, in that having them reduces the odds of the outcome compared to the baseline group.

So for this example, being in the lowest age bracket decreases a patient's odds of having diabetes by 99.5% compared to patients aged 36 <= Age <= 47.
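
A quick check of that arithmetic, using the re-leveled coefficient for the lowest age bucket from the table above:

Code
# Odds ratio for age_bucket1 relative to the 36 <= Age <= 47 baseline
or_bucket1 <- exp(-5.3082911)
round(or_bucket1, 4)                 # ~ 0.0050

# Percent decrease in the odds relative to the baseline group
round((1 - or_bucket1) * 100, 1)     # ~ 99.5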

7.9.3 INTERACTION

We can get interaction terms by using the * notation:

Code
INTERACTION <- glm(DIABETES_factor  ~ Age + Age*Gender ,
                     data= A_DATA.train,
                     family = "binomial") 

7.9.3.1 INTERACTION Coefficient Table

Code
INTERACTION

Call:  glm(formula = DIABETES_factor ~ Age + Age * Gender, family = "binomial", 
    data = A_DATA.train)

Coefficients:
   (Intercept)             Age      GenderMale  Age:GenderMale  
     -5.148106        0.054292       -0.156792        0.004703  

Degrees of Freedom: 57327 Total (i.e. Null);  57324 Residual
Null Deviance:      29270 
Residual Deviance: 23090    AIC: 23100
Code
INTERACTION.CoEff <- INTERACTION %>%
 broom::tidy() %>% 
 mutate(odds_ratio = if_else(term == '(Intercept)', 
                              as.numeric(NA),
                              round(exp(estimate),4))) %>%
 mutate(model= "Interaction")

INTERACTION.CoEff %>%
  knitr::kable()
|term           |   estimate| std.error|  statistic|   p.value| odds_ratio|model       |
|:--------------|----------:|---------:|----------:|---------:|----------:|:-----------|
|(Intercept)    | -5.1481064| 0.0743099| -69.278851| 0.0000000|         NA|Interaction |
|Age            |  0.0542923| 0.0011942|  45.461946| 0.0000000|     1.0558|Interaction |
|GenderMale     | -0.1567921| 0.1074018|  -1.459865| 0.1443271|     0.8549|Interaction |
|Age:GenderMale |  0.0047026| 0.0017205|   2.733323| 0.0062699|     1.0047|Interaction |

7.10 Compare Models

7.10.1 Coefficient Comparison

Code
CoEff_Compare <- bind_rows(RAND.CoEff,
          INTERACTION.CoEff,
          DISCREATE.coeff,
          DISCREATE_r.coeff)

CoEff_Compare %>%
  glimpse()
Rows: 28
Columns: 11
$ term       <chr> "(Intercept)", "rand_classRand_B", "rand_classRand_C", "(In…
$ estimate   <dbl> -2.590845847, 0.009804784, 0.049951350, -5.148106438, 0.054…
$ std.error  <dbl> 0.025936194, 0.036577659, 0.044324423, 0.074309928, 0.00119…
$ statistic  <dbl> -99.8930631, 0.2680539, 1.1269487, -69.2788510, 45.4619460,…
$ p.value    <dbl> 0.000000e+00, 7.886578e-01, 2.597642e-01, 0.000000e+00, 0.0…
$ odds_ratio <dbl> NA, 1.0099, 1.0512, NA, 1.0558, 0.8549, 1.0047, NA, 4.0797,…
$ model      <chr> "RAND", "RAND", "RAND", "Interaction", "Interaction", "Inte…
$ age_bucket <fct> NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA,…
$ mean_age   <dbl> NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA,…
$ median_age <dbl> NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA,…
$ age_range  <chr> NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA,…
Code
CoEff_Compare %>%
  mutate(Feature = reorder(term, estimate)) %>%
  ggplot(aes(y=Feature, x=estimate, color=model)) +
  geom_point() +
  geom_errorbarh(aes(xmax = estimate + std.error, xmin = estimate - std.error, height = .2)) +
  ggtitle(expression("Comparing model estimates of " ~ beta[i]))
Warning: Removed 1 row containing missing values or values outside the scale range
(`geom_point()`).
Warning: Removed 1 row containing missing values or values outside the scale range
(`geom_errorbarh()`).

Code
CoEff_Compare %>%
  mutate(Feature = reorder(term, odds_ratio)) %>%
  ggplot(aes(y=Feature, x=odds_ratio, color=model)) +
  geom_point() +
  geom_errorbarh(aes(xmax = exp(estimate + std.error), xmin = exp(estimate - std.error), height = .2)) +                  
  ggtitle(expression("Comparing model estimates of Odds Ratios" ~ exp(beta[i])))
Warning: Removed 4 rows containing missing values or values outside the scale range
(`geom_point()`).
Warning: Removed 1 row containing missing values or values outside the scale range
(`geom_errorbarh()`).

Code
CoEff_Compare %>%
  mutate(Feature = reorder(term, p.value)) %>%
  ggplot(aes( y = Feature, x = p.value , fill=model)) +
  geom_bar(stat = 'identity', position = 'dodge') +
  labs(title = "Comparing model Coefficent p-values")
Warning: Removed 1 row containing missing values or values outside the scale range
(`geom_bar()`).

7.10.2 Model Training Errors

Code
model_train_errors <- bind_rows(
  RAND %>% broom::glance() %>% mutate(model = 'RAND'),
  DISCREATE %>% broom::glance() %>% mutate(model = 'DISCREATE'),
  INTERACTION %>% broom::glance() %>% mutate(model = 'INTERACTION')
)

model_train_errors %>%
  select(model, AIC, BIC, logLik, colnames(model_train_errors)) %>%
  arrange(AIC)
# A tibble: 3 × 9
  model      AIC    BIC  logLik null.deviance df.null deviance df.residual  nobs
  <chr>    <dbl>  <dbl>   <dbl>         <dbl>   <int>    <dbl>       <int> <int>
1 DISCRE… 22375. 22474. -11177.        29272.   57327   22353.       57317 57328
2 INTERA… 23102. 23138. -11547.        29272.   57327   23094.       57324 57328
3 RAND    29277. 29304. -14636.        29272.   57327   29271.       57325 57328

7.11 Logit Model Score Helper Function

We have three models (RAND, DISCRETE, & INTERACTION), and in each of the models we need to:

  1. score the training dataset
  2. find a threshold
  3. score the test data, and use the threshold to define predicted classes

If we review the str of the RAND model:

Code
str(RAND,1)
List of 30
 $ coefficients     : Named num [1:3] -2.5908 0.0098 0.05
  ..- attr(*, "names")= chr [1:3] "(Intercept)" "rand_classRand_B" "rand_classRand_C"
 $ residuals        : Named num [1:57328] -1.07 -1.07 -1.07 -1.08 -1.07 ...
  ..- attr(*, "names")= chr [1:57328] "1" "2" "3" "4" ...
 $ fitted.values    : Named num [1:57328] 0.0697 0.0697 0.0697 0.0704 0.0697 ...
  ..- attr(*, "names")= chr [1:57328] "1" "2" "3" "4" ...
 $ effects          : Named num [1:57328] 158.078 -0.219 1.127 -0.275 -0.271 ...
  ..- attr(*, "names")= chr [1:57328] "(Intercept)" "rand_classRand_B" "rand_classRand_C" "" ...
 $ R                : num [1:3, 1:3] -61.3 0 0 -24.5 30 ...
  ..- attr(*, "dimnames")=List of 2
 $ rank             : int 3
 $ qr               :List of 5
  ..- attr(*, "class")= chr "qr"
 $ family           :List of 13
  ..- attr(*, "class")= chr "family"
 $ linear.predictors: Named num [1:57328] -2.59 -2.59 -2.59 -2.58 -2.59 ...
  ..- attr(*, "names")= chr [1:57328] "1" "2" "3" "4" ...
 $ deviance         : num 29271
 $ aic              : num 29277
 $ null.deviance    : num 29272
 $ iter             : int 5
 $ weights          : Named num [1:57328] 0.0649 0.0649 0.0649 0.0654 0.0649 ...
  ..- attr(*, "names")= chr [1:57328] "1" "2" "3" "4" ...
 $ prior.weights    : Named num [1:57328] 1 1 1 1 1 1 1 1 1 1 ...
  ..- attr(*, "names")= chr [1:57328] "1" "2" "3" "4" ...
 $ df.residual      : int 57325
 $ df.null          : int 57327
 $ y                : Named num [1:57328] 0 0 0 0 0 0 0 0 0 0 ...
  ..- attr(*, "names")= chr [1:57328] "1" "2" "3" "4" ...
 $ converged        : logi TRUE
 $ boundary         : logi FALSE
 $ model            :'data.frame':  57328 obs. of  2 variables:
  ..- attr(*, "terms")=Classes 'terms', 'formula'  language DIABETES_factor ~ rand_class
  .. .. ..- attr(*, "variables")= language list(DIABETES_factor, rand_class)
  .. .. ..- attr(*, "factors")= int [1:2, 1] 0 1
  .. .. .. ..- attr(*, "dimnames")=List of 2
  .. .. ..- attr(*, "term.labels")= chr "rand_class"
  .. .. ..- attr(*, "order")= int 1
  .. .. ..- attr(*, "intercept")= int 1
  .. .. ..- attr(*, "response")= int 1
  .. .. ..- attr(*, ".Environment")=<environment: R_GlobalEnv> 
  .. .. ..- attr(*, "predvars")= language list(DIABETES_factor, rand_class)
  .. .. ..- attr(*, "dataClasses")= Named chr [1:2] "factor" "factor"
  .. .. .. ..- attr(*, "names")= chr [1:2] "DIABETES_factor" "rand_class"
 $ call             : language glm(formula = DIABETES_factor ~ rand_class, family = "binomial", data = A_DATA.train)
 $ formula          :Class 'formula'  language DIABETES_factor ~ rand_class
  .. ..- attr(*, ".Environment")=<environment: R_GlobalEnv> 
 $ terms            :Classes 'terms', 'formula'  language DIABETES_factor ~ rand_class
  .. ..- attr(*, "variables")= language list(DIABETES_factor, rand_class)
  .. ..- attr(*, "factors")= int [1:2, 1] 0 1
  .. .. ..- attr(*, "dimnames")=List of 2
  .. ..- attr(*, "term.labels")= chr "rand_class"
  .. ..- attr(*, "order")= int 1
  .. ..- attr(*, "intercept")= int 1
  .. ..- attr(*, "response")= int 1
  .. ..- attr(*, ".Environment")=<environment: R_GlobalEnv> 
  .. ..- attr(*, "predvars")= language list(DIABETES_factor, rand_class)
  .. ..- attr(*, "dataClasses")= Named chr [1:2] "factor" "factor"
  .. .. ..- attr(*, "names")= chr [1:2] "DIABETES_factor" "rand_class"
 $ data             : tibble [57,328 × 7] (S3: tbl_df/tbl/data.frame)
  ..- attr(*, "na.action")= 'omit' Named int [1:5769] 19 30 46 48 51 63 99 119 160 173 ...
  .. ..- attr(*, "names")= chr [1:5769] "19" "30" "46" "48" ...
 $ offset           : NULL
 $ control          :List of 3
 $ method           : chr "glm.fit"
 $ contrasts        :List of 1
 $ xlevels          :List of 1
 - attr(*, "class")= chr [1:2] "glm" "lm"

We notice that there is a $data element - this is the training data stored in the model object RAND:

Code
arsenal::comparedf(A_DATA.train, RAND$data)
Compare Object

Function Call: 
arsenal::comparedf(x = A_DATA.train, y = RAND$data)

Shared: 7 non-by variables and 57328 observations.
Not shared: 0 variables and 0 observations.

Differences found in 0/7 variables compared.
0 variables compared have non-identical attributes.

Rather than performing the same operation 3 times in a row, we will use the fact that the training data exists within the glm model output to functionalize our scoring procedure from the last chapter. We encourage you to review Section 6.12.2 and Section 6.12.3; notice the changes made from there to the function below.

We create a function logit_model_scorer as follows:

Code
logit_model_scorer <- function(my_model, my_data){ 
  # extracts models name 
  my_model_name <- deparse(substitute(my_model))
  
  # store model name into a new column called model
  data.s <- my_data %>%
    mutate(model = my_model_name)
  
  # store the training data someplace
  train_data.s <- my_model$data
  
  # score the training data 
  train_data.s$probs <- predict(my_model, 
                                train_data.s, 
                                'response')
  
  # threshold query
  threshold_value_query <- train_data.s %>% 
                        group_by(DIABETES_factor) %>%
                        summarise(mean_prob = mean(probs, na.rm=TRUE)) %>%
                        ungroup() %>%
                        filter(DIABETES_factor == 1)
  # threshold value                      
  threshold_value <- threshold_value_query$mean_prob
  
  # score test data
  data.s$probs <- predict(my_model, 
                          data.s, 
                          'response')
  
  # use threshold to make prediction  
  data.s <- data.s %>%
    mutate(pred = if_else(probs > threshold_value, 1,0)) %>%
    mutate(pred_factor = as.factor(pred)) %>%
    mutate(pred_factor = relevel(pred_factor, ref='0'))             
  
  # return scored data 
  return(data.s)
}

In the function above we take in two parameters: my_model, a logistic regression model, and my_data, an analytical dataset. From there we define data.s as the “new data” we wish to make a prediction on, and train_data.s as the training data stored within the model object.

We then follow the steps outlined in the previous chapter to:

  1. Compute a threshold value as the mean probability of the DIABETIC class on the training set.

  2. Use the threshold to make a predicted class on the “new data”.

What does my_model_name <- deparse(substitute(my_model)) do?

Here is the R help for substitute & deparse:

substitute - returns the parse tree for the (unevaluated) expression expr, substituting any variables bound in env.

deparse - Turn unevaluated expressions into character strings.

Let’s explore what happens when we pass our stored model RAND:

Code
substitute(RAND)
RAND

tells R to capture the expression as an unevaluated expression, RAND (note the lack of quotes); then

Code
deparse(substitute(RAND))
[1] "RAND"

turns the unevaluated expression into the character string "RAND" (note the quotes), which is exactly what we need as the input to the mutate step model = my_model_name.

Similarly,

Code
substitute(INTERACTION)
INTERACTION
Code
deparse(substitute(INTERACTION))
[1] "INTERACTION"

7.11.1 Test Function

It’s always good practice to try and test your function:

Code
logit_model_scorer(RAND , A_DATA.test) %>%
  glimpse()
Rows: 38,219
Columns: 11
$ SEQN            <dbl> 6917, 81122, 43687, 95673, 47615, 42065, 33500, 3053, …
$ DIABETES        <dbl> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, …
$ DIABETES_factor <fct> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, …
$ Age             <dbl> 11, 37, 1, 80, 34, 59, 53, 31, 9, 59, 36, 78, 73, 52, …
$ age_bucket      <fct> 3, 7, 1, 10, 6, 9, 8, 6, 3, 9, 6, 10, 10, 8, 10, 3, 2,…
$ Gender          <chr> "Male", "Female", "Female", "Male", "Male", "Female", …
$ rand_class      <fct> Rand_A, Rand_A, Rand_C, Rand_B, Rand_C, Rand_A, Rand_B…
$ model           <chr> "RAND", "RAND", "RAND", "RAND", "RAND", "RAND", "RAND"…
$ probs           <dbl> 0.06972989, 0.06972989, 0.07304059, 0.07036860, 0.0730…
$ pred            <dbl> 0, 0, 1, 0, 1, 0, 0, 0, 0, 1, 0, 0, 0, 0, 0, 0, 0, 0, …
$ pred_factor     <fct> 0, 0, 1, 0, 1, 0, 0, 0, 0, 1, 0, 0, 0, 0, 0, 0, 0, 0, …

We see that our function worked; it has successfully run our protocol.

7.11.2 Make Predictions

We can deploy our function as follows:

Code
A_DATA.test.scored.compare <- bind_rows(
  logit_model_scorer(RAND , A_DATA.test) ,
  logit_model_scorer(DISCREATE, A_DATA.test),
  logit_model_scorer(INTERACTION , A_DATA.test)
  )
Code
A_DATA.test.scored.compare %>%
  glimpse()
Rows: 114,657
Columns: 11
$ SEQN            <dbl> 6917, 81122, 43687, 95673, 47615, 42065, 33500, 3053, …
$ DIABETES        <dbl> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, …
$ DIABETES_factor <fct> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, …
$ Age             <dbl> 11, 37, 1, 80, 34, 59, 53, 31, 9, 59, 36, 78, 73, 52, …
$ age_bucket      <fct> 3, 7, 1, 10, 6, 9, 8, 6, 3, 9, 6, 10, 10, 8, 10, 3, 2,…
$ Gender          <chr> "Male", "Female", "Female", "Male", "Male", "Female", …
$ rand_class      <fct> Rand_A, Rand_A, Rand_C, Rand_B, Rand_C, Rand_A, Rand_B…
$ model           <chr> "RAND", "RAND", "RAND", "RAND", "RAND", "RAND", "RAND"…
$ probs           <dbl> 0.06972989, 0.06972989, 0.07304059, 0.07036860, 0.0730…
$ pred            <dbl> 0, 0, 1, 0, 1, 0, 0, 0, 0, 1, 0, 0, 0, 0, 0, 0, 0, 0, …
$ pred_factor     <fct> 0, 0, 1, 0, 1, 0, 0, 0, 0, 1, 0, 0, 0, 0, 0, 0, 0, 0, …

7.12 Evaluate models

7.12.1 ROC Curve

Code
library(yardstick)

Attaching package: 'yardstick'
The following object is masked from 'package:readr':

    spec
Code
levels(A_DATA.test.scored.compare$DIABETES_factor)
[1] "0" "1"

Currently our event-level is the second option, 1!

Code
AUC.DM2_Gender <- A_DATA.test.scored.compare %>%
                    group_by(model) %>%
                    roc_auc(truth = DIABETES_factor, probs, event_level = "second") %>%
                    select(model, .estimate) %>%
                    rename(AUC_estimate = .estimate) %>%
                    mutate(model_AUC = paste0(model, " AUC : ", round(AUC_estimate,4)))

AUC.DM2_Gender
# A tibble: 3 × 3
  model       AUC_estimate model_AUC               
  <chr>              <dbl> <chr>                   
1 DISCREATE          0.849 DISCREATE AUC : 0.8493  
2 INTERACTION        0.845 INTERACTION AUC : 0.8452
3 RAND               0.501 RAND AUC : 0.5013       
Code
A_DATA.test.scored.compare %>%
  left_join(AUC.DM2_Gender) %>%
  group_by(model_AUC) %>%
  roc_curve(truth= DIABETES_factor, probs, event_level = "second") %>%
  autoplot() +
  labs(title = "ROC Curve " ,
       subtitle = 'test set scored on logistic regression')
Joining with `by = join_by(model)`

7.12.2 Precision-Recall Curve

Code
PR_AUC <- A_DATA.test.scored.compare %>%
  group_by(model) %>%
  pr_auc(DIABETES_factor, probs, event_level='second') %>%
  select(model, .estimate) %>%
  rename(PR_AUC_estimate = .estimate) %>%
  mutate(model_PR_AUC = paste0(model, " PR_AUC : ", round(PR_AUC_estimate,4)))

A_DATA.test.scored.compare %>%
  left_join(PR_AUC) %>%
  arrange(PR_AUC_estimate) %>%
  group_by(model_PR_AUC) %>%
  pr_curve(DIABETES_factor, probs, event_level='second') %>%
  autoplot() +
  labs(title = "PR Curve " ,
       subtitle = 'test set scored on logistic regression')
Joining with `by = join_by(model)`

7.12.3 Gain Curves

The gain curve below, for instance, showcases that if we review the top 25% of probabilities within the DISCREATE and INTERACTION models, we will find roughly 75% of the diabetic population.

Code
A_DATA.test.scored.compare %>%
  group_by(model) %>%
  gain_curve(truth = DIABETES_factor, probs, event_level = "second") %>%
  autoplot()

7.12.4 Lift Curves

  1. The DISCREATE lift curve is nearly monotonically decreasing, from about 3.5 down to about 3.0, across the top 25% of probabilities in the model.
  2. The INTERACTION lift curve rises to its peak more gradually; the DISCREATE model only seems to have an advantage in the top 12.5% of high-probability scores, after which the two curves are nearly identical.
  3. RAND appears to perform slightly worse than selecting members at random.
Code
A_DATA.test.scored.compare %>%
  group_by(model) %>%
  lift_curve(truth = DIABETES_factor, probs, event_level = "second") %>%
  autoplot()

7.12.5 Confusion Matrix

Code
CM <- A_DATA.test.scored.compare %>%
  group_by(model) %>%
  conf_mat(DIABETES_factor, pred_factor)

CM %>%
  glimpse()
Rows: 3
Columns: 2
$ model    <chr> "DISCREATE", "INTERACTION", "RAND"
$ conf_mat <list> [29606, 5856, 958, 1799], [31429, 4033, 1519, 1238], [28333, …

Here are the three confusion matrices:

Code
CM$conf_mat
[[1]]
          Truth
Prediction     0     1
         0 29606   958
         1  5856  1799

[[2]]
          Truth
Prediction     0     1
         0 31429  1519
         1  4033  1238

[[3]]
          Truth
Prediction     0     1
         0 28333  2209
         1  7129   548

Let's get heat maps for the 3 confusion matrices; below we see the map function at work again. Recall a typical call to map looks like map(.x, .f, ...); here:

  • .x - A list or atomic vector (CM$conf_mat - list of confusion matrices)

  • .f - A function (autoplot)

  • ... - Additional arguments passed on to the mapped function

    • In this case type = "heatmap" are the additional arguments passed into the function .f = autoplot

Notice also, this is not inside a mutate function as we are not adding any variables to our data-frame. We will discuss other functions from purrr in the next chapter.

Code
map(CM$conf_mat, autoplot, type = "heatmap")
[[1]]


[[2]]


[[3]]

Above we see we get three graphs, one for each of the three confusion matrices.

7.12.6 Model Evaluation Metrics

Code
A_DATA.test.scored.compare %>%
  group_by(model) %>%
  conf_mat(truth = DIABETES_factor, pred_factor) %>%
  mutate(sum_conf_mat = map(conf_mat,summary, event_level = "second")) %>%
  unnest(sum_conf_mat) %>%
  select(model, .metric, .estimate) %>%
  ggplot(aes(x = model, y = .estimate, fill = model)) +
  geom_bar(stat='identity', position = 'dodge') +
  facet_wrap(. ~ .metric) +
  theme(axis.text.x = element_text(angle = 90))