7 Two Factor Classification with Categorical and Continuous Interactions

In this Chapter we will use the same dataset and will only consider non-missing values of Age, Gender, and Diabetic status.

We will follow much the same process of EDA with categorical features and making connections between the p-value of the chi square test and feature importance in logistic regression.

In addition, we will also have a more in-depth discussion around re-leveling logistic regression analysis and the impact it has on model interpret-ability.

This time we will have 3 primary model examples to compare:

Logistic regression with rand_class feature
Logistic regression with binned age and Gender features
Logistic regression with Age interacted with Gender

Code

A_DATA <- readRDS(here::here('DATA/Part_2/A_DATA.RDS'))

library('tidyverse')

── Attaching core tidyverse packages ──────────────────────── tidyverse 2.0.0 ──
✔ dplyr     1.1.4     ✔ readr     2.1.5
✔ forcats   1.0.0     ✔ stringr   1.5.1
✔ ggplot2   3.5.1     ✔ tibble    3.2.1
✔ lubridate 1.9.3     ✔ tidyr     1.3.1
✔ purrr     1.0.2     
── Conflicts ────────────────────────────────────────── tidyverse_conflicts() ──
✖ dplyr::filter() masks stats::filter()
✖ dplyr::lag()    masks stats::lag()
ℹ Use the conflicted package (<http://conflicted.r-lib.org/>) to force all conflicts to become errors

Code

A_DATA <- A_DATA %>%
  select(SEQN, Age, Gender, DIABETES) %>%
  mutate(DIABETES_factor = as.factor(DIABETES)) %>%
  na.omit() 

A_DATA %>%
  glimpse()

Rows: 95,547
Columns: 5
$ SEQN            <dbl> 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15, 16,…
$ Age             <dbl> 2, 77, 10, 1, 49, 19, 59, 13, 11, 43, 15, 37, 70, 81, …
$ Gender          <chr> "Female", "Male", "Female", "Male", "Male", "Female", …
$ DIABETES        <dbl> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 0, 0, 0, 0, 0, …
$ DIABETES_factor <fct> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 0, 0, 0, 0, 0, …

7.1 Bin Age

Additionally, we will create deciles for the Age range:

Code

A_DATA <- A_DATA %>%
  mutate(age_bucket = as.factor(ntile(Age,10)))

A_DATA %>%
  head()

# A tibble: 6 × 6
   SEQN   Age Gender DIABETES DIABETES_factor age_bucket
  <dbl> <dbl> <chr>     <dbl> <fct>           <fct>     
1     1     2 Female        0 0               1         
2     2    77 Male          0 0               10        
3     3    10 Female        0 0               3         
4     4     1 Male          0 0               1         
5     5    49 Male          0 0               8         
6     6    19 Female        0 0               5

Here’s a nice summary table of the various Age buckets we created:

Code

Age_bucket_Table <- A_DATA %>%
  select(Age, age_bucket) %>%
  group_by(age_bucket) %>%
  summarise(min_age = min(Age),
            mean_age = round(mean(Age),2),
            median_age = median(Age),
            max_age = max(Age),
            .groups = 'keep') %>%
  ungroup() %>%
  mutate(age_range = paste0(min_age, ' <= Age <= ', max_age )) %>%
  select(-min_age, -max_age) %>%
  arrange(age_bucket)

Age_bucket_Table %>%
  print(n=10)

# A tibble: 10 × 4
   age_bucket mean_age median_age age_range      
   <fct>         <dbl>      <dbl> <chr>          
 1 1              2.22          2 1 <= Age <= 4  
 2 2              6.52          7 4 <= Age <= 9  
 3 3             11.2          11 9 <= Age <= 13 
 4 4             15.6          16 13 <= Age <= 18
 5 5             21.2          21 18 <= Age <= 26
 6 6             31.0          31 26 <= Age <= 36
 7 7             41.4          41 36 <= Age <= 47
 8 8             52.4          52 47 <= Age <= 59
 9 9             63.9          64 59 <= Age <= 70
10 10            77.3          78 70 <= Age <= 85

7.2 Add noise feature

Additionally, we add a variable rand_class to the data-set. rand_class will be sampled from the options of Rand_A or Rand_B. We will then randomly sort the data, and for every 5th row we will assign Rand_C to rand_class. Our reasons for adding this random feature will be to compare and contrast a random information to information that is linked with the issue at hand.

Code

set.seed(8675309)

A_DATA$rand_class <- sample(c('Rand_A','Rand_B'), 
                            size = nrow(A_DATA),  
                            replace = TRUE)

A_DATA$rand_sort <- runif(nrow(A_DATA))

A_DATA <- A_DATA %>%
  arrange(rand_sort) %>%
  mutate(rn = row_number()) %>%
  mutate(rn_mod_5 = rn %% 5 ) %>% 
  mutate(rand_class = if_else( rn_mod_5 == 0, 
                              "Rand_C", 
                              rand_class)) %>%
  select(-rn, -rn_mod_5, -rand_sort) %>%
  mutate(rand_class = as.factor(rand_class))

Code

A_DATA %>%
  head()

# A tibble: 6 × 7
   SEQN   Age Gender DIABETES DIABETES_factor age_bucket rand_class
  <dbl> <dbl> <chr>     <dbl> <fct>           <fct>      <fct>     
1 83229    10 Male          0 0               3          Rand_A    
2  6917    11 Male          0 0               3          Rand_A    
3  9902    23 Female        0 0               5          Rand_A    
4 81122    37 Female        0 0               7          Rand_A    
5 43687     1 Female        0 0               1          Rand_C    
6 95673    80 Male          0 0               10         Rand_B

7.3 `dplyr` statistics

For categorical data we are typically reviewing counts between various groups:

Code

A_DATA %>%
  group_by(DIABETES, Gender) %>%
  tally() %>%
  pivot_wider(names_from = Gender, values_from = n)

# A tibble: 2 × 3
# Groups:   DIABETES [2]
  DIABETES Female  Male
     <dbl>  <int> <int>
1        0  45188 43552
2        1   3371  3436

Code

A_DATA %>%
  group_by(DIABETES, rand_class) %>%
  tally() %>%
  pivot_wider(names_from = rand_class, values_from = n)

# A tibble: 2 × 4
# Groups:   DIABETES [2]
  DIABETES Rand_A Rand_B Rand_C
     <dbl>  <int>  <int>  <int>
1        0  35434  35580  17726
2        1   2682   2742   1383

7.4 `chisq.test`

Suppose that $n$ observations in a random sample from a population are classified into $k$ mutually exclusive classes with respective observed numbers $O_i$ (for $i = 1, 2, \ldots, k$).

The null hypothesis in the chi-square test assumes $p_i$ is the probability that an observation falls into the $i$-th class.

Therefore, for all $i$, we have an expected frequencies: \[E_{i} = n \cdot p_{i}\]

Note since the $p_i$ are probabilities we have that:

\[\sum_{i=1}^k p_i = 1 \] Since there are $n$ observations we have:

\[ n = n \sum_{i=1}^k p_i = \sum_{i=1}^k n \cdot p_i = \sum_{i=1}^k E_{i} \]

However, we have observed values:

\[ \sum_{i=1}^k O_i = n \]

Therefore, under the null hypothesis of the chi-square we have that:

\[ \sum_{i=1}^k E_{i} = n \sum_{i=1}^kp_i = \sum_{i=1}^kO_i \] For the chi-squared test statistic We define :

\[ \chi^2 \sim \sum_{i=1}^k\frac{(O_i - E_i)^2}{E_i} = n \cdot \sum_{i=1}^k \frac{ \left( \frac{O_i}{n} - p_i \right)^2 }{p_i}\]

where $\chi^2$ is the chi-square distribution with:

$k-1$ degrees of freedom
$O_{i}$ = the number of observations of type $i$.
$E_{i} = n \cdot p_{i}$ = the expected (theoretical) frequency of type $i$, asserted by the null hypothesis that the fraction of type $i$ in the population is $p_{i}$

7.5 `Gender`

In this section we will compare t-tests on Gender between groups of DIABETES.

First we can create a table of Gender and DIABETES:

Code

tbl_DM2_Gender <- table(A_DATA$DIABETES, A_DATA$Gender)

tbl_DM2_Gender

   
    Female  Male
  0  45188 43552
  1   3371  3436

Notice this gives the same result as our dplyr query.

Let’s compare the mean ages between the Diabetic and non-Diabetic populations, here’s a look at the output:

Code

chisq.tbl_DM2_Gender <- table(A_DATA$DIABETES, A_DATA$Gender) |>
  chisq.test()

Code

chisq.tbl_DM2_Gender


    Pearson's Chi-squared test with Yates' continuity correction

data:  table(A_DATA$DIABETES, A_DATA$Gender)
X-squared = 4.8966, df = 1, p-value = 0.02691

let’s review the structure

Code

str(chisq.tbl_DM2_Gender,1)

List of 9
 $ statistic: Named num 4.9
  ..- attr(*, "names")= chr "X-squared"
 $ parameter: Named int 1
  ..- attr(*, "names")= chr "df"
 $ p.value  : num 0.0269
 $ method   : chr "Pearson's Chi-squared test with Yates' continuity correction"
 $ data.name: chr "table(A_DATA$DIABETES, A_DATA$Gender)"
 $ observed : 'table' int [1:2, 1:2] 45188 3371 43552 3436
  ..- attr(*, "dimnames")=List of 2
 $ expected : num [1:2, 1:2] 45100 3459 43640 3348
  ..- attr(*, "dimnames")=List of 2
 $ residuals: 'table' num [1:2, 1:2] 0.417 -1.504 -0.423 1.529
  ..- attr(*, "dimnames")=List of 2
 $ stdres   : 'table' num [1:2, 1:2] 2.23 -2.23 -2.23 2.23
  ..- attr(*, "dimnames")=List of 2
 - attr(*, "class")= chr "htest"

For instance here are the observed:

Code

chisq.tbl_DM2_Gender$observed

   
    Female  Male
  0  45188 43552
  1   3371  3436

Note this is the same as tbl_DM2_Gender

Code

tbl_DM2_Gender == chisq.tbl_DM2_Gender$observed

   
    Female Male
  0   TRUE TRUE
  1   TRUE TRUE

The expected distribution should be:

Code

chisq.tbl_DM2_Gender$expected

   
       Female      Male
  0 45099.539 43640.461
  1  3459.461  3347.539

The residulas are Person residuals and are defined as (observed - expected) / sqrt(expected):

Code

chisq.tbl_DM2_Gender$residuals

   
        Female       Male
  0  0.4165484 -0.4234546
  1 -1.5039980  1.5289337

Code

(chisq.tbl_DM2_Gender$observed - chisq.tbl_DM2_Gender$expected) / sqrt(chisq.tbl_DM2_Gender$expected ) == chisq.tbl_DM2_Gender$residuals

   
    Female Male
  0   TRUE TRUE
  1   TRUE TRUE

7.5.1 p-value review

Review the p-value of the chi-square test between two groups.

The null hypothesis is that each person’s gender is independent of the person’s diabetic status.

If the p-value is

greater than .05 then we accept the null-hypothesis
less than .05 then we reject the null-hypothesis

7.5.2 `broom`

We can tidy-up the results if we utilize the broom package. Recall that broom::tidy tells R to look in the broom package for the tidy function. The tidy function in the broom library converts many base R outputs to tibbles.

Code

tibble_chisq.tbl_DM2_Gender <- broom::tidy(chisq.tbl_DM2_Gender) %>%
  mutate(dist_1 = "Non_DM2.Gender") %>%
  mutate(dist_2 = "DM2.Gender")

the broom::tidy transforms the list return of chisq.test output into a tibble we can also add variable to help us remember where the data came from.

Code

tibble_chisq.tbl_DM2_Gender %>%
  glimpse()

Rows: 1
Columns: 6
$ statistic <dbl> 4.896648
$ p.value   <dbl> 0.02690888
$ parameter <int> 1
$ method    <chr> "Pearson's Chi-squared test with Yates' continuity correctio…
$ dist_1    <chr> "Non_DM2.Gender"
$ dist_2    <chr> "DM2.Gender"

Our current p-value is tibble_chisq.tbl_DM2_Gender$p.value = 0.0269089, since the p-value is less than .05 we reject the null-hypothesis, Gender is not independent of the person’s diabetic status.

7.6 Associated $\chi^2$ Plots

7.6.1 Balloon Plot

This balloon plot shows the frequency table, now the area of the circle is proportional to the frequency counts.

Code

gplots::balloonplot(tbl_DM2_Gender,
                    main ="Balloon Plot for Gender by Diabetes \n Area is proportional to Freq.")

Figure 7.1: Balloon Plot for Gender by Diabetes where area is proportional to Freq.

7.6.2 `vcd::mosaic`

With the vcd::mosaic function you get another “area graph” this time with the groups highlighted by color:

Code

library(vcd)

Loading required package: grid

Code

mosaic(A_DATA$DIABETES ~ A_DATA$Gender,
       gp = shading_max,
       main = "Mosaic plot of Gender by Diabetic Status")

Figure 7.2: Mosaic plot of Gender by Diabetic Status.

In the vcd::mosaic plot:

Blue indicates that the observed value is higher than the expected value if the data were random.
Red color specifies that the observed value is lower than the expected value if the data were random.

For example, from this mosaic plot, it can be seen that the number of Diabetic males is higher than expected.

7.6.3 `corrplot::corrplot`

Code

corrplot::corrplot(chisq.tbl_DM2_Gender$residuals,
                   is.cor = FALSE)

Figure 7.3: graphical display of a correlation matrix we set is.cor = FALSE to tell R this is a general matrix

In the corrplot::corrplot plot:

Positive residuals are in blue. Positive values specify an attraction (positive association) between the corresponding row and column variables. In the image above, we can see there are an association between DIABETES and being a Male.
Negative residuals are in red. This implies a repulsion (negative association) between the corresponding row and column variables. For example the DIABETICs are negatively associated (or “not associated”) with the Females.

7.7 `rand_class` `chisq.test`

As we did with Gender above, we can extract lists of rand_classs for each type of DIABETES:

Code

tbl_DM2_rand_class <- table(A_DATA$DIABETES, A_DATA$rand_class)
tbl_DM2_rand_class

   
    Rand_A Rand_B Rand_C
  0  35434  35580  17726
  1   2682   2742   1383

Let’s compute chisq.test for rand_class and the Diabetic and non-Diabetic populations:

Code

chisq.test.tbl_DM2_rand_class <- chisq.test(tbl_DM2_rand_class)
chisq.test.tbl_DM2_rand_class


    Pearson's Chi-squared test

data:  tbl_DM2_rand_class
X-squared = 0.86969, df = 2, p-value = 0.6474

Code

tibble_chisq.test.tbl_DM2_rand_class <- broom::tidy(chisq.test.tbl_DM2_rand_class) %>%
  mutate(dist_1 = "Non_DM2.rand_class") %>%
  mutate(dist_2 = "DM2.rand_class")

tibble_chisq.test.tbl_DM2_rand_class %>%
  glimpse()

Rows: 1
Columns: 6
$ statistic <dbl> 0.8696855
$ p.value   <dbl> 0.6473665
$ parameter <int> 2
$ method    <chr> "Pearson's Chi-squared test"
$ dist_1    <chr> "Non_DM2.rand_class"
$ dist_2    <chr> "DM2.rand_class"

7.7.1 plots

Code

gplots::balloonplot(tbl_DM2_rand_class, 
                    main ="Balloon Plot for rand_class by Diabetes \n Area is proportional to Freq.")

Code

mosaic(A_DATA$DIABETES ~ A_DATA$rand_class, 
       gp = shading_max,
       main = "Mosaic plot of rand_class by Diabetic status")

Again, notice the lack of color and the p-value of 0.6473665

Code

corrplot::corrplot(chisq.test.tbl_DM2_rand_class$residuals, 
                   is.cor = FALSE)

We can see there are an association between DIABETES and being in Rand_C, however, it’s not a strong enough to give us a statistically significant chi-square p-value.

Code

bind_rows(tibble_chisq.tbl_DM2_Gender,
          tibble_chisq.test.tbl_DM2_rand_class) %>%
  select(statistic, p.value, dist_1, dist_2)

# A tibble: 2 × 4
  statistic p.value dist_1             dist_2        
      <dbl>   <dbl> <chr>              <chr>         
1     4.90   0.0269 Non_DM2.Gender     DM2.Gender    
2     0.870  0.647  Non_DM2.rand_class DM2.rand_class

We see that the p.value is near 0 for Gender but close to 1 for rand_class, our interpretation is that Gender has a statistically significant impact on Diabetic status whereas rand_class does not.

7.8 Non-Missing Data

Code

A_DATA.no_miss <- A_DATA %>%
  select(SEQN, DIABETES, DIABETES_factor, Age, age_bucket, Gender, rand_class) %>%
  mutate(DIABETES_factor = fct_relevel(DIABETES_factor, c('0','1') ) ) %>%
  na.omit()

7.8.1 ANOVA

Code

aov(DIABETES ~ age_bucket, 
    data = A_DATA.no_miss) %>%
  summary()

               Df Sum Sq Mean Sq F value Pr(>F)    
age_bucket      9    786   87.35    1507 <2e-16 ***
Residuals   95537   5536    0.06                   
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Code

aov(DIABETES ~ Gender, 
    data = A_DATA.no_miss) %>%
  summary()

               Df Sum Sq Mean Sq F value Pr(>F)  
Gender          1      0  0.3277   4.953 0.0261 *
Residuals   95545   6322  0.0662                 
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Code

aov(DIABETES ~ rand_class, 
    data = A_DATA.no_miss) %>%
  summary()

               Df Sum Sq Mean Sq F value Pr(>F)
rand_class      2      0 0.02877   0.435  0.647
Residuals   95544   6322 0.06617

Code

aov(DIABETES ~ rand_class + age_bucket + Gender, 
    data = A_DATA.no_miss) %>%
  summary()

               Df Sum Sq Mean Sq  F value  Pr(>F)    
rand_class      2      0    0.03    0.497 0.60861    
age_bucket      9    786   87.35 1507.565 < 2e-16 ***
Gender          1      1    0.54    9.284 0.00231 ** 
Residuals   95534   5535    0.06                     
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

7.8.2 split data

Code

sample.SEQN <- sample(A_DATA.no_miss$SEQN,
                      size = nrow(A_DATA.no_miss)*.6,
                      replace = FALSE)

Code

A_DATA.train <- A_DATA.no_miss %>%
  filter(SEQN %in% sample.SEQN)

A_DATA.test <- A_DATA.no_miss %>%
  filter(!(SEQN %in% sample.SEQN))

7.9 Train Logistic Regression Models

7.9.1 RAND

Code

RAND <- glm(DIABETES_factor ~ rand_class,
                     data= A_DATA.train,
                     family = "binomial")

7.9.1.1 RAND Model Output

Code

library('broom')

RAND


Call:  glm(formula = DIABETES_factor ~ rand_class, family = "binomial", 
    data = A_DATA.train)

Coefficients:
     (Intercept)  rand_classRand_B  rand_classRand_C  
       -2.590846          0.009805          0.049951  

Degrees of Freedom: 57327 Total (i.e. Null);  57325 Residual
Null Deviance:      29270 
Residual Deviance: 29270    AIC: 29280

Code

summary(RAND)


Call:
glm(formula = DIABETES_factor ~ rand_class, family = "binomial", 
    data = A_DATA.train)

Coefficients:
                  Estimate Std. Error z value Pr(>|z|)    
(Intercept)      -2.590846   0.025936 -99.893   <2e-16 ***
rand_classRand_B  0.009805   0.036578   0.268    0.789    
rand_classRand_C  0.049951   0.044324   1.127    0.260    
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

(Dispersion parameter for binomial family taken to be 1)

    Null deviance: 29272  on 57327  degrees of freedom
Residual deviance: 29271  on 57325  degrees of freedom
AIC: 29277

Number of Fisher Scoring iterations: 5

7.9.1.1.1 `RAND` training errors table

Code

RAND %>%
  broom::glance()

# A tibble: 1 × 8
  null.deviance df.null  logLik    AIC    BIC deviance df.residual  nobs
          <dbl>   <int>   <dbl>  <dbl>  <dbl>    <dbl>       <int> <int>
1        29272.   57327 -14636. 29277. 29304.   29271.       57325 57328

7.9.1.1.2 `RAND` coefficent table

Code

RAND.CoEff <- RAND %>%
 broom::tidy() %>%
    mutate(odds_ratio = if_else(term == '(Intercept)', 
                              as.numeric(NA),
                              round(exp(estimate),4))) %>%
  mutate(model = "RAND")

RAND.CoEff %>%
  knitr::kable()

term	estimate	std.error	statistic	p.value	odds_ratio	model
(Intercept)	-2.5908458	0.0259362	-99.8930631	0.0000000	NA	RAND
rand_classRand_B	0.0098048	0.0365777	0.2680539	0.7886578	1.0099	RAND
rand_classRand_C	0.0499514	0.0443244	1.1269487	0.2597642	1.0512	RAND

Notice that when we filter for significant features we find only the intercept:

Code

RAND %>%
 broom::tidy() %>%
  filter(p.value <= .05) %>%
  mutate(odds_ratio = if_else(term == '(Intercept)', 
                              as.numeric(NA),
                              round(exp(estimate),4)))

# A tibble: 1 × 6
  term        estimate std.error statistic p.value odds_ratio
  <chr>          <dbl>     <dbl>     <dbl>   <dbl>      <dbl>
1 (Intercept)    -2.59    0.0259     -99.9       0         NA

7.9.2 DISCREATE

Code

A_DATA.train %>%
  group_by(DIABETES_factor, Gender,  rand_class, age_bucket) %>%
  summarise(Pop_N = n_distinct(SEQN)) %>%
  ggplot(aes(x = age_bucket , y = rand_class , size = Pop_N,   color=DIABETES_factor)) +
  geom_point() + 
  facet_grid(DIABETES_factor ~ rand_class + Gender)

`summarise()` has grouped output by 'DIABETES_factor', 'Gender', 'rand_class'.
You can override using the `.groups` argument.

Code

Interactive_DISCREATE <- A_DATA.train %>%
  group_by(DIABETES_factor, Gender,  rand_class, age_bucket) %>%
  summarise(Pop_N = n_distinct(SEQN)) %>%
  ggplot(aes(x = age_bucket , y = rand_class , size = Pop_N,   color=DIABETES_factor)) +
  geom_point() + 
  facet_wrap(~Gender)

`summarise()` has grouped output by 'DIABETES_factor', 'Gender', 'rand_class'.
You can override using the `.groups` argument.

Code

Interactive_DISCREATE

Most evident from the above graphs is the effect the age_bucket has on the number of diabetics, we can see that as the age buckets increase from 1 to 10 the size of the population bubble also increases.

We can run an anova test on these features:

Code

aov(DIABETES ~ rand_class + age_bucket + Gender, 
    data = A_DATA.no_miss) %>%
  summary()

               Df Sum Sq Mean Sq  F value  Pr(>F)    
rand_class      2      0    0.03    0.497 0.60861    
age_bucket      9    786   87.35 1507.565 < 2e-16 ***
Gender          1      1    0.54    9.284 0.00231 ** 
Residuals   95534   5535    0.06                     
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Again, our findings here indicated that Gender and age_bucket will have an effect on Diabetes but that rand_class will not.

Code

DISCREATE <- glm( DIABETES_factor ~ age_bucket + Gender,
                     data= A_DATA.train,
                     family = "binomial")

7.9.2.1 DISCREATE Model Output

Let’s take a moment to review the coefficient table for this DISCRETE model:

Code

DISCREATE.coeff <- DISCREATE %>%
  tidy() %>%
  mutate(odds_ratio = if_else(term == '(Intercept)', 
                              as.numeric(NA),
                              round(exp(estimate),4))) %>%
  mutate(model = "DISCREATE")

DISCREATE.coeff %>%
  knitr::kable()

term	estimate	std.error	statistic	p.value	odds_ratio	model
(Intercept)	-8.0285860	0.7073791	-11.349764	0.0000000	NA	DISCREATE
age_bucket2	1.4060198	0.7907009	1.778194	0.0753720	4.0797	DISCREATE
age_bucket3	2.5648976	0.7349907	3.489701	0.0004836	12.9993	DISCREATE
age_bucket4	2.7260585	0.7304389	3.732083	0.0001899	15.2726	DISCREATE
age_bucket5	3.0445348	0.7242929	4.203458	0.0000263	21.0003	DISCREATE
age_bucket6	4.2239857	0.7125924	5.927632	0.0000000	68.3052	DISCREATE
age_bucket7	5.3082911	0.7091607	7.485315	0.0000000	202.0047	DISCREATE
age_bucket8	6.1521485	0.7081562	8.687558	0.0000000	469.7255	DISCREATE
age_bucket9	6.7720159	0.7078288	9.567308	0.0000000	873.0702	DISCREATE
age_bucket10	6.7696177	0.7078238	9.563987	0.0000000	870.9788	DISCREATE
GenderMale	0.1038955	0.0348196	2.983822	0.0028467	1.1095	DISCREATE

As we expected, as you get older your risk of diabetes, increases.

Warning: Unknown or uninitialised column: `age_rang`.

However, these odds ratios may not be as helpful as we would like them to be for interpretation. For instance, right now we can say that when a patient is in the highest age bucket (70 <= Age <= 85) then the odds of having a diabetes increases by 870.9788 times compared to patients in the lowest age bucket ; so having this as the basis of comparison is not as helpful as it could be.

7.9.2.2 re-leveling factors in the DISCREATE model

Sometimes it is helpful to re-leveling factors in R; within the context of logistic regression we will be able to reset our comparison group so that we have a more even basis of comparison - let’s re-level right on the 36 <= Age <= 47 age bucket:

Code

discrete_releveled <- A_DATA.train %>%
  mutate(age_bucket = fct_relevel(age_bucket,'7'))

now we can fit a new logistic regression

Code

DISCREATE_r <- glm( DIABETES_factor ~ age_bucket + Gender,
                     data= discrete_releveled,
                     family = "binomial")

Code

DISCREATE_r.coeff <- DISCREATE_r %>%
 broom::tidy() %>%
    mutate(odds_ratio = if_else(term == '(Intercept)', 
                              as.numeric(NA),
                              round(exp(estimate),4))) %>%
  mutate(model = "Refactored DISCREATE")

Code

DISCREATE_r.coeff <- Age_bucket_Table %>%
  mutate(term = paste0('age_bucket',age_bucket)) %>%
  left_join(DISCREATE_r.coeff) %>%
  mutate(model = "Refactored DISCREATE")

Joining with `by = join_by(term)`

Code

DISCREATE_r.coeff %>%  
  knitr::kable()

age_bucket	mean_age	median_age	age_range	term	estimate	std.error	statistic	p.value	odds_ratio	model
1	2.22	2	1 <= Age <= 4	age_bucket1	-5.3082911	0.7091607	-7.485315	0	0.0050	Refactored DISCREATE
2	6.52	7	4 <= Age <= 9	age_bucket2	-3.9022713	0.3578434	-10.904970	0	0.0202	Refactored DISCREATE
3	11.18	11	9 <= Age <= 13	age_bucket3	-2.7433936	0.2074982	-13.221286	0	0.0644	Refactored DISCREATE
4	15.64	16	13 <= Age <= 18	age_bucket4	-2.5822326	0.1907514	-13.537160	0	0.0756	Refactored DISCREATE
5	21.24	21	18 <= Age <= 26	age_bucket5	-2.2637563	0.1656540	-13.665573	0	0.1040	Refactored DISCREATE
6	31.02	31	26 <= Age <= 36	age_bucket6	-1.0843054	0.1030945	-10.517590	0	0.3381	Refactored DISCREATE
7	41.43	41	36 <= Age <= 47	age_bucket7	NA	NA	NA	NA	NA	Refactored DISCREATE
8	52.36	52	47 <= Age <= 59	age_bucket8	0.8438574	0.0657799	12.828506	0	2.3253	Refactored DISCREATE
9	63.91	64	59 <= Age <= 70	age_bucket9	1.4637248	0.0621667	23.545157	0	4.3220	Refactored DISCREATE
10	77.32	78	70 <= Age <= 85	age_bucket10	1.4613266	0.0620938	23.534191	0	4.3117	Refactored DISCREATE

Notice that age_bucket7 is missing from the above coefficient table and now we have age_bucket1 this is because we re-factored to level 7, so now all of the odds ratios are computed relative to the age group 36 <= Age <= 47.

When the odds ratio is 1, this means the coefficient in the logistic regression model is ln(1) = 0; meaning changes in the feature has no impact on the outcome.

If a patient is in the highest age bucket, then the odds of having an diabetes increases by 4.3117 times compared to patients aged 36 <= Age <= 47.

In addition to looking for large coefficients or odds ratios we can also review for odds ratios which are between 1 and 0 - these features will have the opposite effect; in that having them reduces the odds of the outcome compared to the baseline group.

So for this example, patient’s in the lowest age bracket decreases their likelihood of having diabetes by 99.5% compared to patients aged 36 <= Age <= 47.

7.9.3 INTERACTION

We can get interaction terms by using * notion:

Code

INTERACTION <- glm(DIABETES_factor  ~ Age + Age*Gender ,
                     data= A_DATA.train,
                     family = "binomial")

7.9.3.1 INTERACTION Coefficent Table

Code

INTERACTION


Call:  glm(formula = DIABETES_factor ~ Age + Age * Gender, family = "binomial", 
    data = A_DATA.train)

Coefficients:
   (Intercept)             Age      GenderMale  Age:GenderMale  
     -5.148106        0.054292       -0.156792        0.004703  

Degrees of Freedom: 57327 Total (i.e. Null);  57324 Residual
Null Deviance:      29270 
Residual Deviance: 23090    AIC: 23100

Code

INTERACTION.CoEff <- INTERACTION %>%
 broom::tidy() %>% 
 mutate(odds_ratio = if_else(term == '(Intercept)', 
                              as.numeric(NA),
                              round(exp(estimate),4))) %>%
 mutate(model= "Interaction")

INTERACTION.CoEff %>%
  knitr::kable()

term	estimate	std.error	statistic	p.value	odds_ratio	model
(Intercept)	-5.1481064	0.0743099	-69.278851	0.0000000	NA	Interaction
Age	0.0542923	0.0011942	45.461946	0.0000000	1.0558	Interaction
GenderMale	-0.1567921	0.1074018	-1.459865	0.1443271	0.8549	Interaction
Age:GenderMale	0.0047026	0.0017205	2.733323	0.0062699	1.0047	Interaction

7.10 Compare Models

7.10.1 Coefficent Compairson

Code

CoEff_Compare <- bind_rows(RAND.CoEff,
          INTERACTION.CoEff,
          DISCREATE.coeff,
          DISCREATE_r.coeff)

CoEff_Compare %>%
  glimpse()

Rows: 28
Columns: 11
$ term       <chr> "(Intercept)", "rand_classRand_B", "rand_classRand_C", "(In…
$ estimate   <dbl> -2.590845847, 0.009804784, 0.049951350, -5.148106438, 0.054…
$ std.error  <dbl> 0.025936194, 0.036577659, 0.044324423, 0.074309928, 0.00119…
$ statistic  <dbl> -99.8930631, 0.2680539, 1.1269487, -69.2788510, 45.4619460,…
$ p.value    <dbl> 0.000000e+00, 7.886578e-01, 2.597642e-01, 0.000000e+00, 0.0…
$ odds_ratio <dbl> NA, 1.0099, 1.0512, NA, 1.0558, 0.8549, 1.0047, NA, 4.0797,…
$ model      <chr> "RAND", "RAND", "RAND", "Interaction", "Interaction", "Inte…
$ age_bucket <fct> NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA,…
$ mean_age   <dbl> NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA,…
$ median_age <dbl> NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA,…
$ age_range  <chr> NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA,…

Code

CoEff_Compare %>%
  mutate(Feature = reorder(term, estimate)) %>%
  ggplot(aes(y=Feature, x=estimate, color=model)) +
  geom_point() +
  geom_errorbarh(aes(xmax = estimate + std.error, xmin = estimate - std.error, height = .2)) +
  ggtitle(expression("Comparing model estimates of " ~ beta[i]))

Warning: Removed 1 row containing missing values or values outside the scale range
(`geom_point()`).

Warning: Removed 1 row containing missing values or values outside the scale range
(`geom_errorbarh()`).

Code

CoEff_Compare %>%
  mutate(Feature = reorder(term, odds_ratio)) %>%
  ggplot(aes(y=Feature, x=odds_ratio, color=model)) +
  geom_point() +
  geom_errorbarh(aes(xmax = exp(estimate + std.error), xmin = exp(estimate - std.error), height = .2)) +                  
  ggtitle(expression("Comparing model estimates of Odds Ratios" ~ exp(beta[i])))

Warning: Removed 4 rows containing missing values or values outside the scale range
(`geom_point()`).

Warning: Removed 1 row containing missing values or values outside the scale range
(`geom_errorbarh()`).

Code

CoEff_Compare %>%
  mutate(Feature = reorder(term, p.value)) %>%
  ggplot(aes( y = Feature, x = p.value , fill=model)) +
  geom_bar(stat = 'identity', position = 'dodge') +
  labs(title = "Comparing model Coefficent p-values")

Warning: Removed 1 row containing missing values or values outside the scale range
(`geom_bar()`).

7.10.2 Model Training Errors

Code

model_train_errors <- bind_rows(
  RAND %>% glance() %>% mutate(model = 'RAND'),
  DISCREATE %>% glance() %>% mutate(model = 'DISCREATE'),
  INTERACTION %>% glance() %>% mutate(model = 'INTERACTION')
)

model_train_errors %>%
  select(model, AIC, BIC, logLik, colnames(model_train_errors)) %>%
  arrange(AIC)

# A tibble: 3 × 9
  model      AIC    BIC  logLik null.deviance df.null deviance df.residual  nobs
  <chr>    <dbl>  <dbl>   <dbl>         <dbl>   <int>    <dbl>       <int> <int>
1 DISCRE… 22375. 22474. -11177.        29272.   57327   22353.       57317 57328
2 INTERA… 23102. 23138. -11547.        29272.   57327   23094.       57324 57328
3 RAND    29277. 29304. -14636.        29272.   57327   29271.       57325 57328

7.11 Logit Model Score Helper Function

We have three models (RAND, DISCRETE, & INTERACTION), and in each of the models we need to:

score the training dataset
find a threshold
score the test data, and use threshold to make define predicted classes

If we review the str of the RAND model:

Code

str(RAND,1)

List of 30
 $ coefficients     : Named num [1:3] -2.5908 0.0098 0.05
  ..- attr(*, "names")= chr [1:3] "(Intercept)" "rand_classRand_B" "rand_classRand_C"
 $ residuals        : Named num [1:57328] -1.07 -1.07 -1.07 -1.08 -1.07 ...
  ..- attr(*, "names")= chr [1:57328] "1" "2" "3" "4" ...
 $ fitted.values    : Named num [1:57328] 0.0697 0.0697 0.0697 0.0704 0.0697 ...
  ..- attr(*, "names")= chr [1:57328] "1" "2" "3" "4" ...
 $ effects          : Named num [1:57328] 158.078 -0.219 1.127 -0.275 -0.271 ...
  ..- attr(*, "names")= chr [1:57328] "(Intercept)" "rand_classRand_B" "rand_classRand_C" "" ...
 $ R                : num [1:3, 1:3] -61.3 0 0 -24.5 30 ...
  ..- attr(*, "dimnames")=List of 2
 $ rank             : int 3
 $ qr               :List of 5
  ..- attr(*, "class")= chr "qr"
 $ family           :List of 13
  ..- attr(*, "class")= chr "family"
 $ linear.predictors: Named num [1:57328] -2.59 -2.59 -2.59 -2.58 -2.59 ...
  ..- attr(*, "names")= chr [1:57328] "1" "2" "3" "4" ...
 $ deviance         : num 29271
 $ aic              : num 29277
 $ null.deviance    : num 29272
 $ iter             : int 5
 $ weights          : Named num [1:57328] 0.0649 0.0649 0.0649 0.0654 0.0649 ...
  ..- attr(*, "names")= chr [1:57328] "1" "2" "3" "4" ...
 $ prior.weights    : Named num [1:57328] 1 1 1 1 1 1 1 1 1 1 ...
  ..- attr(*, "names")= chr [1:57328] "1" "2" "3" "4" ...
 $ df.residual      : int 57325
 $ df.null          : int 57327
 $ y                : Named num [1:57328] 0 0 0 0 0 0 0 0 0 0 ...
  ..- attr(*, "names")= chr [1:57328] "1" "2" "3" "4" ...
 $ converged        : logi TRUE
 $ boundary         : logi FALSE
 $ model            :'data.frame':  57328 obs. of  2 variables:
  ..- attr(*, "terms")=Classes 'terms', 'formula'  language DIABETES_factor ~ rand_class
  .. .. ..- attr(*, "variables")= language list(DIABETES_factor, rand_class)
  .. .. ..- attr(*, "factors")= int [1:2, 1] 0 1
  .. .. .. ..- attr(*, "dimnames")=List of 2
  .. .. ..- attr(*, "term.labels")= chr "rand_class"
  .. .. ..- attr(*, "order")= int 1
  .. .. ..- attr(*, "intercept")= int 1
  .. .. ..- attr(*, "response")= int 1
  .. .. ..- attr(*, ".Environment")=<environment: R_GlobalEnv> 
  .. .. ..- attr(*, "predvars")= language list(DIABETES_factor, rand_class)
  .. .. ..- attr(*, "dataClasses")= Named chr [1:2] "factor" "factor"
  .. .. .. ..- attr(*, "names")= chr [1:2] "DIABETES_factor" "rand_class"
 $ call             : language glm(formula = DIABETES_factor ~ rand_class, family = "binomial", data = A_DATA.train)
 $ formula          :Class 'formula'  language DIABETES_factor ~ rand_class
  .. ..- attr(*, ".Environment")=<environment: R_GlobalEnv> 
 $ terms            :Classes 'terms', 'formula'  language DIABETES_factor ~ rand_class
  .. ..- attr(*, "variables")= language list(DIABETES_factor, rand_class)
  .. ..- attr(*, "factors")= int [1:2, 1] 0 1
  .. .. ..- attr(*, "dimnames")=List of 2
  .. ..- attr(*, "term.labels")= chr "rand_class"
  .. ..- attr(*, "order")= int 1
  .. ..- attr(*, "intercept")= int 1
  .. ..- attr(*, "response")= int 1
  .. ..- attr(*, ".Environment")=<environment: R_GlobalEnv> 
  .. ..- attr(*, "predvars")= language list(DIABETES_factor, rand_class)
  .. ..- attr(*, "dataClasses")= Named chr [1:2] "factor" "factor"
  .. .. ..- attr(*, "names")= chr [1:2] "DIABETES_factor" "rand_class"
 $ data             : tibble [57,328 × 7] (S3: tbl_df/tbl/data.frame)
  ..- attr(*, "na.action")= 'omit' Named int [1:5769] 19 30 46 48 51 63 99 119 160 173 ...
  .. ..- attr(*, "names")= chr [1:5769] "19" "30" "46" "48" ...
 $ offset           : NULL
 $ control          :List of 3
 $ method           : chr "glm.fit"
 $ contrasts        :List of 1
 $ xlevels          :List of 1
 - attr(*, "class")= chr [1:2] "glm" "lm"

we notice that there is a $data this is the training data stored in the model object RAND

Code

arsenal::comparedf(A_DATA.train, RAND$data)

Compare Object

Function Call: 
arsenal::comparedf(x = A_DATA.train, y = RAND$data)

Shared: 7 non-by variables and 57328 observations.
Not shared: 0 variables and 0 observations.

Differences found in 0/7 variables compared.
0 variables compared have non-identical attributes.

Rather than performing the same operation 3x in a row we will use that fact the training data exists within the glm model output below to functionalize our scoring procedure from the last chapter, we encourage you to review sections Section 6.12.2 and Section 6.12.3; notice the changes made from there to the function below.

We create a function logit_model_scorer as follows:

Code

logit_model_scorer <- function(my_model, my_data){ 
  # extracts models name 
  my_model_name <- deparse(substitute(my_model))
  
  # store model name into a new column called model
  data.s <- my_data %>%
    mutate(model = my_model_name)
  
  # store the training data someplace
  train_data.s <- my_model$data
  
  # score the training data 
  train_data.s$probs <- predict(my_model, 
                                train_data.s, 
                                'response')
  
  # threshold query
  threshold_value_query <- train_data.s %>% 
                        group_by(DIABETES_factor) %>%
                        summarise(mean_prob = mean(probs, na.rm=TRUE)) %>%
                        ungroup() %>%
                        filter(DIABETES_factor == 1)
  # threshold value                      
  threshold_value <- threshold_value_query$mean_prob
  
  # score test data
  data.s$probs <- predict(my_model, 
                          data.s, 
                          'response')
  
  # use threshold to make prediction  
  data.s <- data.s %>%
    mutate(pred = if_else(probs > threshold_value, 1,0)) %>%
    mutate(pred_factor = as.factor(pred)) %>%
    mutate(pred_factor = relevel(pred_factor, ref='0'))             
  
  # return scored data 
  return(data.s)
}

In the function above we take in two parameters my_model a logistic regression model and my_data an analytical dataset from there we define data.s is the “new data” we wish to make a prediction on and train_data.s is the training data stored within in the model object.

We then follow the steps outlined in the previous chapter to:

Compute a threshold value as the mean probability of the DIABETIC class on the training set.
Use the threshold to make a predicted class on the “new data”.

What does my_model_name <- deparse(substitute(my_model)) do?

Here is the R help for substitute & deparse:

substitute - returns the parse tree for the (unevaluated) expression expr, substituting any variables bound in env.

deparse - Turn unevaluated expressions into character strings.

Let’s explore what happens when we pass our stored model RAND:

Code

substitute(RAND)

RAND

is telling R to parse the expression into an unevaluated expression, RAND; (Note lack of quotes) then

Code

deparse(substitute(RAND))

[1] "RAND"

turns the unevaluated expressions into character string the character sting "RAND" (Note existance of quote) that we need for the input in the mutate step for the model = my_model_name.

Similarly,

Code

substitute(INTERACTION)

INTERACTION

Code

deparse(substitute(INTERACTION))

[1] "INTERACTION"

7.11.1 Test Function

It’s always good practice to try and test your function:

Code

logit_model_scorer(RAND , A_DATA.test) %>%
  glimpse()

Rows: 38,219
Columns: 11
$ SEQN            <dbl> 6917, 81122, 43687, 95673, 47615, 42065, 33500, 3053, …
$ DIABETES        <dbl> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, …
$ DIABETES_factor <fct> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, …
$ Age             <dbl> 11, 37, 1, 80, 34, 59, 53, 31, 9, 59, 36, 78, 73, 52, …
$ age_bucket      <fct> 3, 7, 1, 10, 6, 9, 8, 6, 3, 9, 6, 10, 10, 8, 10, 3, 2,…
$ Gender          <chr> "Male", "Female", "Female", "Male", "Male", "Female", …
$ rand_class      <fct> Rand_A, Rand_A, Rand_C, Rand_B, Rand_C, Rand_A, Rand_B…
$ model           <chr> "RAND", "RAND", "RAND", "RAND", "RAND", "RAND", "RAND"…
$ probs           <dbl> 0.06972989, 0.06972989, 0.07304059, 0.07036860, 0.0730…
$ pred            <dbl> 0, 0, 1, 0, 1, 0, 0, 0, 0, 1, 0, 0, 0, 0, 0, 0, 0, 0, …
$ pred_factor     <fct> 0, 0, 1, 0, 1, 0, 0, 0, 0, 1, 0, 0, 0, 0, 0, 0, 0, 0, …

We see that our function worked it has successfully run our protocol.

7.11.2 Make Predictions

We can deploy our function as follows:

Code

A_DATA.test.scored.compare <- bind_rows(
  logit_model_scorer(RAND , A_DATA.test) ,
  logit_model_scorer(DISCREATE, A_DATA.test),
  logit_model_scorer(INTERACTION , A_DATA.test)
  )

Code

A_DATA.test.scored.compare %>%
  glimpse()

Rows: 114,657
Columns: 11
$ SEQN            <dbl> 6917, 81122, 43687, 95673, 47615, 42065, 33500, 3053, …
$ DIABETES        <dbl> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, …
$ DIABETES_factor <fct> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, …
$ Age             <dbl> 11, 37, 1, 80, 34, 59, 53, 31, 9, 59, 36, 78, 73, 52, …
$ age_bucket      <fct> 3, 7, 1, 10, 6, 9, 8, 6, 3, 9, 6, 10, 10, 8, 10, 3, 2,…
$ Gender          <chr> "Male", "Female", "Female", "Male", "Male", "Female", …
$ rand_class      <fct> Rand_A, Rand_A, Rand_C, Rand_B, Rand_C, Rand_A, Rand_B…
$ model           <chr> "RAND", "RAND", "RAND", "RAND", "RAND", "RAND", "RAND"…
$ probs           <dbl> 0.06972989, 0.06972989, 0.07304059, 0.07036860, 0.0730…
$ pred            <dbl> 0, 0, 1, 0, 1, 0, 0, 0, 0, 1, 0, 0, 0, 0, 0, 0, 0, 0, …
$ pred_factor     <fct> 0, 0, 1, 0, 1, 0, 0, 0, 0, 1, 0, 0, 0, 0, 0, 0, 0, 0, …

7.12 Evaluate models

7.12.1 ROC Curve

Code

library('yardstick')


Attaching package: 'yardstick'

The following object is masked from 'package:readr':

    spec

Code

levels(A_DATA.test.scored.compare$DIABETES_factor)

[1] "0" "1"

Currently our event-level is the second option, 1!

Code

AUC.DM2_Gender <- A_DATA.test.scored.compare %>%
                    group_by(model) %>%
                    roc_auc(truth = DIABETES_factor, probs, event_level = "second") %>%
                    select(model, .estimate) %>%
                    rename(AUC_estimate = .estimate) %>%
                    mutate(model_AUC = paste0(model, " AUC : ", round(AUC_estimate,4)))

AUC.DM2_Gender

# A tibble: 3 × 3
  model       AUC_estimate model_AUC               
  <chr>              <dbl> <chr>                   
1 DISCREATE          0.849 DISCREATE AUC : 0.8493  
2 INTERACTION        0.845 INTERACTION AUC : 0.8452
3 RAND               0.501 RAND AUC : 0.5013

Code

A_DATA.test.scored.compare %>%
  left_join(AUC.DM2_Gender) %>%
  group_by(model_AUC) %>%
  roc_curve(truth= DIABETES_factor, probs, event_level = "second") %>%
  autoplot() +
  labs(title = "ROC Curve " ,
       subtitle = 'test set scored on logistic regression')

Joining with `by = join_by(model)`

7.12.2 Precision-Recall Curve

Code

PR_AUC <- A_DATA.test.scored.compare %>%
  group_by(model) %>%
  pr_auc(DIABETES_factor, probs, event_level='second') %>%
  select(model, .estimate) %>%
  rename(PR_AUC_estimate = .estimate) %>%
  mutate(model_PR_AUC = paste0(model, " PR_AUC : ", round(PR_AUC_estimate,4)))

A_DATA.test.scored.compare %>%
  left_join(PR_AUC) %>%
  arrange(PR_AUC_estimate) %>%
  group_by(model_PR_AUC) %>%
  pr_curve(DIABETES_factor, probs, event_level='second') %>%
  autoplot() +
  labs(title = "PR Curve " ,
       subtitle = 'test set scored on logistic regression')

Joining with `by = join_by(model)`

7.12.3 Gain Curves

The gain curve below for instance showcases that we can review the top 25% of probabilities within the DISCREATE and INTERACTION models and we will find roughtly 75% of the diabetic population.

Code

A_DATA.test.scored.compare %>%
  group_by(model) %>%
  gain_curve(truth = DIABETES_factor, probs, event_level = "second") %>%
  autoplot()

7.12.4 Lift Curves

DISCRETE lift curve is nearly monotonically decreasing from 3.5 to about 3.0 in the the top 25% of probabilities in the model.
INTERACTION lift curve appears “the mode is much less than the mean in that normal distribution”, the DISCREATE model only seems to benefit for the top 12.5% of high probability DISCRETE scores, after that graphs are nearly identical.
RAD appears to perform sightly worse than selecting members at random.

Code

A_DATA.test.scored.compare %>%
  group_by(model) %>%
  lift_curve(truth = DIABETES_factor, probs, event_level = "second") %>%
  autoplot()

7.12.5 Confusion Matrix

Code

CM <- A_DATA.test.scored.compare %>%
  group_by(model) %>%
  conf_mat(DIABETES_factor, pred_factor)

CM %>%
  glimpse()

Rows: 3
Columns: 2
$ model    <chr> "DISCREATE", "INTERACTION", "RAND"
$ conf_mat <list> [29606, 5856, 958, 1799], [31429, 4033, 1519, 1238], [28333, …

Here are the three confusion matrices:

Code

CM$conf_mat

[[1]]
          Truth
Prediction     0     1
         0 29606   958
         1  5856  1799

[[2]]
          Truth
Prediction     0     1
         0 31429  1519
         1  4033  1238

[[3]]
          Truth
Prediction     0     1
         0 28333  2209
         1  7129   548

Let’s get heat maps for the 3 confusion matrices, below the map function at work again; recall a typical call to map looks like map(.x, .f, ...) here:

.x - A list or atomic vector (CM$conf_mat - list of confusion matrices)
.f - A function (autoplot)
... - Additional arguments passed on to the mapped function
- In this case type = "heatmap" are the additional arguments passed into the function .f = autoplot

Notice also, this is not inside a mutate function as we are not adding any variables to our data-frame. We will discuss other functions from purrr in the next chapter.

Code

map(CM$conf_mat, autoplot, type = "heatmap")

[[1]]


[[2]]


[[3]]

Above we see we get three graphs, one for each of the three confusion matrices.

7.12.6 Model Evaluation Metrics

Code

A_DATA.test.scored.compare %>%
  group_by(model) %>%
  conf_mat(truth = DIABETES_factor, pred_factor) %>%
  mutate(sum_conf_mat = map(conf_mat,summary, event_level = "second")) %>%
  unnest(sum_conf_mat) %>%
  select(model, .metric, .estimate) %>%
  ggplot(aes(x = model, y = .estimate, fill = model)) +
  geom_bar(stat='identity', position = 'dodge') +
  facet_wrap(. ~ .metric) +
  theme(axis.text.x = element_text(angle = 90))

7.1 Bin Age

7.2 Add noise feature

7.3 dplyr statistics

7.4 chisq.test

7.5 Gender

7.5.1 p-value review

7.5.2 broom

7.6 Associated \(\chi^2\) Plots

7.6.1 Balloon Plot

7.6.2 vcd::mosaic

7.6.3 corrplot::corrplot

7.7 rand_class chisq.test

7.7.1 plots

7.8 Non-Missing Data

7.8.1 ANOVA

7.8.2 split data

7.9 Train Logistic Regression Models

7.9.1 RAND

7.9.1.1 RAND Model Output

7.9.1.1.1 RAND training errors table

7.9.1.1.2 RAND coefficent table

7.9.2 DISCREATE

7.9.2.1 DISCREATE Model Output

7.9.2.2 re-leveling factors in the DISCREATE model

7.9.3 INTERACTION

7.9.3.1 INTERACTION Coefficent Table

7.10 Compare Models

7.10.1 Coefficent Compairson

7.10.2 Model Training Errors

7.11 Logit Model Score Helper Function

7.11.1 Test Function

7.11.2 Make Predictions

7.12 Evaluate models

7.12.1 ROC Curve

7.12.2 Precision-Recall Curve

7.12.3 Gain Curves

7.12.4 Lift Curves

7.12.5 Confusion Matrix

7.12.6 Model Evaluation Metrics

7.3 `dplyr` statistics

7.4 `chisq.test`

7.5 `Gender`

7.5.2 `broom`

7.6.2 `vcd::mosaic`

7.6.3 `corrplot::corrplot`

7.7 `rand_class` `chisq.test`

7.9.1.1.1 `RAND` training errors table

7.9.1.1.2 `RAND` coefficent table