10 Random Forest Date/Time Modeling Example

Diabetes date/time-stamped data shall be analyzed using the random forest method. The DIABETES data sets are provided for use in 1994 AI in Medicine symposium submissions (AIM-94 data set provided by Michael Kahn, MD, PhD, Washington University, St. Louis, MO) and may be downloaded from https://archive.ics.uci.edu/ml/datasets/Diabetes. The DIABETES data contain:

Data-Codes: a listing of the codes used in the data sets.
Domain-Description: This file describes the basic physiology and patho-physiology of diabetes mellitus and its treatment.
data-[01-70]: data sets covering several weeks’ to months’ worth of outpatient care on 70 patients.

The folder Diabetes-Data is referenced from https://archive.ics.uci.edu/ml/datasets/Diabetes as per the website

10.1 Diabetes Data Set

Diabetes Data Set Download: Data Folder, Data Set Description

Abstract: This diabetes dataset is from AIM ’94

Characteristics of Diabetes Data Set on https://archive.ics.uci.edu/ml/datasets/Diabetes

Data Set Characteristics: Multivariate, Time-Series Number of Instances: ? Area: Life

Attribute Characteristics: Categorical, Integer Number of Attributes: 20 Date Donated Not Reported

Associated Tasks: Missing Values? ? Number of Web Hits:
561559

as reported on of April 2021

Source: Michael Kahn, MD, PhD, Washington University, St. Louis, MO

Characteristics of Diabetes Data Set on https://archive.ics.uci.edu/ml/datasets/Diabetes
Data Set Characteristics:	Multivariate, Time-Series	Number of Instances:	?	Area:	Life
Attribute Characteristics:	Categorical, Integer	Number of Attributes:	20	Date Donated	Not Reported
Associated Tasks:		Missing Values?	?	Number of Web Hits:	561559 as reported on of April 2021

Data Set Information: Diabetes patient records were obtained from two sources: an automatic electronic recording device and paper records. The automatic device had an internal clock to timestamp events, whereas the paper records only provided “logical time” slots (breakfast, lunch, dinner, bedtime). For paper records, fixed times were assigned to breakfast (08:00), lunch (12:00), dinner (18:00), and bedtime (22:00). Thus paper records have fictitious uniform recording times whereas electronic records have more realistic time stamps.

Diabetes files consist of four fields per record. Each field is separated by a tab and each record is separated by a newline.

File format:

Date in MM-DD-YYYY format

Time in XX:YY format

Code

Value

The Code field is deciphered as follows:

33 = Regular insulin dose

34 = NPH insulin dose

35 = UltraLente insulin dose

48 = Unspecified blood glucose measurement

57 = Unspecified blood glucose measurement

58 = Pre-breakfast blood glucose measurement

59 = Post-breakfast blood glucose measurement

60 = Pre-lunch blood glucose measurement

61 = Post-lunch blood glucose measurement

62 = Pre-supper blood glucose measurement

63 = Post-supper blood glucose measurement

64 = Pre-snack blood glucose measurement

65 = Hypoglycemic symptoms

66 = Typical meal ingestion

67 = More-than-usual meal ingestion

68 = Less-than-usual meal ingestion

69 = Typical exercise activity

70 = More-than-usual exercise activity

71 = Less-than-usual exercise activity

72 = Unspecified special event

11 Download Data

Our first task is to use R to download the data.

The code below creates a temporary directory within the current project directory, downloads a compressed file from a URL to that temporary directory, and then extracts the contents of the compressed file into the same temporary directory.

Code

temp_path <- "Temp_DM_Path"

#This line creates a new directory with the path specified by file.path(here::here(),temp_path). The here::here() function returns the path to the current project directory, and file.path combines this with the value of temp_path to create the full path to the new temporary directory.
dir.create(file.path(here::here(),temp_path)) 
  
my_tar_file <- file.path(here::here(),temp_path,'diabetes-data.tar.Z')

# This line creates a variable my_url and assigns it the value of a URL that points to a compressed file.
my_url <- 'https://archive.ics.uci.edu/ml/machine-learning-databases/diabetes/diabetes-data.tar.Z'

# This line downloads the file from the URL specified by my_url and saves it to the local path specified by my_tar_file. The mode = "wb" argument specifies that the file should be saved in binary mode.
download.file(my_url, 
              my_tar_file,
              mode = "wb") 

# This line uses the archive_extract function from the archive package to extract the contents of the compressed file specified by my_tar_file into the directory specified by dir= file.path(here::here(),temp_path), which is the same temporary directory created in step 2.
archive::archive_extract(my_tar_file, 
                         dir= file.path(here::here(),temp_path))

11.1 Load Data

Our analysis begins with file path definitions followed by the read_in_text_from_file_no function used to import the diabetes data.

Code

DM2_TS_path <-  here::here("Temp_DM_Path/Diabetes-Data")

We then create a list of file paths

Code

list_of_file_paths <- list.files(path = DM2_TS_path,
                                 pattern = 'data-',
                                 recursive = TRUE,
                                 full.names = TRUE)
head(list_of_file_paths)

[1] "C:/Users/jkyle/Documents/GitHub/Jeff_R_Data_Wrangling/Temp_DM_Path/Diabetes-Data/data-01"
[2] "C:/Users/jkyle/Documents/GitHub/Jeff_R_Data_Wrangling/Temp_DM_Path/Diabetes-Data/data-02"
[3] "C:/Users/jkyle/Documents/GitHub/Jeff_R_Data_Wrangling/Temp_DM_Path/Diabetes-Data/data-03"
[4] "C:/Users/jkyle/Documents/GitHub/Jeff_R_Data_Wrangling/Temp_DM_Path/Diabetes-Data/data-04"
[5] "C:/Users/jkyle/Documents/GitHub/Jeff_R_Data_Wrangling/Temp_DM_Path/Diabetes-Data/data-05"
[6] "C:/Users/jkyle/Documents/GitHub/Jeff_R_Data_Wrangling/Temp_DM_Path/Diabetes-Data/data-06"

Code

length(list_of_file_paths)

[1] 70

Code

library('stringr')
library('tidyverse')

── Attaching core tidyverse packages ──────────────────────── tidyverse 2.0.0 ──
✔ dplyr     1.1.4     ✔ purrr     1.0.2
✔ forcats   1.0.0     ✔ readr     2.1.5
✔ ggplot2   3.5.1     ✔ tibble    3.2.1
✔ lubridate 1.9.3     ✔ tidyr     1.3.1
── Conflicts ────────────────────────────────────────── tidyverse_conflicts() ──
✖ dplyr::filter() masks stats::filter()
✖ dplyr::lag()    masks stats::lag()
ℹ Use the conflicted package (<http://conflicted.r-lib.org/>) to force all conflicts to become errors

Code

tibble_of_file_paths <- tibble(list_of_file_paths) %>%
   mutate(full_path = list_of_file_paths) %>%
   mutate(file_no = str_replace(string = full_path,
                            pattern = paste0(here::here("Temp_DM_Path/Diabetes-Data"),'/data-'),
                            replacement = '')) 

tibble_of_file_paths |>
  head()

# A tibble: 6 × 3
  list_of_file_paths                                           full_path file_no
  <chr>                                                        <chr>     <chr>  
1 C:/Users/jkyle/Documents/GitHub/Jeff_R_Data_Wrangling/Temp_… C:/Users… 01     
2 C:/Users/jkyle/Documents/GitHub/Jeff_R_Data_Wrangling/Temp_… C:/Users… 02     
3 C:/Users/jkyle/Documents/GitHub/Jeff_R_Data_Wrangling/Temp_… C:/Users… 03     
4 C:/Users/jkyle/Documents/GitHub/Jeff_R_Data_Wrangling/Temp_… C:/Users… 04     
5 C:/Users/jkyle/Documents/GitHub/Jeff_R_Data_Wrangling/Temp_… C:/Users… 05     
6 C:/Users/jkyle/Documents/GitHub/Jeff_R_Data_Wrangling/Temp_… C:/Users… 06

Code

read_in_text_from_file_no <- function(my_file_no){

  enquo_by <- enquo(my_file_no)

  cur_file <- tibble_of_file_paths %>%
    filter(file_no == !!enquo_by)

  cur_path <- as.character(cur_file$full_path)

  dat <- readr::read_tsv(cur_path,
                         col_names = c('Date',"Time","Code","Value"),
                         col_types = cols(col_character(), col_character(), col_character(), col_character())
                         ) %>%
      mutate(file_no = as.numeric(cur_file$file_no))

   return(dat)
}

The goal of using purrr functions instead of for loops is to allow you to break common list manipulation challenges into independent pieces:

How can you solve the problem for a single element of the list? Once you’ve solved that problem, purrr takes care of generalizing your solution to every element in the list.

If you’re solving a complex problem, how can you break it down into bite-sized pieces that allow you to advance one small step towards a solution? With purrr, you get lots of small pieces that you can compose together with the pipe.

This structure makes it easier to solve new problems. It also makes it easier to understand your solutions to old problems when you re-read your old code.

purrr functions are implemented in C making them very fast.

Code

DM2_TIME_RAW <- purrr::map_dfr(tibble_of_file_paths$file_no, # list of columns 
                      function(x){ read_in_text_from_file_no(x) })

Since we have saved the data as an RDS we can remove the temporary files:

Code

unlink(here::here(temp_path), recursive = TRUE)

The data are a matrix (data.frame) of 29,330 rows and 5 columns. The column names are Date (character of mm-dd-yyy), Time (character of h:mm and hh:mm), Code (character of two-digit numbers), Value (character of three-digit numbers), and file_no (number from 1 to 70). The Code number definitions are listed in the R script listing below and are referenced by the Code_Table tibble we define below.

Code

Code_Table <- tribble(
  ~ Code, ~ Code_Description,
  33 , "Regular insulin dose",
  34 , "NPH insulin dose",
  35 , "UltraLente insulin dose",
  48 , "Unspecified blood glucose measurement", # This decode value is the same as 57
  57 , "Unspecified blood glucose measurement", # This decode value is the same as 48
  58 , "Pre-breakfast blood glucose measurement",
  59 , "Post-breakfast blood glucose measurement",
  60 , "Pre-lunch blood glucose measurement",
  61 , "Post-lunch blood glucose measurement",
  62 , "Pre-supper blood glucose measurement",
  63 , "Post-supper blood glucose measurement",
  64 , "Pre-snack blood glucose measurement",
  65 , "Hypoglycemic symptoms",
  66 , "Typical meal ingestion",
  67 , "More-than-usual meal ingestion",
  68 , "Less-than-usual meal ingestion",
  69 , "Typical exercise activity",
  70 , "More-than-usual exercise activity",
  71 , "Less-than-usual exercise activity",
  72 , "Unspecified special event") %>% # What does this mean? 
  mutate(Code_Invalid = if_else(Code %in% c(48,57,72), TRUE, FALSE)) %>%
  mutate(Code_New = if_else(Code_Invalid == TRUE, 999, Code)) 

Code_Table

# A tibble: 20 × 4
    Code Code_Description                         Code_Invalid Code_New
   <dbl> <chr>                                    <lgl>           <dbl>
 1    33 Regular insulin dose                     FALSE              33
 2    34 NPH insulin dose                         FALSE              34
 3    35 UltraLente insulin dose                  FALSE              35
 4    48 Unspecified blood glucose measurement    TRUE              999
 5    57 Unspecified blood glucose measurement    TRUE              999
 6    58 Pre-breakfast blood glucose measurement  FALSE              58
 7    59 Post-breakfast blood glucose measurement FALSE              59
 8    60 Pre-lunch blood glucose measurement      FALSE              60
 9    61 Post-lunch blood glucose measurement     FALSE              61
10    62 Pre-supper blood glucose measurement     FALSE              62
11    63 Post-supper blood glucose measurement    FALSE              63
12    64 Pre-snack blood glucose measurement      FALSE              64
13    65 Hypoglycemic symptoms                    FALSE              65
14    66 Typical meal ingestion                   FALSE              66
15    67 More-than-usual meal ingestion           FALSE              67
16    68 Less-than-usual meal ingestion           FALSE              68
17    69 Typical exercise activity                FALSE              69
18    70 More-than-usual exercise activity        FALSE              70
19    71 Less-than-usual exercise activity        FALSE              71
20    72 Unspecified special event                TRUE              999

Data conditioning, i.e., defining and redefining the data types by column is given in the listing below.

The codes in Code_Table are converted into factor data with code number 72, “Unspecified special event”, removed.

Code

Code_Table_Factor <- Code_Table %>%
  filter(Code_Invalid == FALSE) %>%
  mutate(Code_Factor = factor(paste0("C_",Code_New)))

Code_Table_Factor

# A tibble: 17 × 5
    Code Code_Description                      Code_Invalid Code_New Code_Factor
   <dbl> <chr>                                 <lgl>           <dbl> <fct>      
 1    33 Regular insulin dose                  FALSE              33 C_33       
 2    34 NPH insulin dose                      FALSE              34 C_34       
 3    35 UltraLente insulin dose               FALSE              35 C_35       
 4    58 Pre-breakfast blood glucose measurem… FALSE              58 C_58       
 5    59 Post-breakfast blood glucose measure… FALSE              59 C_59       
 6    60 Pre-lunch blood glucose measurement   FALSE              60 C_60       
 7    61 Post-lunch blood glucose measurement  FALSE              61 C_61       
 8    62 Pre-supper blood glucose measurement  FALSE              62 C_62       
 9    63 Post-supper blood glucose measurement FALSE              63 C_63       
10    64 Pre-snack blood glucose measurement   FALSE              64 C_64       
11    65 Hypoglycemic symptoms                 FALSE              65 C_65       
12    66 Typical meal ingestion                FALSE              66 C_66       
13    67 More-than-usual meal ingestion        FALSE              67 C_67       
14    68 Less-than-usual meal ingestion        FALSE              68 C_68       
15    69 Typical exercise activity             FALSE              69 C_69       
16    70 More-than-usual exercise activity     FALSE              70 C_70       
17    71 Less-than-usual exercise activity     FALSE              71 C_71

Code

library('lubridate')

DM2_TIME <- DM2_TIME_RAW %>%
  mutate(Date_Time_Char = paste0(Date, ' ', Time)) %>%
  mutate(Date_Time = mdy_hm(Date_Time_Char))  %>%
  mutate(Value_Num = as.numeric(Value)) %>%
  mutate(Code = as.numeric(Code)) %>%
  mutate(Record_ID = row_number()) %>%
  left_join(Code_Table_Factor)

Warning: There was 1 warning in `mutate()`.
ℹ In argument: `Date_Time = mdy_hm(Date_Time_Char)`.
Caused by warning:
!  45 failed to parse.

Warning: There was 1 warning in `mutate()`.
ℹ In argument: `Value_Num = as.numeric(Value)`.
Caused by warning:
! NAs introduced by coercion

Joining with `by = join_by(Code)`

Code

DM2_TIME

# A tibble: 29,330 × 13
   Date   Time   Code Value file_no Date_Time_Char Date_Time           Value_Num
   <chr>  <chr> <dbl> <chr>   <dbl> <chr>          <dttm>                  <dbl>
 1 04-21… 9:09     58 100         1 04-21-1991 9:… 1991-04-21 09:09:00       100
 2 04-21… 9:09     33 009         1 04-21-1991 9:… 1991-04-21 09:09:00         9
 3 04-21… 9:09     34 013         1 04-21-1991 9:… 1991-04-21 09:09:00        13
 4 04-21… 17:08    62 119         1 04-21-1991 17… 1991-04-21 17:08:00       119
 5 04-21… 17:08    33 007         1 04-21-1991 17… 1991-04-21 17:08:00         7
 6 04-21… 22:51    48 123         1 04-21-1991 22… 1991-04-21 22:51:00       123
 7 04-22… 7:35     58 216         1 04-22-1991 7:… 1991-04-22 07:35:00       216
 8 04-22… 7:35     33 010         1 04-22-1991 7:… 1991-04-22 07:35:00        10
 9 04-22… 7:35     34 013         1 04-22-1991 7:… 1991-04-22 07:35:00        13
10 04-22… 13:40    33 002         1 04-22-1991 13… 1991-04-22 13:40:00         2
# ℹ 29,320 more rows
# ℹ 5 more variables: Record_ID <int>, Code_Description <chr>,
#   Code_Invalid <lgl>, Code_New <dbl>, Code_Factor <fct>

The result is the raw, imported data in DM2_TIME_RAW are conditioned, assigned to object DM2_TIME, and DM2_TIME_RAW is removed from the computing environment. The conditioning combines the dates in the Date column with the times in the Time are combined into a date time string, Date_Time_Char (mm-dd-yyyy h:mm or hh:mm), which then is converted into the date time format yyyy-mm-dd hh:mm:ss in column Date_Time. The numeric value of the character Value is assigned to column Value_Num. The character form of the codes in Code are redefined as numeric. A new column, Record_ID, is assigned the row number (1 to 29,330).

To ascertain if the complete records are indeed complete, a bar plot showing the percent complete for value number (blue), date and time (green), and code (red) is shown in Figure \(\ref{fig:Per_Complete}\). The abscissa is the percent complete from 0 to 100.

Code

Features_Percent_Complete <- function(data, Percent_Complete = 0){
  
  SumNa <- function(col){sum(is.na(col))}
  
  na_sums <- data %>% 
    summarise_all(SumNa) %>%
    tidyr::pivot_longer(everything() ,names_to = 'feature', values_to = 'SumNa') %>%
    arrange(-SumNa) %>%
    mutate(PctNa = SumNa/nrow(data)) %>%
    mutate(PctComp = (1 - PctNa)*100)
  
  data_out <- na_sums %>%
    filter(PctComp >= Percent_Complete)
  
return(data_out)
}

Code

Feature_Percent_Complete_Graph <- function(data, Percent_Complete = 0 ){
  table <- Features_Percent_Complete(data, Percent_Complete)
  
  plot1 <- table %>%
    mutate(feature = reorder(feature, PctComp)) %>%
    ggplot(aes(x=feature, y =PctComp, fill=feature)) +
    geom_bar(stat = "identity") +
    coord_flip() +
    theme(legend.position = "none")
  return(plot1)
}

Code

Feature_Percent_Complete_Graph(data = DM2_TIME %>% select(Date_Time, Value_Num, Code_Factor) )

Examination of the data reveal not all records are complete. The section of code identifies these records after joining the data with the scrubbed code table data. The unknown measurement data are assigned to the Unknown_Measurements object.

Code

UnKnown_Measurements <- DM2_TIME %>%
  filter(is.na(Code_Factor) | is.na(Date_Time) | is.na(Value_Num) | Code_Invalid == TRUE | Code_New == 999 )

UnKnown_Measurements

# A tibble: 3,171 × 13
   Date   Time   Code Value file_no Date_Time_Char Date_Time           Value_Num
   <chr>  <chr> <dbl> <chr>   <dbl> <chr>          <dttm>                  <dbl>
 1 04-21… 22:51    48 123         1 04-21-1991 22… 1991-04-21 22:51:00       123
 2 04-24… 22:09    48 340         1 04-24-1991 22… 1991-04-24 22:09:00       340
 3 04-25… 21:54    48 288         1 04-25-1991 21… 1991-04-25 21:54:00       288
 4 04-28… 22:30    48 200         1 04-28-1991 22… 1991-04-28 22:30:00       200
 5 04-29… 22:28    48 081         1 04-29-1991 22… 1991-04-29 22:28:00        81
 6 04-30… 23:06    48 104         1 04-30-1991 23… 1991-04-30 23:06:00       104
 7 05-03… 22:08    48 085         1 05-03-1991 22… 1991-05-03 22:08:00        85
 8 05-04… 21:40    48 063         1 05-04-1991 21… 1991-05-04 21:40:00        63
 9 05-05… 22:18    48 282         1 05-05-1991 22… 1991-05-05 22:18:00       282
10 05-08… 22:45    48 129         1 05-08-1991 22… 1991-05-08 22:45:00       129
# ℹ 3,161 more rows
# ℹ 5 more variables: Record_ID <int>, Code_Description <chr>,
#   Code_Invalid <lgl>, Code_New <dbl>, Code_Factor <fct>

Similarly, complete records are assigned to the Known_Measurements object. The data in this object are plotted (Figure \(\ref{fig:known_measures}\)) with Value_Num on the ordinate and Date_Time (year) on the abscissa. The legend is color coded by Code.

Code

Known_Measurements <- DM2_TIME %>%
  anti_join(UnKnown_Measurements %>% select(Record_ID))

Joining with `by = join_by(Record_ID)`

Code

Known_Measurements

# A tibble: 26,159 × 13
   Date   Time   Code Value file_no Date_Time_Char Date_Time           Value_Num
   <chr>  <chr> <dbl> <chr>   <dbl> <chr>          <dttm>                  <dbl>
 1 04-21… 9:09     58 100         1 04-21-1991 9:… 1991-04-21 09:09:00       100
 2 04-21… 9:09     33 009         1 04-21-1991 9:… 1991-04-21 09:09:00         9
 3 04-21… 9:09     34 013         1 04-21-1991 9:… 1991-04-21 09:09:00        13
 4 04-21… 17:08    62 119         1 04-21-1991 17… 1991-04-21 17:08:00       119
 5 04-21… 17:08    33 007         1 04-21-1991 17… 1991-04-21 17:08:00         7
 6 04-22… 7:35     58 216         1 04-22-1991 7:… 1991-04-22 07:35:00       216
 7 04-22… 7:35     33 010         1 04-22-1991 7:… 1991-04-22 07:35:00        10
 8 04-22… 7:35     34 013         1 04-22-1991 7:… 1991-04-22 07:35:00        13
 9 04-22… 13:40    33 002         1 04-22-1991 13… 1991-04-22 13:40:00         2
10 04-22… 16:56    62 211         1 04-22-1991 16… 1991-04-22 16:56:00       211
# ℹ 26,149 more rows
# ℹ 5 more variables: Record_ID <int>, Code_Description <chr>,
#   Code_Invalid <lgl>, Code_New <dbl>, Code_Factor <fct>

We see the three variables are 100% complete within the Known_Measurements:

Code

Feature_Percent_Complete_Graph(data = Known_Measurements %>% select(Date_Time, Value_Num, Code_Factor) )

Known Measurements Feature % Complete Graph

11.2 View Data

Code

Known_Measurements %>%
  ggplot(aes(x=Date_Time, y=Value_Num, color=Code_Factor)) +
  geom_point()

11.3 UnKnown Measurements

We can take a look at the UnKnown Measurements below, R will give us a warning that there are 86 which either have missing Date_Time or missing Value_Num.

Code

UnKnown_Measurements %>%
  ggplot(aes(x=Date_Time, y=Value_Num)) +
  geom_point()

Warning: Removed 86 rows containing missing values or values outside the scale range
(`geom_point()`).

We will make it our prediction task to classify the UnKnown_Measurements.

11.4 Split Training Data

Code

set.seed(123)
train_records <- sample(Known_Measurements$Record_ID, 
                      size = nrow(Known_Measurements)*.6, 
                      replace = FALSE)


train <- Known_Measurements %>%
  filter(Record_ID %in% train_records)

test <- Known_Measurements %>%
  filter(!(Record_ID %in% train_records))

Code

train %>% 
  ggplot(aes(x=Date_Time, y=Value_Num, color=Code_Factor)) +
  geom_point()

Code

test %>% 
  ggplot(aes(x=Date_Time, y=Value_Num, color=Code_Factor)) +
  geom_point()

11.5 Fit Model 1

Code

library('randomForest')

randomForest 4.7-1.1

Type rfNews() to see new features/changes/bug fixes.


Attaching package: 'randomForest'

The following object is masked from 'package:dplyr':

    combine

The following object is masked from 'package:ggplot2':

    margin

Code

rf_model <- randomForest(Code_Factor ~ Date_Time + Value_Num,
                         data = train,
                         ntree = 659,
                         importance = TRUE)

11.5.1 Random Forest Ouput

Here’s the output of Random Forest

Code

rf_model


Call:
 randomForest(formula = Code_Factor ~ Date_Time + Value_Num, data = train,      ntree = 659, importance = TRUE) 
               Type of random forest: classification
                     Number of trees: 659
No. of variables tried at each split: 1

        OOB estimate of  error rate: 36.11%
Confusion matrix:
     C_33 C_34 C_35 C_58 C_59 C_60 C_61 C_62 C_63 C_64 C_65 C_66 C_67 C_68 C_69
C_33 5377  231   59    0    0    0    0    2    0    0    1    1    3    0    0
C_34  669 1586   59    4    0    2    0    1    0    0    0    0    0    0    0
C_35  167   88  398    0    0    0    0    0    0    0    0    0    0    0    0
C_58    0    5    3  969    1  473    6  551   21   79    0    0    0    0    0
C_59    0    0    0    3    0    3    0    5    0    0    0    0    0    0    0
C_60    0   10    1  555    2  534    1  465    9   66    0    0    0    0    0
C_61    0    0    1   12    0    7    4   12    0    4    0    0    0    0    0
C_62    0    8    0  651    2  440    1  680   10   85    0    0    0    0    0
C_63    0    0    0   45    0   25    0   35   26    5    0    0    0    0    0
C_64    0    5    0  137    0   85    0  127    7  173    0    1    0    0    0
C_65    0    0    0    0    0    0    0    0    0    0   47    4  143    0    0
C_66    1    0    0    0    0    0    0    0    0    0    3   66   27    0    0
C_67    1    0    0    0    0    0    0    0    0    0   32    1  167    0    0
C_68    0    0    0    0    0    0    0    0    0    0    3    2   16    0    0
C_69    0    0    0    0    0    0    0    0    0    0    3   11   25    0    0
C_70    0    0    0    0    0    0    0    0    0    0    4    1   85    0    0
C_71    0    0    0    0    0    0    0    0    0    0    8    0   47    0    0
     C_70 C_71 class.error
C_33    0    0  0.05234403
C_34    0    0  0.31667385
C_35    0    0  0.39050536
C_58    0    0  0.54032258
C_59    0    0  1.00000000
C_60    0    0  0.67498478
C_61    0    0  0.90000000
C_62    0    0  0.63771977
C_63    0    0  0.80882353
C_64    0    0  0.67663551
C_65    0    0  0.75773196
C_66    0    0  0.31958763
C_67    0    0  0.16915423
C_68    0    0  1.00000000
C_69    0    0  1.00000000
C_70    0    0  1.00000000
C_71    0    0  1.00000000

The Confusion Matrix that we see above is the confusion matrix from fitting the model against the training data so it is overly optimistic in terms of predictions.

Random Forest is a collection of Decision Trees, we can look at the individual trees in the forest by using one of the two following functions.

Code

library(ggraph)
library(igraph)


Attaching package: 'igraph'

The following objects are masked from 'package:lubridate':

    %--%, union

The following objects are masked from 'package:dplyr':

    as_data_frame, groups, union

The following objects are masked from 'package:purrr':

    compose, simplify

The following object is masked from 'package:tidyr':

    crossing

The following object is masked from 'package:tibble':

    as_data_frame

The following objects are masked from 'package:stats':

    decompose, spectrum

The following object is masked from 'package:base':

    union

Code

plot_rf_tree <- function(final_model, tree_num, shorten_label = TRUE) {
  # get tree by index
  tree <- randomForest::getTree(final_model, 
                                k = tree_num, 
                                labelVar = TRUE) %>%
    tibble::rownames_to_column() %>%
    # make leaf split points to NA, so the 0s won't get plotted
    mutate(`split point` = ifelse(is.na(prediction), `split point`, NA))
  
  # prepare data frame for graph
  graph_frame <- data.frame(from = rep(tree$rowname, 2),
                            to = c(tree$`left daughter`, tree$`right daughter`))
  
  # convert to graph and delete the last node that we don't want to plot
  graph <- graph_from_data_frame(graph_frame) %>%
    delete_vertices("0")
  
  # set node labels
  V(graph)$node_label <- gsub("_", " ", as.character(tree$`split var`))
  
  if (shorten_label) {
    V(graph)$leaf_label <- substr(as.character(tree$prediction), 1, 1)
    }
  
  V(graph)$split <- as.character(round(tree$`split point`, digits = 2))
  
  # plot
  plot <- ggraph(graph, 'tree') + 
    theme_graph() +
    geom_edge_link() +
    geom_node_point() +
    geom_node_label(aes(label = leaf_label, fill = leaf_label), na.rm = TRUE, 
                    repel = FALSE, colour = "white",
                    show.legend = FALSE)
  
  print(plot)
}

Code

plot_rf_tree(rf_model, 456)

Random Forest models normally come equipped with feature importances which we can access with the varImpPlot function in this instance:

Code

varImpPlot(rf_model)

For instance, above we see that Date_Time is more important in terms classifying the measurement as opposed to Value_Num

We can also extract the feature importances for each of the classes:

Code

caret::varImp(rf_model, scale = FALSE) %>% 
  t() %>%  # transpose the entire dataframe
  as_tibble(rownames = 'Class' ) # turn it back to a tibble

# A tibble: 17 × 3
   Class Date_Time Value_Num
   <chr>     <dbl>     <dbl>
 1 C_33   0.0969     0.517  
 2 C_34   0.264      0.509  
 3 C_35   0.373      0.507  
 4 C_58   0.0481     0.246  
 5 C_59  -0.000723  -0.00106
 6 C_60   0.0413     0.194  
 7 C_61   0.0603     0.0727 
 8 C_62   0.0287     0.205  
 9 C_63   0.118      0.131  
10 C_64   0.174      0.203  
11 C_65   0.0711     0.378  
12 C_66   0.464      0.537  
13 C_67   0.119      0.555  
14 C_68  -0.00104    0.00475
15 C_69   0.00961    0.0252 
16 C_70   0.0133     0.0212 
17 C_71   0.00645    0.00820

The Random Forest model also has a plot associated with it:

Code

plot(rf_model)

Let’s see if we can figure out what is going on with the above graph. We can extract the err.rate and ntree from the model object and place them in a matplot to match the above graph.

Code

err <- rf_model$err.rate

N_trees <- 1:rf_model$ntree

matplot(N_trees , err, type = 'l', xlab="trees", ylab="Error")

derivation of plot of randomForest model output

We can improve this a bit by performing the computations ourselves and keeping track of things:

Code

err2 <- as_tibble(err, rownames='n_trees')

err2 %>%
  glimpse()

Rows: 659
Columns: 19
$ n_trees <chr> "1", "2", "3", "4", "5", "6", "7", "8", "9", "10", "11", "12",…
$ OOB     <dbl> 0.3888985, 0.4010195, 0.4031778, 0.4038520, 0.3991778, 0.39185…
$ C_33    <dbl> 0.09125475, 0.09435626, 0.08716876, 0.08636837, 0.08526068, 0.…
$ C_34    <dbl> 0.2818713, 0.3272194, 0.3405500, 0.3448990, 0.3397964, 0.33502…
$ C_35    <dbl> 0.4140625, 0.4623116, 0.4863732, 0.4646840, 0.4636678, 0.45928…
$ C_58    <dbl> 0.6084724, 0.6100235, 0.6091371, 0.5932394, 0.5891270, 0.58295…
$ C_59    <dbl> 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1,…
$ C_60    <dbl> 0.6745363, 0.6883768, 0.7014563, 0.6958763, 0.6922034, 0.68802…
$ C_61    <dbl> 0.8823529, 0.8695652, 0.9062500, 0.9166667, 0.9166667, 0.94444…
$ C_62    <dbl> 0.6921986, 0.6770186, 0.6747851, 0.6957352, 0.6849882, 0.67257…
$ C_63    <dbl> 0.8837209, 0.8354430, 0.8446602, 0.8547009, 0.8412698, 0.85937…
$ C_64    <dbl> 0.7313433, 0.7507692, 0.7500000, 0.7707424, 0.7479508, 0.76247…
$ C_65    <dbl> 0.5774648, 0.6504065, 0.4697987, 0.4756098, 0.3815029, 0.47777…
$ C_66    <dbl> 0.2564103, 0.3000000, 0.5769231, 0.6470588, 0.5617978, 0.48351…
$ C_67    <dbl> 0.5000000, 0.3931624, 0.6466667, 0.6190476, 0.6032609, 0.53968…
$ C_68    <dbl> 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1,…
$ C_69    <dbl> 0.9333333, 0.9166667, 0.9354839, 0.9696970, 1.0000000, 0.97142…
$ C_70    <dbl> 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1,…
$ C_71    <dbl> 1.0000000, 1.0000000, 1.0000000, 1.0000000, 1.0000000, 0.98076…

Code

predicted_levels <- colnames(err2)[!colnames(err2) %in% c('n_trees','OOB')]
predicted_levels

 [1] "C_33" "C_34" "C_35" "C_58" "C_59" "C_60" "C_61" "C_62" "C_63" "C_64"
[11] "C_65" "C_66" "C_67" "C_68" "C_69" "C_70" "C_71"

Code

plot <- err2 %>%
  select(-OOB) %>%
  pivot_longer(all_of(predicted_levels), names_to = "Class", values_to = "Error") %>%
  mutate(N_Trees = as.numeric(n_trees)) %>%
  ggplot(aes(x=N_Trees, y=Error, color=Class)) +
  geom_point() +
  geom_smooth(method = lm, formula = y ~ splines::bs(x, 3), se = FALSE) 
  
plot

Improved plot of base randomForest model output

Code

better_rand_forest_errors <- function(rf_model_obj){
  err <- rf_model_obj$err.rate
  
  err2 <- as_tibble(err, rownames='n_trees')
  
  predicted_levels <- colnames(err2)[!colnames(err2) %in% c('n_trees','OOB')]
  
  plot <- err2 %>%
  select(-OOB) %>%
  pivot_longer(all_of(predicted_levels), names_to = "Class", values_to = "Error") %>%
  mutate(N_Trees = as.numeric(n_trees)) %>%
  ggplot(aes(x=N_Trees, y=Error, color=Class)) +
  geom_point() +
  geom_smooth(method = lm, formula = y ~ splines::bs(x, 3), se = FALSE) 
  
  return(plot)
}

11.6 Evaluate Model Performace

Code

test_Scored <- test %>%
  mutate(model = 'base')

First, we will get the expected classes for the test data:

Code

test_Scored$pred <- predict(rf_model, test , 'response')

test_Scored %>%
  glimpse()

Rows: 10,464
Columns: 15
$ Date             <chr> "04-21-1991", "04-21-1991", "04-22-1991", "04-22-1991…
$ Time             <chr> "9:09", "9:09", "7:35", "7:35", "7:25", "7:29", "7:29…
$ Code             <dbl> 58, 34, 58, 34, 58, 58, 33, 34, 33, 62, 33, 62, 33, 3…
$ Value            <chr> "100", "013", "216", "013", "257", "067", "009", "014…
$ file_no          <dbl> 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1,…
$ Date_Time_Char   <chr> "04-21-1991 9:09", "04-21-1991 9:09", "04-22-1991 7:3…
$ Date_Time        <dttm> 1991-04-21 09:09:00, 1991-04-21 09:09:00, 1991-04-22…
$ Value_Num        <dbl> 100, 13, 216, 13, 257, 67, 9, 14, 2, 228, 7, 256, 8, …
$ Record_ID        <int> 1, 3, 7, 9, 13, 25, 26, 27, 32, 37, 38, 42, 43, 45, 5…
$ Code_Description <chr> "Pre-breakfast blood glucose measurement", "NPH insul…
$ Code_Invalid     <lgl> FALSE, FALSE, FALSE, FALSE, FALSE, FALSE, FALSE, FALS…
$ Code_New         <dbl> 58, 34, 58, 34, 58, 58, 33, 34, 33, 62, 33, 62, 33, 3…
$ Code_Factor      <fct> C_58, C_34, C_58, C_34, C_58, C_58, C_33, C_34, C_33,…
$ model            <chr> "base", "base", "base", "base", "base", "base", "base…
$ pred             <fct> C_60, C_35, C_58, C_35, C_62, C_63, C_33, C_33, C_33,…

Code

library('yardstick')


Attaching package: 'yardstick'

The following object is masked from 'package:readr':

    spec

Code

test_Scored %>% 
  conf_mat(truth = Code_Factor, pred) %>%
  autoplot('mosaic')

Code

test_Scored %>%
  conf_mat(truth= Code_Factor, pred) %>%
  summary() %>%
  ggplot(aes(x=.metric, y=.estimate, fill= .metric)) +
  geom_bar(stat='identity', position = 'dodge') +
  coord_flip()

Warning: While computing multiclass `precision()`, some levels had no predicted events
(i.e. `true_positive + false_positive = 0`).
Precision is undefined in this case, and those levels will be removed from the
averaged result.
Note that the following number of true events actually occurred for each
problematic event level:
'C_68': 13, 'C_69': 29, 'C_70': 49, 'C_71': 43
While computing multiclass `precision()`, some levels had no predicted events
(i.e. `true_positive + false_positive = 0`).
Precision is undefined in this case, and those levels will be removed from the
averaged result.
Note that the following number of true events actually occurred for each
problematic event level:
'C_68': 13, 'C_69': 29, 'C_70': 49, 'C_71': 43

Warning: Removed 1 row containing missing values or values outside the scale range
(`geom_bar()`).

Code

test_Scored %>%
  mutate(Predict_Correct = if_else(Code_Factor == pred, TRUE, FALSE)) %>%
  ggplot(aes(x=Date_Time, y=Value_Num, color=Predict_Correct)) +
  geom_point()

Graph of Correct Predictions on test data under base model

11.7 Correlated Probabilites

First, let’s review the probabilities:

Code

Probs <- predict(rf_model, test_Scored, 'prob') %>%
  as_tibble() 

Probs

# A tibble: 10,464 × 17
   C_33  C_34  C_35  C_58  C_59  C_60  C_61  C_62  C_63  C_64  C_65  C_66  C_67 
   <mat> <mat> <mat> <mat> <mat> <mat> <mat> <mat> <mat> <mat> <mat> <mat> <mat>
 1 0.00… 0.00… 0.00… 0.02… 0     0.55… 0     0.37… 0.00… 0.03… 0     0     0    
 2 0.25… 0.00… 0.74… 0.00… 0     0.00… 0     0.00… 0.00… 0.00… 0     0     0    
 3 0.00… 0.00… 0.00… 0.69… 0     0.16… 0     0.13… 0.00… 0.00… 0     0     0    
 4 0.25… 0.01… 0.73… 0.00… 0     0.00… 0     0.00… 0.00… 0.00… 0     0     0    
 5 0.00… 0.00… 0.00… 0.31… 0     0.20… 0     0.42… 0.00… 0.05… 0     0     0    
 6 0.00… 0.00… 0.00… 0.15… 0     0.28… 0     0.21… 0.28… 0.05… 0     0     0    
 7 0.98… 0.01… 0.00… 0.00… 0     0.00… 0     0.00… 0.00… 0.00… 0     0     0    
 8 0.71… 0.27… 0.01… 0.00… 0     0.00… 0     0.00… 0.00… 0.00… 0     0     0    
 9 0.98… 0.01… 0.00… 0.00… 0     0.00… 0     0.00… 0.00… 0.00… 0     0     0    
10 0.00… 0.00… 0.00… 0.43… 0     0.37… 0     0.04… 0.00… 0.14… 0     0     0    
# ℹ 10,454 more rows
# ℹ 4 more variables: C_68 <matrix>, C_69 <matrix>, C_70 <matrix>,
#   C_71 <matrix>

Code

test_Scored <- cbind(test_Scored, Probs)

Code

GGally::ggcorr(Probs,
               method = c("pairwise", "pearson"),
               label = TRUE)

Registered S3 method overwritten by 'GGally':
  method from   
  +.gg   ggplot2

Code

library('corrr')
corr_data <- correlate(Probs,
  use = "pairwise.complete.obs",
  method = "pearson",)

Correlation computed with
• Method: 'pearson'
• Missing treated using: 'pairwise.complete.obs'

Code

corr_data %>%
  glimpse()

Rows: 17
Columns: 18
$ term <chr> "C_33", "C_34", "C_35", "C_58", "C_59", "C_60", "C_61", "C_62", "…
$ C_33 <dbl> NA, -0.19163898, -0.10441614, -0.54956761, -0.05396652, -0.511840…
$ C_34 <dbl> -0.19163898, NA, 0.04132957, -0.28371748, -0.02844334, -0.2606538…
$ C_35 <dbl> -0.10441614, 0.04132957, NA, -0.14865214, -0.01483954, -0.1393353…
$ C_58 <dbl> -0.54956761, -0.28371748, -0.14865214, NA, 0.03561936, 0.36949568…
$ C_59 <dbl> -0.053966522, -0.028443345, -0.014839537, 0.035619361, NA, 0.0124…
$ C_60 <dbl> -0.51184049, -0.26065388, -0.13933531, 0.36949568, 0.01240024, NA…
$ C_61 <dbl> -0.097743416, -0.050999090, -0.023185268, 0.081489225, -0.0013966…
$ C_62 <dbl> -0.54217945, -0.27453263, -0.14798142, 0.39238423, 0.03940021, 0.…
$ C_63 <dbl> -0.164511288, -0.086079561, -0.045169014, 0.105786617, 0.01077281…
$ C_64 <dbl> -0.29676777, -0.14955872, -0.08073562, 0.18846421, 0.05790872, 0.…
$ C_65 <dbl> -0.16073437, -0.08587324, -0.04520394, -0.12428883, -0.01218820, …
$ C_66 <dbl> -0.066317537, -0.038630694, -0.020294164, -0.056104543, -0.005503…
$ C_67 <dbl> -0.16848325, -0.09130119, -0.04765189, -0.13115047, -0.01286117, …
$ C_68 <dbl> -0.029782995, -0.015318848, -0.010093123, -0.027725505, -0.002723…
$ C_69 <dbl> -0.042576706, -0.026037345, -0.013980500, -0.038633843, -0.003788…
$ C_70 <dbl> -0.108371422, -0.059014542, -0.030955034, -0.085714612, -0.008407…
$ C_71 <dbl> -0.083580597, -0.048190858, -0.026590249, -0.073846899, -0.007241…

Code

test_Scored %>%
  conf_mat(truth = Code_Factor, pred) %>%
  autoplot('heatmap')

One of the things to note above is that classes C_68, C_69, C_70, C_71 are never predicted.

Likewise the probabilities of C_65, C_70, and C_67 are highly correlated:

Code

corr_data %>%
   filter_at(vars(all_of(colnames(Probs))), any_vars( . > .6 ))

# A tibble: 3 × 18
  term    C_33    C_34    C_35    C_58     C_59    C_60    C_61    C_62    C_63
  <chr>  <dbl>   <dbl>   <dbl>   <dbl>    <dbl>   <dbl>   <dbl>   <dbl>   <dbl>
1 C_65  -0.161 -0.0859 -0.0452 -0.124  -0.0122  -0.116  -0.0221 -0.123  -0.0372
2 C_67  -0.168 -0.0913 -0.0477 -0.131  -0.0129  -0.122  -0.0233 -0.129  -0.0392
3 C_70  -0.108 -0.0590 -0.0310 -0.0857 -0.00841 -0.0798 -0.0152 -0.0846 -0.0256
# ℹ 8 more variables: C_64 <dbl>, C_65 <dbl>, C_66 <dbl>, C_67 <dbl>,
#   C_68 <dbl>, C_69 <dbl>, C_70 <dbl>, C_71 <dbl>

And there are a number of classes which do not have many training examples:

Code

train %>%
  group_by(Code_Factor) %>%
  tally() %>%
  ungroup() %>%
  mutate(n = as.numeric(n)) %>%
  full_join(Code_Table_Factor) %>%
  mutate(n = if_else(is.na(n) , 0 , n)) %>%
  arrange(n) %>%
  print(n=20)

Joining with `by = join_by(Code_Factor)`

# A tibble: 17 × 6
   Code_Factor     n  Code Code_Description                Code_Invalid Code_New
   <fct>       <dbl> <dbl> <chr>                           <lgl>           <dbl>
 1 C_59           11    59 Post-breakfast blood glucose m… FALSE              59
 2 C_68           21    68 Less-than-usual meal ingestion  FALSE              68
 3 C_69           39    69 Typical exercise activity       FALSE              69
 4 C_61           40    61 Post-lunch blood glucose measu… FALSE              61
 5 C_71           55    71 Less-than-usual exercise activ… FALSE              71
 6 C_70           90    70 More-than-usual exercise activ… FALSE              70
 7 C_66           97    66 Typical meal ingestion          FALSE              66
 8 C_63          136    63 Post-supper blood glucose meas… FALSE              63
 9 C_65          194    65 Hypoglycemic symptoms           FALSE              65
10 C_67          201    67 More-than-usual meal ingestion  FALSE              67
11 C_64          535    64 Pre-snack blood glucose measur… FALSE              64
12 C_35          653    35 UltraLente insulin dose         FALSE              35
13 C_60         1643    60 Pre-lunch blood glucose measur… FALSE              60
14 C_62         1877    62 Pre-supper blood glucose measu… FALSE              62
15 C_58         2108    58 Pre-breakfast blood glucose me… FALSE              58
16 C_34         2321    34 NPH insulin dose                FALSE              34
17 C_33         5674    33 Regular insulin dose            FALSE              33

To aide us further, we will perform a k-means clustering on the probabilities to get a sense of how many classes we should have. Our first step in this endeavor will be to create a sample of roughly 60 % of the available probabilities:

Code

Probs_sample <- Probs %>%
  rowwise() %>%
  mutate(rand_num = runif(1)) %>%
  ungroup() %>%
  filter(rand_num > .4) %>%
  select(-rand_num) # this should sample roughly 60% of the Probs data 

nrow(Probs_sample)/nrow(Probs)

[1] 0.5952791

Code

library('broom')
proc.kmeans <- function(data, k = 1:9){

kclusts <-
  tibble(k = k) %>%
  mutate(
    kclust = map(k, ~kmeans(data, .x)),
    glanced = map(kclust, glance),
    augmented = map(kclust, augment, data)
  )

kmeans_model <- function(final_k = 5){
   (kclusts %>%
     filter(k == final_k))$kclust[[1]]
}

id_vars <- colnames(kclusts)

cluster_assignments <- kclusts %>%
  unnest(cols = augmented)

vars <- setdiff(colnames(cluster_assignments), c(id_vars,'.cluster'))

within_cluster_variation_plot <- kclusts %>%
  unnest(cols = c(glanced)) %>%
  ggplot(aes(k, tot.withinss))  +
  geom_line() +
  geom_point() +
  scale_x_continuous(breaks = scales::pretty_breaks(n = max(k))) +
  labs(title = "Within Cluster Variation versus number of Clusters")

kmeans.pca <- cluster_assignments %>%
  select(all_of(vars)) %>%
  prcomp()

cluster_assignments_plot <- kmeans.pca %>%
  augment(cluster_assignments) %>%
  ggplot( aes(x = .fittedPC1, y = .fittedPC2, color = .cluster )) +
  geom_point(alpha = 0.8) +
  facet_wrap(~ k) +
  labs(title = "k-means clusterings")

return(list(cluster_assignments_plot= cluster_assignments_plot,
            within_cluster_variation_plot=within_cluster_variation_plot,
            kmeans_model = kmeans_model))
}

Code

proc.kmeans.results <- proc.kmeans(Probs_sample, k=3:11)
proc.kmeans.results

$cluster_assignments_plot


$within_cluster_variation_plot


$kmeans_model
function(final_k = 5){
   (kclusts %>%
     filter(k == final_k))$kclust[[1]]
}
<bytecode: 0x0000021c32f71438>
<environment: 0x0000021c284025c0>

Code

proc.kmeans.results$kmeans_model(6) %>%
  augment(Probs_sample) %>%
  rownames_to_column("id") %>%
  pivot_longer(cols = c(-.cluster, -id), names_to = "class", values_to = "prob") %>%
  mutate(prob = as.numeric(prob)) %>%
  arrange(desc(prob)) %>%
  ggplot(aes(x=class, y=prob, color = .cluster)) +
  geom_point() +
  facet_wrap(~.cluster)

Code

proc.kmeans.results$kmeans_model(6) %>%
  augment(Probs_sample) %>%
  rownames_to_column("id") %>%
  pivot_longer(cols = c(-.cluster, -id), names_to = "class", values_to = "prob") %>%
  mutate(prob = as.numeric(prob)) %>%
  arrange(desc(prob)) %>%
  filter(.cluster == 5) %>%
  ggplot(aes(x=class, y=prob, color = .cluster)) +
  geom_point()

Code

rm(proc.kmeans.results)

saveRDS(rf_model, here::here("DATA/Part_5/models/rf_model.RDS"))

rm(rf_model)

11.8 Make Adjustments

Our previous model has some issues, which we can attempt to correct.

Code

Code_Table_Factor_2 <- Code_Table_Factor %>%
  mutate(Code_Description_2 = case_when(
    Code %in% c(33, 34, 35) ~ "insulin dose",
    Code %in% c(58, 60, 62, 64) ~ "pre-meal",
    Code %in% c(59, 61, 63) ~ "post-meal",
    Code %in% c(66, 67, 68, 65) ~ "meal ingestion / Hypoglycemic",
    Code %in% c(69, 70, 71) ~ "exercise activity")
    ) %>%
  mutate(Code_Description_2 = factor(Code_Description_2)) %>%
  select(Code, Code_Description_2)

Code_Table_Factor_2

# A tibble: 17 × 2
    Code Code_Description_2           
   <dbl> <fct>                        
 1    33 insulin dose                 
 2    34 insulin dose                 
 3    35 insulin dose                 
 4    58 pre-meal                     
 5    59 post-meal                    
 6    60 pre-meal                     
 7    61 post-meal                    
 8    62 pre-meal                     
 9    63 post-meal                    
10    64 pre-meal                     
11    65 meal ingestion / Hypoglycemic
12    66 meal ingestion / Hypoglycemic
13    67 meal ingestion / Hypoglycemic
14    68 meal ingestion / Hypoglycemic
15    69 exercise activity            
16    70 exercise activity            
17    71 exercise activity

Code

train_2 <- train %>% left_join(Code_Table_Factor_2)

Joining with `by = join_by(Code)`

Code

rf_model_2 <- randomForest(Code_Description_2 ~ Date_Time + Value_Num,
                         data = train_2,
                         ntree = 659,
                         importance = TRUE)

Code

better_rand_forest_errors(rf_model_2)

Code

test_Scored2 <- test %>%
  left_join(Code_Table_Factor_2) %>% 
  mutate(model = 'new')

Joining with `by = join_by(Code)`

Code

test_Scored2$pred <- predict(rf_model_2, test_Scored2 , 'response')

conf_mat.test_Scored2 <- test_Scored2 %>%
  conf_mat(truth = Code_Description_2, pred)

conf_mat.test_Scored2 %>%
  autoplot('heatmap')

Code

saveRDS(rf_model_2, here::here('DATA/Part_5/models/rf_model_2.RDS'))

Code

base <- test_Scored %>% conf_mat(truth= Code_Factor, pred) %>% summary() %>% mutate(model = 'base')

Warning: While computing multiclass `precision()`, some levels had no predicted events
(i.e. `true_positive + false_positive = 0`).
Precision is undefined in this case, and those levels will be removed from the
averaged result.
Note that the following number of true events actually occurred for each
problematic event level:
'C_68': 13, 'C_69': 29, 'C_70': 49, 'C_71': 43
While computing multiclass `precision()`, some levels had no predicted events
(i.e. `true_positive + false_positive = 0`).
Precision is undefined in this case, and those levels will be removed from the
averaged result.
Note that the following number of true events actually occurred for each
problematic event level:
'C_68': 13, 'C_69': 29, 'C_70': 49, 'C_71': 43

Code

new <- test_Scored2 %>% conf_mat(truth= Code_Description_2, pred) %>% summary() %>% mutate(model = 'new')

Warning: While computing multiclass `precision()`, some levels had no predicted events
(i.e. `true_positive + false_positive = 0`).
Precision is undefined in this case, and those levels will be removed from the
averaged result.
Note that the following number of true events actually occurred for each
problematic event level:
'exercise activity': 121

Warning: While computing multiclass `precision()`, some levels had no predicted events
(i.e. `true_positive + false_positive = 0`).
Precision is undefined in this case, and those levels will be removed from the
averaged result.
Note that the following number of true events actually occurred for each
problematic event level:
'exercise activity': 121

Code

comp <- bind_rows(base,
                  new)

comp %>%
  ggplot(aes(x=model, y=.estimate, fill= model)) +
  geom_bar(stat='identity', position = 'dodge') +
  coord_flip() +
  facet_wrap(.metric ~ .)

Warning: Removed 2 rows containing missing values or values outside the scale range
(`geom_bar()`).

Model Evaluation Metrics - base Vs new model

11.9 Make Adjustments, Again

Let’s make another pass at this, this time we will group together

Code

Code_Table_Factor_3 <- Code_Table_Factor %>%
  mutate(Code_Description_3 = case_when(
    Code %in% c(33, 34, 35) ~ "insulin dose",
    Code %in% c(58, 60, 62, 64, 59, 61, 63) ~ "pre / post meal",
    Code %in% c(66, 67, 68, 65, 69, 70, 71) ~ "meal ingestion / Hypoglycemic / exercise activity")) %>%
  mutate(Code_Description_3 = factor(Code_Description_3)) %>%
  select(Code, Code_Description_3)

Code_Table_Factor_3

# A tibble: 17 × 2
    Code Code_Description_3                               
   <dbl> <fct>                                            
 1    33 insulin dose                                     
 2    34 insulin dose                                     
 3    35 insulin dose                                     
 4    58 pre / post meal                                  
 5    59 pre / post meal                                  
 6    60 pre / post meal                                  
 7    61 pre / post meal                                  
 8    62 pre / post meal                                  
 9    63 pre / post meal                                  
10    64 pre / post meal                                  
11    65 meal ingestion / Hypoglycemic / exercise activity
12    66 meal ingestion / Hypoglycemic / exercise activity
13    67 meal ingestion / Hypoglycemic / exercise activity
14    68 meal ingestion / Hypoglycemic / exercise activity
15    69 meal ingestion / Hypoglycemic / exercise activity
16    70 meal ingestion / Hypoglycemic / exercise activity
17    71 meal ingestion / Hypoglycemic / exercise activity

Code

train_3 <- train %>% left_join(Code_Table_Factor_3)

Joining with `by = join_by(Code)`

Code

rf_model_3 <- randomForest(Code_Description_3 ~ Date_Time + Value_Num,
                         data = train_3,
                         ntree = 659,
                         importance = TRUE)

Code

rm(train_3)

Code

better_rand_forest_errors(rf_model_3)

Code

test_Scored3 <- test %>%
  left_join(Code_Table_Factor_3) %>% 
  mutate(model = 'final')

Joining with `by = join_by(Code)`

Code

test_Scored3$pred <- predict(rf_model_3, test_Scored3 , 'response')

conf_mat.test_Scored3 <- test_Scored3 %>%
  conf_mat(truth = Code_Description_3, pred)

conf_mat.test_Scored3 %>%
  autoplot('heatmap')

Code

saveRDS(rf_model_3, here::here('DATA/Part_5/models/rf_model_3.RDS'))
rm(rf_model_3)

11.10 Extract More Data

Code

library('ggplot2')
library('ggdendro')
library('tree')


Attaching package: 'tree'

The following object is masked from 'package:igraph':

    tree

Code

tree_1 <- tree::tree(Code_Factor ~ Date_Time + Value_Num, 
              data = train)

Warning in tree::tree(Code_Factor ~ Date_Time + Value_Num, data = train): NAs
introduced by coercion

Code

plot_dendro_tree <- function(model){

  tree_data <- dendro_data(model)

p <- segment(tree_data) %>%
  ggplot() +
  geom_segment(aes(x = x, 
                   y = y, 
                   xend = xend, 
                   yend = yend, 
                   size = n), 
               colour = "blue", alpha = 0.5) +
  scale_size("n") +
  geom_text(data = label(tree_data), 
            aes(x = x, y = y, label = label), vjust = -0.5, size = 3) +
  geom_text(data = leaf_label(tree_data), 
            aes(x = x, y = y, label = label), vjust = 0.5, size = 2) +
  theme_dendro()
p
}

plot_dendro_tree(tree_1)

Warning: Using `size` aesthetic for lines was deprecated in ggplot2 3.4.0.
ℹ Please use `linewidth` instead.

Code

tree_2 <- tree::tree(Code_Factor ~ Date_Time + Value_Num + h + day + month , 
              data = train_2 %>%
                mutate(h = factor(lubridate::hour(Date_Time))) %>%
                select(Code_Factor, Date_Time, Value_Num, h) %>%
                mutate(day = lubridate::day(Date_Time)) %>%
                mutate(month = month(Date_Time, label = TRUE)) 
)

Warning in tree::tree(Code_Factor ~ Date_Time + Value_Num + h + day + month, :
NAs introduced by coercion

Code

plot_dendro_tree(tree_2)

Code

rf_model_4 <- randomForest(Code_Description_2 ~ Date_Time + Value_Num  + h + day + month,
                         data = train_2 %>%
                           mutate(h = factor(lubridate::hour(Date_Time))) %>%
                           select(Code_Description_2, Code_Factor, Date_Time, Value_Num, h) %>%
                           mutate(day = lubridate::day(Date_Time)) %>%
                           mutate(month = month(Date_Time, label = TRUE)),
                         ntree = 659,
                         importance = TRUE)

Code

better_rand_forest_errors(rf_model_4)

Code

test_Scored4 <- test %>%
  left_join(Code_Table_Factor_2) %>% 
  mutate(model = 'final') %>%
  mutate(h = factor(lubridate::hour(Date_Time))) %>%
  mutate(day = lubridate::day(Date_Time)) %>%
  mutate(month = month(Date_Time, label = TRUE)) %>%
  select(Code_Description_2, Code_Factor, Date_Time, Value_Num, day, month, h)

Joining with `by = join_by(Code)`

Code

test_Scored4$pred <- predict(rf_model_4, test_Scored4 , 'response')

conf_mat.test_Scored4 <- test_Scored4 %>%
  conf_mat(truth = Code_Description_2, pred)

Code

slim <- conf_mat.test_Scored3 %>% summary() %>% mutate(model = 'slim')
final <- conf_mat.test_Scored4 %>% summary() %>% mutate(model = 'final')

comp_f <- bind_rows(comp,
                    slim,
                    final)

comp_f %>%
  ggplot(aes(x=model, y=.estimate, fill = model)) +
  geom_bar(stat='identity', position = 'dodge') + 
  coord_flip() +
  facet_wrap(.metric ~ .)

Warning: Removed 2 rows containing missing values or values outside the scale range
(`geom_bar()`).

Model Evaluation Metrics - base Vs new model Vs Final

11.11 Predict UnKnown Data

Code

keep = c('UnKnown_Measurements','rf_model','rf_model_2','rf_model_3','rf_model_4','keep')

rm(list = setdiff(ls(), keep))

rf_model <- readRDS(here::here('DATA/Part_5/models/rf_model.RDS'))
rf_model_2 <- readRDS(here::here('DATA/Part_5/models/rf_model_2.RDS'))
rf_model_3 <- readRDS(here::here('DATA/Part_5/models/rf_model_3.RDS'))

Code

UnKnown_Measurements$pred_base <- predict(rf_model, UnKnown_Measurements , 'response')
UnKnown_Measurements$pred_new <- predict(rf_model_2, UnKnown_Measurements , 'response')
UnKnown_Measurements$pred_slim <- predict(rf_model_3, UnKnown_Measurements , 'response')

UnKnown_Measurements <- UnKnown_Measurements %>%
  mutate(h = factor(lubridate::hour(Date_Time))) %>%
  mutate(day = lubridate::day(Date_Time)) %>%
  mutate(month = month(Date_Time, label = TRUE))
  

UnKnown_Measurements$pred_final <- predict(rf_model_4, UnKnown_Measurements, 'response')

Code

UnKnown_Measurements %>%
  ggplot(aes(x=Date_Time, y=Value_Num, color=pred_base)) +
  geom_point()

Warning: Removed 86 rows containing missing values or values outside the scale range
(`geom_point()`).

Code

UnKnown_Measurements %>%
  ggplot(aes(x=Date_Time, y=Value_Num, color=pred_new)) +
  geom_point()

Warning: Removed 86 rows containing missing values or values outside the scale range
(`geom_point()`).

Code

UnKnown_Measurements %>%
  ggplot(aes(x=Date_Time, y=Value_Num, color=pred_slim)) +
  geom_point()

Warning: Removed 86 rows containing missing values or values outside the scale range
(`geom_point()`).

Code

UnKnown_Measurements %>%
  ggplot(aes(x=Date_Time, y=Value_Num, color=pred_final)) +
  geom_point()

Warning: Removed 86 rows containing missing values or values outside the scale range
(`geom_point()`).

10.1 Diabetes Data Set

11 Download Data

11.1 Load Data

11.2 View Data

11.3 UnKnown Measurements

11.4 Split Training Data

11.5 Fit Model 1

11.5.1 Random Forest Ouput

11.6 Evaluate Model Performace

11.7 Correlated Probabilites

11.8 Make Adjustments

11.9 Make Adjustments, Again

11.10 Extract More Data

11.11 Predict UnKnown Data