32 Random Forest Imputation and SVM Multi-class Classifier

\(~\)

32.1 `NHANES` data

#install.packages('LiblineaR')


library('tidyverse')

── Attaching core tidyverse packages ──────────────────────── tidyverse 2.0.0 ──
✔ dplyr     1.1.4     ✔ readr     2.1.5
✔ forcats   1.0.0     ✔ stringr   1.5.1
✔ ggplot2   3.5.1     ✔ tibble    3.2.1
✔ lubridate 1.9.3     ✔ tidyr     1.3.1
✔ purrr     1.0.2     
── Conflicts ────────────────────────────────────────── tidyverse_conflicts() ──
✖ dplyr::filter() masks stats::filter()
✖ dplyr::lag()    masks stats::lag()
ℹ Use the conflicted package (<http://conflicted.r-lib.org/>) to force all conflicts to become errors

library(NHANES)

NHANES_DATA_12 <- NHANES %>%
  filter(!is.na(Depressed))

32.1.1 Note that `Depressed` is has 3 potiential classes

NHANES_DATA_12 %>% 
  select(Depressed) %>%
  distinct()

# A tibble: 3 × 1
  Depressed
  <fct>    
1 Several  
2 None     
3 Most

\(~\)

32.2 Some data prep

SumNa <- function(col){sum(is.na(col))}

data.sum <- NHANES_DATA_12 %>% 
  summarise_all(SumNa) %>%
  tidyr::gather(key='feature', value='SumNa') %>%
  arrange(-SumNa) %>%
  mutate(PctNa = SumNa/nrow(NHANES_DATA_12))

data.sum2 <- data.sum %>% 
  filter(! (feature %in% c('ID','Depressed'))) %>%
  filter(PctNa < .85)

data.sum2$feature

 [1] "UrineFlow2"      "UrineVol2"       "AgeRegMarij"     "PregnantNow"    
 [5] "Age1stBaby"      "nBabies"         "nPregnancies"    "SmokeAge"       
 [9] "AgeFirstMarij"   "SmokeNow"        "Testosterone"    "AgeMonths"      
[13] "TVHrsDay"        "Race3"           "CompHrsDay"      "PhysActiveDays" 
[17] "SexOrientation"  "AlcoholDay"      "SexNumPartYear"  "Marijuana"      
[21] "RegularMarij"    "SexAge"          "SexNumPartnLife" "HardDrugs"      
[25] "SexEver"         "SameSex"         "AlcoholYear"     "HHIncome"       
[29] "HHIncomeMid"     "Poverty"         "UrineFlow1"      "BPSys1"         
[33] "BPDia1"          "AgeDecade"       "DirectChol"      "TotChol"        
[37] "BPSys2"          "BPDia2"          "Education"       "BPSys3"         
[41] "BPDia3"          "MaritalStatus"   "Smoke100"        "Smoke100n"      
[45] "Alcohol12PlusYr" "BPSysAve"        "BPDiaAve"        "Pulse"          
[49] "BMI_WHO"         "BMI"             "Weight"          "HomeRooms"      
[53] "Height"          "HomeOwn"         "UrineVol1"       "SleepHrsNight"  
[57] "LittleInterest"  "DaysPhysHlthBad" "DaysMentHlthBad" "Diabetes"       
[61] "Work"            "SurveyYr"        "Gender"          "Age"            
[65] "Race1"           "HealthGen"       "SleepTrouble"    "PhysActive"

data_F <- NHANES_DATA_12 %>% 
  select(ID, Depressed, data.sum2$feature) %>%
  filter(!is.na(Depressed))

32.2.1 note that `data_F` still has missing values

Amelia::missmap(as.data.frame(data_F))

\(~\)

32.3 Random Forest Impute with `rfImpute`

library('randomForest')

randomForest 4.7-1.1

Type rfNews() to see new features/changes/bug fixes.


Attaching package: 'randomForest'

The following object is masked from 'package:dplyr':

    combine

The following object is masked from 'package:ggplot2':

    margin

data_F.imputed <- rfImpute(Depressed ~ . , 
                           data_F, 
                           iter=2, 
                           ntree=300)

ntree      OOB      1      2      3
  300:   8.41%  0.88% 37.86% 31.82%
ntree      OOB      1      2      3
  300:   8.69%  1.24% 37.26% 33.25%

32.3.0.1 Note we no longer have missing data

Amelia::missmap(as.data.frame(data_F.imputed))

\(~\)

32.4 Split Data

library('caret')

Loading required package: lattice


Attaching package: 'caret'

The following object is masked from 'package:purrr':

    lift

set.seed(8576309)

trainIndex <- createDataPartition(data_F.imputed$Depressed, 
                                  p = .6, 
                                  list = FALSE, 
                                  times = 1)

TRAIN <- data_F.imputed[trainIndex, ]
TEST <- data_F.imputed[-trainIndex, ]

\(~\)

32.5 Train model

To see list of models compatible and tuneable hyperparameters within the caret package you can visit: https://topepo.github.io/caret/available-models.html

SVMGrid <- expand.grid(cost = c(1,2,5), Loss = c(1,5,6))

train_ctrl <- trainControl(method="cv", # type of resampling in this case Cross-Validated
                           number=3)

toc <- Sys.time()
model_svm <- train(Depressed ~ .,
                       data = TRAIN,
                       method = "svmLinear3", # this will use the svmLinear3 library
                       metric = "Accuracy", # which metric should be optimized for 
                       trControl = train_ctrl,
                       tuneGrid = SVMGrid) 
tic <- Sys.time()

tic - toc

Time difference of 38.81407 secs

32.5.1 Model output

model_svm

L2 Regularized Support Vector Machine (dual) with Linear Kernel 

4005 samples
  69 predictor
   3 classes: 'None', 'Several', 'Most' 

No pre-processing
Resampling: Cross-Validated (3 fold) 
Summary of sample sizes: 2670, 2669, 2671 
Resampling results across tuning parameters:

  cost  Loss  Accuracy   Kappa      
  1     1     0.6631711  0.013007576
  1     5     0.5357612  0.004280989
  1     6     0.6512719  0.048603102
  2     1     0.7565349  0.051560339
  2     5     0.4650565  0.070901994
  2     6     0.6824747  0.025679637
  5     1     0.7835188  0.004424675
  5     5     0.1959271  0.023878197
  5     6     0.7860176  0.000000000

Accuracy was used to select the optimal model using the largest value.
The final values used for the model were cost = 5 and Loss = 6.

varImp(model_svm)

ROC curve variable importance

  variables are sorted by maximum importance across the classes
  only 20 most important variables shown (out of 69)

                  None Several   Most
DaysMentHlthBad 100.00   70.34 100.00
LittleInterest   90.65   57.64  90.65
Age1stBaby       78.42   50.99  78.42
SmokeNow         57.83   38.10  57.83
HealthGen        51.05   30.48  51.05
HHIncome         45.06   26.34  45.06
Poverty          43.94   25.47  43.94
nPregnancies     41.70   30.68  41.70
HHIncomeMid      39.67   23.78  39.67
Education        39.44   25.66  39.44
DaysPhysHlthBad  34.63   19.10  34.63
SleepTrouble     33.58   19.72  33.58
Work             33.02   23.56  33.02
SleepHrsNight    32.03   17.33  32.03
Testosterone     31.72   22.65  31.72
HomeRooms        31.38   17.07  31.38
nBabies          31.13   17.79  31.13
PregnantNow      31.13   18.41  31.13
TVHrsDay         30.26   27.76  30.26
PhysActive       29.62   15.26  29.62

plot(varImp(model_svm), top = 20)

32.1 NHANES data

32.1.1 Note that Depressed is has 3 potiential classes

32.2 Some data prep

32.2.1 note that data_F still has missing values

32.3 Random Forest Impute with rfImpute

32.3.0.1 Note we no longer have missing data

32.4 Split Data

32.5 Train model

32.5.1 Model output

32.1 `NHANES` data

32.1.1 Note that `Depressed` is has 3 potiential classes

32.2.1 note that `data_F` still has missing values

32.3 Random Forest Impute with `rfImpute`