Overview

In this course we will review some of the tools of the trade, namely R’s tidyverse (Wickham and Grolemund 2017; Winter 2019), a collection of R packages designed with a common framework to aid in common data wrangling and data management tasks.

Data wrangling is one subset of skills within the data science process. We will carefully investigate how decisions made while collecting and preparing the data have downstream effects on model performance.

Analysis is worthless if it goes uncommunicated. Stakeholders need regular, up-to-date information to act upon. Luckily, RMarkdown (Xie, Dervieux, and Riederer 2020; Xie, Allaire, and Grolemund 2018) and Quarto, with the aid of the knitr (Xie 2022, 2015) and shiny packages, let R integrate seamlessly into reporting tasks, including:

HTML web pages
Dashboards
PowerPoint presentations
Word documents
Excel workbooks
PDF reports
Journal Submissions
Books (Xie, Allaire, and Grolemund 2018; Xie 2016; Mora 2018)

RStudio features a Visual Markdown Editor, which is very nice if you want to edit reports or documents and prefer a “what-you-see-is-what-you-get” interface over code. I find flipping between the two handy when I am wondering what a document might look like before rendering it; it has been a time saver in developing this text!

Finally, combined with R’s modeling capabilities, the entire data science process, from data ingestion to modeling to package development and version control, can all be managed nicely from an RStudio console.

We will go through what some might call a boilerplate pass, walking through how to get started with these various tools to solve common data questions.

The primary resource is an SQLite database made from downloading various files from the NHANES data source:

https://wwwn.cdc.gov/nchs/nhanes/Default.aspx

The goal here is to give the experience of connecting to and working with a database in R: collecting data from a database, running statistical analyses, and making inferences as to which features would perform well when fitting predictive models.
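As a minimal sketch of that workflow, assuming the DBI and RSQLite packages are installed, the example below connects to a database, loads a table, and runs a query. The in-memory database and the `DEMO` table with its `SEQN`, `AGE`, and `GENDER` columns are hypothetical stand-ins, not the actual NHANES database built for this course.

```r
library(DBI)

# Hypothetical stand-in data; the real course database is built from NHANES files.
demo <- data.frame(
  SEQN   = 1:6,
  AGE    = c(34, 51, 67, 29, 45, 72),
  GENDER = c("Female", "Male", "Female", "Male", "Female", "Male")
)

# Connect to a temporary in-memory SQLite database
# (a file path would replace ":memory:" for a database stored on disk).
con <- dbConnect(RSQLite::SQLite(), ":memory:")

# Load the stand-in table and list what the database contains.
dbWriteTable(con, "DEMO", demo)
tables <- dbListTables(con)

# Let the database do the aggregation and return only the summary to R.
age_by_gender <- dbGetQuery(
  con,
  "SELECT GENDER, COUNT(*) AS n, AVG(AGE) AS mean_age
     FROM DEMO
    GROUP BY GENDER
    ORDER BY GENDER"
)

# Always release the connection when finished.
dbDisconnect(con)

print(age_by_gender)
```

Pushing the `GROUP BY` into SQL means only the small summary table travels back to R, which is the habit that matters once the tables are too large to pull into memory.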

PART I - Welcome to `R`

In the first part of the text, we will cover getting started with R, install the packages that we will use throughout the rest of the book, and introduce the tidyverse.

2 Getting Started with R
3 The Tidyverse

PART II - Feature Engineering & Data Visualization

In this part we will define a few features, targets, and other data points of interest, including Gender, Age, Diabetic status, and Age at Diabetes. We will briefly use these features to discuss plotting with ggplot2 in R.

4 Feature Engineering
5 The Anatomy of ggplot

PART III - From Exploratory Data Analysis to Predictive Models

We use our data set with the few features we defined in the last part and review statistical tests such as the t-test, the KS test, the chi-square test, and ANOVA. We will showcase the relationship between the p-values of statistical tests and the corresponding model accuracy. We discuss two-factor classification with:

6 Two Factor Classification with a Single Continuous Feature - A single continuous feature
7 Two Factor Classification with Categorical and Continuous Interactions - categorical features and interactions
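The tests named above are all available in base R. The sketch below runs each of them against a two-level target; the data are simulated, and the variable names `diabetic`, `age`, and `gender` are illustrative assumptions rather than the course's actual features. It only collects p-values and does not fit a model.

```r
set.seed(2024)

# Simulated (hypothetical) data: a two-level target and two candidate features.
n <- 200
diabetic <- factor(rep(c("No", "Yes"), each = n / 2))
age      <- c(rnorm(n / 2, mean = 45, sd = 12), rnorm(n / 2, mean = 60, sd = 12))
gender   <- factor(sample(c("Female", "Male"), n, replace = TRUE))

# Continuous feature vs. two-level target: compare means and whole distributions.
t_res  <- t.test(age ~ diabetic)
ks_res <- ks.test(age[diabetic == "No"], age[diabetic == "Yes"])

# Categorical feature vs. two-level target: test of independence.
chi_res <- chisq.test(table(gender, diabetic))

# One-way ANOVA of the continuous feature across the target's levels.
aov_res <- summary(aov(age ~ diabetic))

# Collect the p-values side by side.
p_values <- c(
  t_test     = t_res$p.value,
  ks_test    = ks_res$p.value,
  chi_square = chi_res$p.value,
  anova      = aov_res[[1]][["Pr(>F)"]][1]
)
print(signif(p_values, 3))
```

Here `age` is simulated with a built-in difference between the two groups, while `gender` is drawn independently of the target, so the two kinds of feature should behave differently under their respective tests.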

PART IV - Data Analytics at Scale

It is unrealistic to expect that we will have only 3 or 4 features to review; we need to understand how to make R work for us.

We have provided features over three domains of interest:

16.1 Demographic Features - Demographic - Feature Engineering
16.2 Lab Features - Labs - Feature Engineering
16.3 Examination Features - Examination - Feature Engineering

For the most part, these features are mapped in with methods similar to those we used in Part II; however, there are some issues when dealing with the Lab data.

We will use these domains to create a new analytic data set with hundreds of columns and then discuss:

8 Functional dbplyr, purrr, and furrr
9 Exploratory Data Analysis at Scale
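To give a flavor of what "at scale" means here, the sketch below applies one statistical test to every column of a wide data set with purrr rather than writing the test out column by column. It assumes the purrr package is installed; the simulated data and the `feature_XX` column names are hypothetical and stand in for the much wider analytic data set built in this part.

```r
library(purrr)

set.seed(2024)

# Hypothetical analytic data set: one two-level target plus many numeric features.
n <- 300
target <- factor(rep(c("No", "Yes"), each = n / 2))
features <- as.data.frame(matrix(rnorm(n * 50), nrow = n))
names(features) <- sprintf("feature_%02d", seq_along(features))

# Make one feature genuinely informative so the ranking has something to find.
features$feature_01 <- features$feature_01 + ifelse(target == "Yes", 1, 0)

# Apply the same t-test to every column and keep only the p-value.
p_values <- map_dbl(features, function(x) t.test(x ~ target)$p.value)

# Rank features from smallest to largest p-value.
ranked <- sort(p_values)
print(head(ranked, 5))
```

Because the per-column work is expressed as a single function, the same call can later be swapped for a parallel counterpart such as `furrr::future_map_dbl()` once a `future` plan is set, without restructuring the analysis.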

PART V - Time Series and Random Forest

10 Random Forest Date/Time Modeling Example - Diabetic monitoring; time-series classification

PART VI - Communication, Packages, & Git

A large aspect of data science is communicating results to stakeholders. In this part we will introduce Shiny and flex_dashboards and discuss the options available when we knit an R Markdown file.

11 Quarto & RMarkdown
12 Shiny
13 Create an R package
14 Git

This work was completed using R version 4.4.1 (R Core Team 2024a) with the following R packages: Amelia v. 1.8.2 (Honaker, King, and Blackwell 2011), AMR v. 2.1.1 (Berends et al. 2022), archive v. 1.1.8 (Hester and Csárdi 2024), arsenal v. 3.6.3 (Heinzen et al. 2021), caret v. 6.0.94 (Kuhn 2008), clue v. 0.3.65 (Hornik 2005, 2023), corrplot v. 0.92 (Wei and Simko 2021), corrr v. 0.4.4 (Kuhn, Jackson, and Cimentada 2022), credentials v. 2.0.1 (Ooms 2023), DBI v. 1.2.3 (R Special Interest Group on Databases (R-SIG-DB), Wickham, and Müller 2024), DiagrammeR v. 1.0.11 (Iannone and Roy 2024), duckdb v. 1.0.0 (Mühleisen and Raasveldt 2024), furrr v. 0.3.1 (Vaughan and Dancho 2022), future v. 1.33.2, GGally v. 2.2.1 (Schloerke et al. 2024), ggdendro v. 0.2.0 (de Vries and Ripley 2024), ggraph v. 2.2.1 (Pedersen 2024), gplots v. 3.1.3.1 (Warnes et al. 2024), grateful v. 0.2.4 (Rodriguez-Sanchez and Jackson 2023), gridExtra v. 2.3 (Auguie 2017), gt v. 0.10.1 (Iannone et al. 2024), here v. 1.0.1 (Müller 2020), Hmisc v. 5.1.3 (Harrell Jr 2024), igraph v. 2.0.3 (Csardi and Nepusz 2006; Csárdi et al. 2024), kableExtra v. 1.4.0 (Zhu 2024), knitr v. 1.47 (Xie 2014, 2015, 2024), lattice v. 0.22.6 (Sarkar 2008), mice v. 3.16.0 (van Buuren and Groothuis-Oudshoorn 2011), microbenchmark v. 1.4.10 (Mersmann 2023), NCmisc v. 1.2.0 (Cooper 2022), quarto v. 1.4 (Allaire and Dervieux 2024), randomForest v. 4.7.1.1 (Liaw and Wiener 2002), renv v. 1.0.7 (Ushey and Wickham 2024), rmarkdown v. 2.27 (Xie, Allaire, and Grolemund 2018; Xie, Dervieux, and Riederer 2020; Allaire et al. 2024), rsq v. 2.6 (Zhang 2023), RSQLite v. 2.3.7 (Müller et al. 2024), scales v. 1.3.0 (Wickham, Pedersen, and Seidel 2023), splines v. 4.4.1 (R Core Team 2024b), tidymodels v. 1.2.0 (Kuhn and Wickham 2020), tidyverse v. 2.0.0 (Wickham et al. 2019), tree v. 1.0.43 (Ripley 2023), vcd v. 1.4.12 (Meyer, Zeileis, and Hornik 2006; Zeileis, Meyer, and Hornik 2007; Meyer et al. 2023), VIM v. 6.2.2 (Kowarik and Templ 2016), yardstick v. 1.3.1 (Kuhn, Vaughan, and Hvitfeldt 2024), running in RStudio v. 2024.4.2.764 (Posit team 2024).

This work was completed using Quarto https://quarto.org/ version 1.4.553.

Allaire, JJ, and Christophe Dervieux. 2024. quarto: R Interface to “Quarto” Markdown Publishing System. https://CRAN.R-project.org/package=quarto.

Allaire, JJ, Yihui Xie, Christophe Dervieux, Jonathan McPherson, Javier Luraschi, Kevin Ushey, Aron Atkins, et al. 2024. rmarkdown: Dynamic Documents for R. https://github.com/rstudio/rmarkdown.

Auguie, Baptiste. 2017. gridExtra: Miscellaneous Functions for “Grid” Graphics. https://CRAN.R-project.org/package=gridExtra.

Berends, Matthijs S., Christian F. Luz, Alexander W. Friedrich, Bhanu N. M. Sinha, Casper J. Albers, and Corinna Glasner. 2022. “AMR: An R Package for Working with Antimicrobial Resistance Data.” Journal of Statistical Software 104 (3): 1–31. https://doi.org/10.18637/jss.v104.i03.

Cooper, Nicholas. 2022. NCmisc: Miscellaneous Functions for Creating Adaptive Functions and Scripts. https://CRAN.R-project.org/package=NCmisc.

Csardi, Gabor, and Tamas Nepusz. 2006. “The Igraph Software Package for Complex Network Research.” InterJournal Complex Systems: 1695. https://igraph.org.

Csárdi, Gábor, Tamás Nepusz, Vincent Traag, Szabolcs Horvát, Fabio Zanini, Daniel Noom, and Kirill Müller. 2024. igraph: Network Analysis and Visualization in R. https://doi.org/10.5281/zenodo.7682609.

de Vries, Andrie, and Brian D. Ripley. 2024. ggdendro: Create Dendrograms and Tree Diagrams Using “ggplot2”. https://CRAN.R-project.org/package=ggdendro.

Rodriguez-Sanchez, Francisco, and Connor P. Jackson. 2023. grateful: Facilitate Citation of R Packages. https://pakillo.github.io/grateful/.

Harrell Jr, Frank E. 2024. Hmisc: Harrell Miscellaneous. https://CRAN.R-project.org/package=Hmisc.

Heinzen, Ethan, Jason Sinnwell, Elizabeth Atkinson, Tina Gunderson, and Gregory Dougherty. 2021. arsenal: An Arsenal of “R” Functions for Large-Scale Statistical Summaries. https://CRAN.R-project.org/package=arsenal.

Hester, Jim, and Gábor Csárdi. 2024. archive: Multi-Format Archive and Compression Support. https://CRAN.R-project.org/package=archive.

Honaker, James, Gary King, and Matthew Blackwell. 2011. “Amelia II: A Program for Missing Data.” Journal of Statistical Software 45 (7): 1–47. https://doi.org/10.18637/jss.v045.i07.

Hornik, Kurt. 2005. “A CLUE for CLUster Ensembles.” Journal of Statistical Software 14 (12). https://doi.org/10.18637/jss.v014.i12.

———. 2023. clue: Cluster Ensembles. https://CRAN.R-project.org/package=clue.

Iannone, Richard, Joe Cheng, Barret Schloerke, Ellis Hughes, Alexandra Lauer, and JooYoung Seo. 2024. gt: Easily Create Presentation-Ready Display Tables. https://CRAN.R-project.org/package=gt.

Iannone, Richard, and Olivier Roy. 2024. DiagrammeR: Graph/Network Visualization. https://CRAN.R-project.org/package=DiagrammeR.

Kowarik, Alexander, and Matthias Templ. 2016. “Imputation with the R Package VIM.” Journal of Statistical Software 74 (7): 1–16. https://doi.org/10.18637/jss.v074.i07.

Kuhn, Max, Simon Jackson, and Jorge Cimentada. 2022. corrr: Correlations in R. https://CRAN.R-project.org/package=corrr.

Kuhn, Max, Davis Vaughan, and Emil Hvitfeldt. 2024. yardstick: Tidy Characterizations of Model Performance. https://CRAN.R-project.org/package=yardstick.

Kuhn, Max, and Hadley Wickham. 2020. Tidymodels: A Collection of Packages for Modeling and Machine Learning Using Tidyverse Principles. https://www.tidymodels.org.

Kuhn, Max. 2008. “Building Predictive Models in R Using the caret Package.” Journal of Statistical Software 28 (5): 1–26. https://doi.org/10.18637/jss.v028.i05.

Liaw, Andy, and Matthew Wiener. 2002. “Classification and Regression by randomForest.” R News 2 (3): 18–22. https://CRAN.R-project.org/doc/Rnews/.

Mersmann, Olaf. 2023. microbenchmark: Accurate Timing Functions. https://CRAN.R-project.org/package=microbenchmark.

Meyer, David, Achim Zeileis, and Kurt Hornik. 2006. “The Strucplot Framework: Visualizing Multi-Way Contingency Tables with Vcd.” Journal of Statistical Software 17 (3): 1–48. https://doi.org/10.18637/jss.v017.i03.

Meyer, David, Achim Zeileis, Kurt Hornik, and Michael Friendly. 2023. vcd: Visualizing Categorical Data. https://CRAN.R-project.org/package=vcd.

Mora, Pedro M. Valero. 2018. “Bookdown: Authoring Books and Technical Documents with R Markdown.” Journal of Statistical Software 87 (Book Review 1). https://doi.org/10.18637/jss.v087.b01.

Mühleisen, Hannes, and Mark Raasveldt. 2024. duckdb: DBI Package for the DuckDB Database Management System. https://CRAN.R-project.org/package=duckdb.

Müller, Kirill. 2020. here: A Simpler Way to Find Your Files. https://CRAN.R-project.org/package=here.

Müller, Kirill, Hadley Wickham, David A. James, and Seth Falcon. 2024. RSQLite: SQLite Interface for R. https://CRAN.R-project.org/package=RSQLite.

Ooms, Jeroen. 2023. credentials: Tools for Managing SSH and Git Credentials. https://docs.ropensci.org/credentials/ https://r-lib.r-universe.dev/credentials.

Pedersen, Thomas Lin. 2024. ggraph: An Implementation of Grammar of Graphics for Graphs and Networks. https://CRAN.R-project.org/package=ggraph.

Posit team. 2024. RStudio: Integrated Development Environment for R. Boston, MA: Posit Software, PBC. http://www.posit.co/.

R Core Team. 2024a. R: A Language and Environment for Statistical Computing. Vienna, Austria: R Foundation for Statistical Computing. https://www.R-project.org/.

———. 2024b. R: A Language and Environment for Statistical Computing. Vienna, Austria: R Foundation for Statistical Computing. https://www.R-project.org/.

R Special Interest Group on Databases (R-SIG-DB), Hadley Wickham, and Kirill Müller. 2024. DBI: R Database Interface. https://CRAN.R-project.org/package=DBI.

Ripley, Brian. 2023. tree: Classification and Regression Trees. https://CRAN.R-project.org/package=tree.

Sarkar, Deepayan. 2008. Lattice: Multivariate Data Visualization with R. New York: Springer. http://lmdvr.r-forge.r-project.org.

Schloerke, Barret, Di Cook, Joseph Larmarange, Francois Briatte, Moritz Marbach, Edwin Thoen, Amos Elberg, and Jason Crowley. 2024. GGally: Extension to “ggplot2”. https://CRAN.R-project.org/package=GGally.

Ushey, Kevin, and Hadley Wickham. 2024. renv: Project Environments. https://CRAN.R-project.org/package=renv.

van Buuren, Stef, and Karin Groothuis-Oudshoorn. 2011. “mice: Multivariate Imputation by Chained Equations in R.” Journal of Statistical Software 45 (3): 1–67. https://doi.org/10.18637/jss.v045.i03.

Vaughan, Davis, and Matt Dancho. 2022. furrr: Apply Mapping Functions in Parallel Using Futures. https://CRAN.R-project.org/package=furrr.

Warnes, Gregory R., Ben Bolker, Lodewijk Bonebakker, Robert Gentleman, Wolfgang Huber, Andy Liaw, Thomas Lumley, et al. 2024. gplots: Various R Programming Tools for Plotting Data. https://CRAN.R-project.org/package=gplots.

Wei, Taiyun, and Viliam Simko. 2021. R Package “corrplot”: Visualization of a Correlation Matrix. https://github.com/taiyun/corrplot.

Wickham, Hadley, Mara Averick, Jennifer Bryan, Winston Chang, Lucy D’Agostino McGowan, Romain François, Garrett Grolemund, et al. 2019. “Welcome to the tidyverse.” Journal of Open Source Software 4 (43): 1686. https://doi.org/10.21105/joss.01686.

Wickham, Hadley, and Garrett Grolemund. 2017. R for Data Science: Import, Tidy, Transform, Visualize, and Model Data. 1st ed. O’Reilly Media, Inc.

Wickham, Hadley, Thomas Lin Pedersen, and Dana Seidel. 2023. scales: Scale Functions for Visualization. https://CRAN.R-project.org/package=scales.

Winter, Bodo. 2019. “The Tidyverse and Reproducible R Workflows.” In, 27–52. Routledge. https://doi.org/10.4324/9781315165547-2.

Xie, Yihui. 2014. “knitr: A Comprehensive Tool for Reproducible Research in R.” In Implementing Reproducible Computational Research, edited by Victoria Stodden, Friedrich Leisch, and Roger D. Peng. Chapman and Hall/CRC.

———. 2015. Dynamic Documents with R and Knitr. 2nd ed. Boca Raton, Florida: Chapman and Hall/CRC. https://yihui.org/knitr/.

———. 2016. Bookdown. Chapman and Hall/CRC. https://doi.org/10.1201/9781315204963.

———. 2022. knitr: A General-Purpose Package for Dynamic Report Generation in R. https://yihui.org/knitr/.

———. 2024. knitr: A General-Purpose Package for Dynamic Report Generation in R. https://yihui.org/knitr/.

Xie, Yihui, J. J. Allaire, and Garrett Grolemund. 2018. R Markdown: The Definitive Guide. Boca Raton, Florida: Chapman and Hall/CRC. https://bookdown.org/yihui/rmarkdown.

Xie, Yihui, Christophe Dervieux, and Emily Riederer. 2020. R Markdown Cookbook. Boca Raton, Florida: Chapman and Hall/CRC. https://bookdown.org/yihui/rmarkdown-cookbook.

Zeileis, Achim, David Meyer, and Kurt Hornik. 2007. “Residual-Based Shadings for Visualizing (Conditional) Independence.” Journal of Computational and Graphical Statistics 16 (3): 507–25. https://doi.org/10.1198/106186007X237856.

Zhang, Dabao. 2023. rsq: R-Squared and Related Measures. https://CRAN.R-project.org/package=rsq.

Zhu, Hao. 2024. kableExtra: Construct Complex Table with “kable” and Pipe Syntax. https://CRAN.R-project.org/package=kableExtra.
