Overview
In this course we will review some of the tools of the trade, namely R’s tidyverse (Wickham and Grolemund 2017; Winter 2019), a collection of R packages designed with a common framework to aid in common data wrangling and data management tasks.
Data Wrangling is one subset of skills within the Data Science Process. We will carefully investigate how decisions made while collecting and preparing the data have downstream effects on model performance.
Analysis is worthless if it goes uncommunicated. Stakeholders need regular, up-to-date information to act upon. Luckily, RMarkdown (Xie, Dervieux, and Riederer 2020; Xie, Allaire, and Grolemund 2018) and Quarto, with the aid of the knitr (Xie 2022, 2015) and shiny packages, let R integrate seamlessly into reporting tasks, including:
- HTML web pages
- Dashboards
- PowerPoint presentations
- Word documents
- Excel workbooks
- PDF reports
- Journal Submissions
- Books (Xie, Allaire, and Grolemund 2018; Xie 2016; Mora 2018)
RStudio features a Visual Markdown Editor, which is very nice if you want to work on editing reports or documents and prefer a “what-you-see-is-what-you-get” interface over code. I find flipping between the two handy when I am wondering what a document might look like before rendering it; it has been a time saver in developing this text!
Finally, combined with R’s modeling capabilities, the entire data science process, from data ingestion to modeling to package development and version control, can all be managed nicely from an RStudio Console.
We will go through what some might call a boilerplate pass and walk through how to get started with these various tools to solve common data questions.
The primary resource is an SQLite database made from downloading various files from the NHANES data source:
The goal here is to give the experience of connecting to and working with a database with R: collecting data from a database, running statistical analyses, and making inferences as to which features would perform well when fitting predictive models.
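The connect-and-query workflow described above can be sketched as follows. This is a minimal illustration, not the book's actual setup: it assumes the DBI and RSQLite packages are installed, it uses an in-memory SQLite database in place of the NHANES file, and the table `toy_demo` and its columns are hypothetical.

```r
# Assumes the DBI and RSQLite packages are installed.
library(DBI)

# An in-memory SQLite database stands in for the NHANES database file.
con <- dbConnect(RSQLite::SQLite(), ":memory:")

# Hypothetical toy table; the real database has its own tables and columns.
toy <- data.frame(
  id       = 1:6,
  age      = c(34, 51, 67, 45, 72, 29),
  diabetic = c(0, 1, 1, 0, 1, 0)
)
dbWriteTable(con, "toy_demo", toy)

# List the tables the connection can see.
print(dbListTables(con))

# The aggregation runs inside the database; only the summary comes back to R.
age_by_status <- dbGetQuery(
  con,
  "SELECT diabetic, COUNT(*) AS n, AVG(age) AS mean_age
     FROM toy_demo
    GROUP BY diabetic
    ORDER BY diabetic"
)
print(age_by_status)

# Always release the connection when finished.
dbDisconnect(con)
```

Pushing the summary into SQL keeps the data in the database until it is needed, which matters more as tables grow.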
PART I - Welcome to R
In the first part of the text, we will cover getting started with R, install the packages that we will utilize throughout the rest of the book, and introduce the tidyverse.
- 2 Getting Started with R - Getting Started with R
- 3 The tidyverse - The Tidyverse
PART II - Feature Engineering & Data Visualization
In this part we will define a few features, targets, and other data points of interest, including Gender, Age, Diabetic status, and Age at Diabetes. We will briefly use these features to discuss plotting with ggplot2 in R.
- 4 Feature Engineering - Feature Engineering
- 5 The Anatomy of ggplot - The Anatomy of ggplot
PART III - From Exploratory Data Analysis to Predictive Models
We use our data-set with the few features we defined in the last part and review statistical tests such as the t-test, KS test, chi-square test, and ANOVA. We will showcase the relationship between the p-values of statistical tests and the corresponding model accuracy. We discuss two-factor classification with:
- 6 Two Factor Classification with a Single Continuous Feature - A single continuous feature
- 7 Two Factor Classification with Categorical and Continuous Interactions - categorical features and interactions
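As a rough illustration of the link between test p-values and model accuracy for a single continuous feature, the sketch below uses only base R. The data are simulated stand-ins, not NHANES values; the group means, sample size, and the 0.5 classification cutoff are all assumptions made for the example.

```r
set.seed(1)

# Simulated stand-in data: hypothetical ages for two groups.
n <- 200
age_non_diabetic <- rnorm(n, mean = 40, sd = 12)
age_diabetic     <- rnorm(n, mean = 60, sd = 12)

dat <- data.frame(
  age      = c(age_non_diabetic, age_diabetic),
  diabetic = factor(rep(c("no", "yes"), each = n))
)

# Welch two-sample t-test: do the group means differ?
tt <- t.test(age ~ diabetic, data = dat)

# Two-sample Kolmogorov-Smirnov test: do the distributions differ?
ks <- ks.test(age_non_diabetic, age_diabetic)

# Logistic regression using age as the only feature.
fit <- glm(diabetic ~ age, data = dat, family = binomial)

# Classify as "yes" when the fitted probability exceeds 0.5.
pred <- ifelse(predict(fit, type = "response") > 0.5, "yes", "no")

# In-sample accuracy; a held-out set would give a fairer estimate.
accuracy <- mean(pred == dat$diabetic)

cat("t-test p-value: ", tt$p.value, "\n")
cat("KS test p-value:", ks$p.value, "\n")
cat("Accuracy:       ", accuracy, "\n")
```

Moving the two group means closer together and rerunning should raise the p-values and pull the accuracy toward 0.5, which is the relationship this part explores.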
PART IV - Data Analytics at Scale
It’s unrealistic that we will have only 3 or 4 features to review, so we need to understand how to make R work for us.
We have provided features over three domains of interest:
- 16.1 Demographic Features - Demographic - Feature Engineering
- 16.2 Lab Features - Labs - Feature Engineering
- 16.3 Examination Features - Examination - Feature Engineering
For the most part, these features are mapped in with methods similar to those we utilized in Part II; however, there are some issues when dealing with the Lab data.
We will utilize these domains to create a new analytic data-set with hundreds of columns and then discuss:
- 8 Functional dbplyr, purrr, and furrr - Functional dbplyr, purrr, and furrr
- 9 Exploratory Data Analysis at Scale - Exploratory Data Analysis at Scale
PART V - Time Series and Random Forest
- 10 Random Forest Date/Time Modeling Example - Diabetic monitoring; time-series classification
PART VI - Communication, Packages, & Git
A large aspect of Data Science is communication of results to stakeholders. In this part we will introduce Shiny and flex_dashboards, as well as discuss options when we knit an R markdown file.
- 11 Quarto & RMarkdown - Quarto & RMarkdown
- 12 Shiny - Shiny
- 13 Create an R package - Create an R package
- 14 Git - Git
This work was completed using R version 4.4.1 (R Core Team 2024a) with the following R packages: Amelia v. 1.8.2 (Honaker, King, and Blackwell 2011), AMR v. 2.1.1 (Berends et al. 2022), archive v. 1.1.8 (Hester and Csárdi 2024), arsenal v. 3.6.3 (Heinzen et al. 2021), caret v. 6.0.94 (Kuhn and Max 2008), clue v. 0.3.65 (Hornik 2005, 2023), corrplot v. 0.92 (Wei and Simko 2021), corrr v. 0.4.4 (Kuhn, Jackson, and Cimentada 2022), credentials v. 2.0.1 (Ooms 2023), DBI v. 1.2.3 (R Special Interest Group on Databases (R-SIG-DB), Wickham, and Müller 2024), DiagrammeR v. 1.0.11 (Iannone and Roy 2024), duckdb v. 1.0.0 (Mühleisen and Raasveldt 2024), furrr v. 0.3.1 (Vaughan and Dancho 2022), future v. 1.33.2, GGally v. 2.2.1 (Schloerke et al. 2024), ggdendro v. 0.2.0 (de Vries and Ripley 2024), ggraph v. 2.2.1 (Pedersen 2024), gplots v. 3.1.3.1 (Warnes et al. 2024), grateful v. 0.2.4 (Francisco Rodriguez-Sanchez and Connor P. Jackson 2023), gridExtra v. 2.3 (Auguie 2017), gt v. 0.10.1 (Iannone et al. 2024), here v. 1.0.1 (Müller 2020), Hmisc v. 5.1.3 (Harrell Jr 2024), igraph v. 2.0.3 (Csardi and Nepusz 2006; Csárdi et al. 2024), kableExtra v. 1.4.0 (Zhu 2024), knitr v. 1.47 (Xie 2014, 2015, 2024), lattice v. 0.22.6 (Sarkar 2008), mice v. 3.16.0 (van Buuren and Groothuis-Oudshoorn 2011), microbenchmark v. 1.4.10 (Mersmann 2023), NCmisc v. 1.2.0 (Cooper 2022), quarto v. 1.4 (Allaire and Dervieux 2024), randomForest v. 4.7.1.1 (Liaw and Wiener 2002), renv v. 1.0.7 (Ushey and Wickham 2024), rmarkdown v. 2.27 (Xie, Allaire, and Grolemund 2018; Xie, Dervieux, and Riederer 2020; Allaire et al. 2024), rsq v. 2.6 (Zhang 2023), RSQLite v. 2.3.7 (Müller et al. 2024), scales v. 1.3.0 (Wickham, Pedersen, and Seidel 2023), splines v. 4.4.1 (R Core Team 2024b), tidymodels v. 1.2.0 (Kuhn and Wickham 2020), tidyverse v. 2.0.0 (Wickham et al. 2019), tree v. 1.0.43 (Ripley 2023), vcd v. 1.4.12 (Meyer, Zeileis, and Hornik 2006; Zeileis, Meyer, and Hornik 2007; Meyer et al. 2023), VIM v. 6.2.2 (Kowarik and Templ 2016), yardstick v. 1.3.1 (Kuhn, Vaughan, and Hvitfeldt 2024), running in RStudio v. 2024.4.2.764 (Posit team 2024).
This work was completed using Quarto https://quarto.org/ version 1.4.553.