This work was completed using R version 4.4.0 (R Core Team 2024) with the following R packages: Amelia v. 1.8.2 (Honaker, King, and Blackwell 2011), arsenal v. 3.6.3 (Heinzen et al. 2021), caret v. 6.0.94 (Kuhn 2008), class v. 7.3.22 (Venables and Ripley 2002), clue v. 0.3.65 (Hornik 2005, 2023), compare v. 0.2.6 (Murrell 2015), corrplot v. 0.92 (Wei and Simko 2021), data.table v. 1.15.4 (Barrett et al. 2024), doParallel v. 1.0.17 (Microsoft Corporation and Weston 2022), e1071 v. 1.7.14 (Meyer, Dimitriadou, et al. 2023), factoextra v. 1.0.7 (Kassambara and Mundt 2020), gcookbook v. 2.0 (Chang 2018), GGally v. 2.2.1 (Schloerke et al. 2024), ggbernie v. 1.0 (R CODER 2024), ggbiplot v. 0.6.2 (Vu and Friendly 2024), ggdendro v. 0.2.0 (de Vries and Ripley 2024), glmnet v. 4.1.8 (Friedman, Tibshirani, and Hastie 2010; Simon et al. 2011; Tay, Narasimhan, and Hastie 2023), glue v. 1.7.0 (Hester and Bryan 2024), gplots v. 3.1.3.1 (Warnes et al. 2024), grateful v. 0.2.4 (Rodriguez-Sanchez and Jackson 2023), gtools v. 3.9.5 (Warnes et al. 2023), gtrendsR v. 1.5.1 (Massicotte and Eddelbuettel 2022), here v. 1.0.1 (Müller 2020), Hmisc v. 5.1.2 (Harrell Jr 2024), ISLR v. 1.4 (James et al. 2021), kableExtra v. 1.4.0 (Zhu 2024), klaR v. 1.7.3 (Weihs et al. 2005), knitr v. 1.46 (Xie 2014, 2015, 2024), labelled v. 2.13.0 (Larmarange 2024), lattice v. 0.22.6 (Sarkar 2008), mice v. 3.16.0 (van Buuren and Groothuis-Oudshoorn 2011), networkD3 v. 0.4 (Allaire et al. 2017), NHANES v. 2.1.0 (Pruim 2015), psych v. 2.4.3 (Revelle 2024), quarto v. 1.4 (Allaire and Dervieux 2024), randomForest v. 4.7.1.1 (Liaw and Wiener 2002), renv v. 1.0.7 (Ushey and Wickham 2024), reprtree v. 0.6 (Dasgupta 2014), rmarkdown v. 2.27 (Xie, Allaire, and Grolemund 2018; Xie, Dervieux, and Riederer 2020; Allaire et al. 2024), rpart v. 4.1.23 (Therneau and Atkinson 2023), rpart.plot v. 3.1.2 (Milborrow 2024), rsample v. 1.2.1 (Frick et al. 2024), rsq v. 2.6 (Zhang 2023), scales v. 1.3.0 (Wickham, Pedersen, and Seidel 2023), skimr v. 2.1.5 (Waring et al. 2022), sqldf v. 0.4.11 (Grothendieck 2017), tableone v. 0.13.2 (Yoshida and Bartel 2022), tictoc v. 1.2.1 (Izrailev 2024), tidymodels v. 1.2.0 (Kuhn and Wickham 2020), tidyverse v. 2.0.0 (Wickham et al. 2019), tree v. 1.0.43 (Ripley 2023), vcd v. 1.4.12 (Meyer, Zeileis, and Hornik 2006; Zeileis, Meyer, and Hornik 2007; Meyer, Zeileis, et al. 2023), VIM v. 6.2.2 (Kowarik and Templ 2016), viridis v. 0.6.5 (Garnier et al. 2024), yardstick v. 1.3.1 (Kuhn, Vaughan, and Hvitfeldt 2024).
Introduction to Data Science with R
Author
J Kyle Armstrong, PhD
Published
May 18, 2024
Preface
These notes have been developed with:
- R version 4.4.0
- Quarto version 1.4.553
R, RStudio, and Quarto are all free to download and use. You can learn more about R and RStudio, and download both, from https://posit.co/download/rstudio-desktop/.
To learn more about Quarto books visit https://quarto.org/docs/books.
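To check a local setup against the versions listed in this preface, a short script along the following lines may help. It is only a sketch: the variable names (`notes_r_version`, `local_r_version`, `r_status`) are illustrative, and it assumes the Quarto command-line tool, if installed, is reachable on the system `PATH` as `quarto`.

```r
# R version these notes were developed with (from the list in this preface).
notes_r_version <- "4.4.0"

# Version of the R session that is currently running.
local_r_version <- as.character(getRversion())

# compareVersion() returns -1, 0, or 1 when the first version is
# older than, equal to, or newer than the second.
r_status <- utils::compareVersion(local_r_version, notes_r_version)

message("Local R version: ", local_r_version)
if (r_status < 0) {
  message("This is older than the R ", notes_r_version, " used for these notes.")
}

# Assumption: the Quarto CLI is on PATH under the name "quarto".
quarto_path <- unname(Sys.which("quarto"))
if (nzchar(quarto_path)) {
  quarto_cli_version <- system2(quarto_path, "--version", stdout = TRUE)
  message("Quarto CLI version: ", quarto_cli_version)
} else {
  message("Quarto CLI not found on PATH.")
}
```

An older local version does not necessarily mean the examples will fail, but matching the listed versions removes one source of differences in output.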
Allaire, J. J., Christopher Gandrud, Kenton Russell, and CJ Yetman. 2017. networkD3: D3 JavaScript Network Graphs from R. https://CRAN.R-project.org/package=networkD3.
Allaire, J. J., and Christophe Dervieux. 2024. quarto: R Interface to “Quarto” Markdown Publishing System. https://CRAN.R-project.org/package=quarto.
Allaire, J. J., Yihui Xie, Christophe Dervieux, Jonathan McPherson, Javier Luraschi, Kevin Ushey, Aron Atkins, et al. 2024. rmarkdown: Dynamic Documents for R. https://github.com/rstudio/rmarkdown.
Barrett, Tyson, Matt Dowle, Arun Srinivasan, Jan Gorecki, Michael Chirico, and Toby Hocking. 2024. data.table: Extension of “data.frame”. https://CRAN.R-project.org/package=data.table.
Chang, Winston. 2018. gcookbook: Data for “R Graphics Cookbook”. https://CRAN.R-project.org/package=gcookbook.
R CODER. 2024. ggbernie: A Geom for Adding Bernies. https://github.com/R-CoderDotCom/ggbernie.
Microsoft Corporation, and Steve Weston. 2022. doParallel: Foreach Parallel Adaptor for the “parallel” Package. https://github.com/RevolutionAnalytics/doparallel.
Dasgupta, Abhijit. 2014. reprtree: Representative Trees from Ensembles. https://github.com/araastat/reprtree.
de Vries, Andrie, and Brian D. Ripley. 2024. ggdendro: Create Dendrograms and Tree Diagrams Using “ggplot2”. https://CRAN.R-project.org/package=ggdendro.
Rodriguez-Sanchez, Francisco, and Connor P. Jackson. 2023. grateful: Facilitate Citation of R Packages. https://pakillo.github.io/grateful/.
Frick, Hannah, Fanny Chow, Max Kuhn, Michael Mahoney, Julia Silge, and Hadley Wickham. 2024. rsample: General Resampling Infrastructure. https://CRAN.R-project.org/package=rsample.
Friedman, Jerome, Robert Tibshirani, and Trevor Hastie. 2010. “Regularization Paths for Generalized Linear Models via Coordinate Descent.” Journal of Statistical Software 33 (1): 1–22. https://doi.org/10.18637/jss.v033.i01.
Garnier, Simon, Noam Ross, Robert Rudis, Antônio Pedro Camargo, et al. 2024. viridis(Lite) - Colorblind-Friendly Color Maps for R. https://doi.org/10.5281/zenodo.4679423.
Grothendieck, G. 2017. sqldf: Manipulate R Data Frames Using SQL. https://CRAN.R-project.org/package=sqldf.
Harrell Jr, Frank E. 2024. Hmisc: Harrell Miscellaneous. https://CRAN.R-project.org/package=Hmisc.
Heinzen, Ethan, Jason Sinnwell, Elizabeth Atkinson, Tina Gunderson, and Gregory Dougherty. 2021. arsenal: An Arsenal of “R” Functions for Large-Scale Statistical Summaries. https://CRAN.R-project.org/package=arsenal.
Hester, Jim, and Jennifer Bryan. 2024. glue: Interpreted String Literals. https://glue.tidyverse.org/.
Honaker, James, Gary King, and Matthew Blackwell. 2011. “Amelia II: A Program for Missing Data.” Journal of Statistical Software 45 (7): 1–47. https://doi.org/10.18637/jss.v045.i07.
Hornik, Kurt. 2005. “A CLUE for CLUster Ensembles.” Journal of Statistical Software 14 (12). https://doi.org/10.18637/jss.v014.i12.
———. 2023. clue: Cluster Ensembles. https://CRAN.R-project.org/package=clue.
Izrailev, Sergei. 2024. tictoc: Functions for Timing R Scripts, as Well as Implementations of “Stack” and “StackList” Structures. https://CRAN.R-project.org/package=tictoc.
James, Gareth, Daniela Witten, Trevor Hastie, and Rob Tibshirani. 2021. ISLR: Data for an Introduction to Statistical Learning with Applications in R. https://CRAN.R-project.org/package=ISLR.
Kassambara, Alboukadel, and Fabian Mundt. 2020. factoextra: Extract and Visualize the Results of Multivariate Data Analyses. https://CRAN.R-project.org/package=factoextra.
Kowarik, Alexander, and Matthias Templ. 2016. “Imputation with the R Package VIM.” Journal of Statistical Software 74 (7): 1–16. https://doi.org/10.18637/jss.v074.i07.
Kuhn, Max, Davis Vaughan, and Emil Hvitfeldt. 2024. yardstick: Tidy Characterizations of Model Performance. https://CRAN.R-project.org/package=yardstick.
Kuhn, Max, and Hadley Wickham. 2020. tidymodels: A Collection of Packages for Modeling and Machine Learning Using Tidyverse Principles. https://www.tidymodels.org.
Kuhn, Max. 2008. “Building Predictive Models in R Using the caret Package.” Journal of Statistical Software 28 (5): 1–26. https://doi.org/10.18637/jss.v028.i05.
Larmarange, Joseph. 2024. labelled: Manipulating Labelled Data. https://CRAN.R-project.org/package=labelled.
Liaw, Andy, and Matthew Wiener. 2002. “Classification and Regression by randomForest.” R News 2 (3): 18–22. https://CRAN.R-project.org/doc/Rnews/.
Massicotte, Philippe, and Dirk Eddelbuettel. 2022. gtrendsR: Perform and Display Google Trends Queries. https://CRAN.R-project.org/package=gtrendsR.
Meyer, David, Evgenia Dimitriadou, Kurt Hornik, Andreas Weingessel, and Friedrich Leisch. 2023. e1071: Misc Functions of the Department of Statistics, Probability Theory Group (Formerly: E1071), TU Wien. https://CRAN.R-project.org/package=e1071.
Meyer, David, Achim Zeileis, and Kurt Hornik. 2006. “The Strucplot Framework: Visualizing Multi-Way Contingency Tables with vcd.” Journal of Statistical Software 17 (3): 1–48. https://doi.org/10.18637/jss.v017.i03.
Meyer, David, Achim Zeileis, Kurt Hornik, and Michael Friendly. 2023. vcd: Visualizing Categorical Data. https://CRAN.R-project.org/package=vcd.
Milborrow, Stephen. 2024. rpart.plot: Plot “rpart” Models: An Enhanced Version of “plot.rpart”. https://CRAN.R-project.org/package=rpart.plot.
Müller, Kirill. 2020. here: A Simpler Way to Find Your Files. https://CRAN.R-project.org/package=here.
Murrell, Paul. 2015. compare: Comparing Objects for Differences. https://CRAN.R-project.org/package=compare.
Pruim, Randall. 2015. NHANES: Data from the US National Health and Nutrition Examination Study. https://CRAN.R-project.org/package=NHANES.
R Core Team. 2024. R: A Language and Environment for Statistical Computing. Vienna, Austria: R Foundation for Statistical Computing. https://www.R-project.org/.
Ripley, Brian. 2023. tree: Classification and Regression Trees. https://CRAN.R-project.org/package=tree.
Sarkar, Deepayan. 2008. Lattice: Multivariate Data Visualization with R. New York: Springer. http://lmdvr.r-forge.r-project.org.
Schloerke, Barret, Di Cook, Joseph Larmarange, Francois Briatte, Moritz Marbach, Edwin Thoen, Amos Elberg, and Jason Crowley. 2024. GGally: Extension to “ggplot2”. https://CRAN.R-project.org/package=GGally.
Simon, Noah, Jerome Friedman, Robert Tibshirani, and Trevor Hastie. 2011. “Regularization Paths for Cox’s Proportional Hazards Model via Coordinate Descent.” Journal of Statistical Software 39 (5): 1–13. https://doi.org/10.18637/jss.v039.i05.
Tay, J. Kenneth, Balasubramanian Narasimhan, and Trevor Hastie. 2023. “Elastic Net Regularization Paths for All Generalized Linear Models.” Journal of Statistical Software 106 (1): 1–31. https://doi.org/10.18637/jss.v106.i01.
Therneau, Terry, and Beth Atkinson. 2023. rpart: Recursive Partitioning and Regression Trees. https://CRAN.R-project.org/package=rpart.
Ushey, Kevin, and Hadley Wickham. 2024. renv: Project Environments. https://CRAN.R-project.org/package=renv.
van Buuren, Stef, and Karin Groothuis-Oudshoorn. 2011. “mice: Multivariate Imputation by Chained Equations in R.” Journal of Statistical Software 45 (3): 1–67. https://doi.org/10.18637/jss.v045.i03.
Venables, W. N., and B. D. Ripley. 2002. Modern Applied Statistics with S. 4th ed. New York: Springer. https://www.stats.ox.ac.uk/pub/MASS4/.
Vu, Vincent Q., and Michael Friendly. 2024. ggbiplot: A Grammar of Graphics Implementation of Biplots. https://CRAN.R-project.org/package=ggbiplot.
Waring, Elin, Michael Quinn, Amelia McNamara, Eduardo Arino de la Rubia, Hao Zhu, and Shannon Ellis. 2022. skimr: Compact and Flexible Summaries of Data. https://CRAN.R-project.org/package=skimr.
Warnes, Gregory R., Ben Bolker, Lodewijk Bonebakker, Robert Gentleman, Wolfgang Huber, Andy Liaw, Thomas Lumley, et al. 2024. gplots: Various R Programming Tools for Plotting Data. https://CRAN.R-project.org/package=gplots.
Warnes, Gregory R., Ben Bolker, Thomas Lumley, Arni Magnusson, Bill Venables, Genei Ryodan, and Steffen Moeller. 2023. gtools: Various R Programming Tools. https://CRAN.R-project.org/package=gtools.
Wei, Taiyun, and Viliam Simko. 2021. R Package “corrplot”: Visualization of a Correlation Matrix. https://github.com/taiyun/corrplot.
Weihs, Claus, Uwe Ligges, Karsten Luebke, and Nils Raabe. 2005. “klaR Analyzing German Business Cycles.” In Data Analysis and Decision Support, edited by D. Baier, R. Decker, and L. Schmidt-Thieme, 335–43. Berlin: Springer-Verlag.
Wickham, Hadley, Mara Averick, Jennifer Bryan, Winston Chang, Lucy D’Agostino McGowan, Romain François, Garrett Grolemund, et al. 2019. “Welcome to the tidyverse.” Journal of Open Source Software 4 (43): 1686. https://doi.org/10.21105/joss.01686.
Wickham, Hadley, Thomas Lin Pedersen, and Dana Seidel. 2023. scales: Scale Functions for Visualization. https://CRAN.R-project.org/package=scales.
Revelle, William. 2024. psych: Procedures for Psychological, Psychometric, and Personality Research. Evanston, Illinois: Northwestern University. https://CRAN.R-project.org/package=psych.
Xie, Yihui. 2014. “knitr: A Comprehensive Tool for Reproducible Research in R.” In Implementing Reproducible Computational Research, edited by Victoria Stodden, Friedrich Leisch, and Roger D. Peng. Chapman and Hall/CRC.
———. 2015. Dynamic Documents with R and knitr. 2nd ed. Boca Raton, Florida: Chapman and Hall/CRC. https://yihui.org/knitr/.
———. 2024. knitr: A General-Purpose Package for Dynamic Report Generation in R. https://yihui.org/knitr/.
Xie, Yihui, J. J. Allaire, and Garrett Grolemund. 2018. R Markdown: The Definitive Guide. Boca Raton, Florida: Chapman and Hall/CRC. https://bookdown.org/yihui/rmarkdown.
Xie, Yihui, Christophe Dervieux, and Emily Riederer. 2020. R Markdown Cookbook. Boca Raton, Florida: Chapman and Hall/CRC. https://bookdown.org/yihui/rmarkdown-cookbook.
Yoshida, Kazuki, and Alexander Bartel. 2022. tableone: Create “Table 1” to Describe Baseline Characteristics with or Without Propensity Score Weights. https://CRAN.R-project.org/package=tableone.
Zeileis, Achim, David Meyer, and Kurt Hornik. 2007. “Residual-Based Shadings for Visualizing (Conditional) Independence.” Journal of Computational and Graphical Statistics 16 (3): 507–25. https://doi.org/10.1198/106186007X237856.
Zhang, Dabao. 2023. rsq: R-Squared and Related Measures. https://CRAN.R-project.org/package=rsq.
Zhu, Hao. 2024. kableExtra: Construct Complex Table with “kable” and Pipe Syntax. https://CRAN.R-project.org/package=kableExtra.