Data Science At Scale
In the last Part, we discussed statistical tests, including the t-test, chi-square test, KS test, and ANOVA, on a couple of features in our analytic data set. We compared and contrasted the resulting p-values and traced the downstream effect of each feature in a simple logistic regression. This illustrated the importance of Exploratory Data Analysis (EDA). However, today's data scientist has to contend with data sets consisting of hundreds or thousands of features. In this Part we consider the scalability of our analytics: how do we perform Exploratory Data Analysis on hundreds of features in a data set?
We first define three new classes of features arising from the Demographics (Section 16.1 Demographic Features), Labs (Section 16.2 Lab Features), and Examination (Section 16.3 Examination Features) domains. Most of this is a review of the use of `case_when`; however, the Labs do present some challenges worth examining in Section 16.2.1 Mapping Column Issues.
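As a reminder of the pattern, here is a minimal sketch of deriving a categorical feature with `case_when`. The data frame `demo`, the column `age`, and the cut points are hypothetical and chosen only for illustration; they are not taken from the analytic data set.

```r
library(dplyr)

# Hypothetical demographic data; column names and cut points are assumptions.
demo <- tibble(id = 1:5, age = c(4, 17, 35, 64, NA))

demo_features <- demo %>%
  mutate(
    # Conditions are checked top to bottom; the first match wins.
    age_group = case_when(
      is.na(age) ~ "Unknown",
      age < 18   ~ "Child",
      age < 65   ~ "Adult",
      TRUE       ~ "Senior"
    )
  )

demo_features
```

Handling `NA` first keeps missing ages from falling through to the catch-all branch.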
Our primary discussion will be around learning how to functionalize some of our processes in R: we have over 100 columns to analyze, and copying and pasting code for each one is ineffective.
In Chapter #sec-functional-dbplyr-purrr-and-furrr we will
- introduce several concepts including `enquo` and variable resolution with `!!`.
- showcase the `comparedf` and `tableby` functions in `arsenal`
- use `purrr` and `furrr` to iterate and speed up functions
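To give a flavor of the first and last items, the following is a minimal sketch, not the chapter's actual code. The data frame `toy`, its columns `grp`, `x`, and `y`, and the helper `mean_by()` are all hypothetical names introduced here for illustration.

```r
library(dplyr)
library(purrr)
library(rlang)

# Toy data; `grp`, `x`, and `y` are hypothetical column names.
toy <- tibble(
  grp = c("a", "a", "b", "b"),
  x   = c(1, 3, 5, 7),
  y   = c(10, 20, 30, 40)
)

# enquo() captures the column the caller typed without evaluating it;
# !! then unquotes the captured column inside the dplyr verbs.
mean_by <- function(data, group_col, value_col) {
  group_col <- enquo(group_col)
  value_col <- enquo(value_col)
  data %>%
    group_by(!!group_col) %>%
    summarise(mean_value = mean(!!value_col, na.rm = TRUE), .groups = "drop")
}

mean_by(toy, grp, x)

# To iterate over many columns, keep their names as strings, convert each
# to a symbol with sym(), and unquote it at the call site.
value_cols <- c("x", "y")
summaries <- map(
  set_names(value_cols),
  function(col) mean_by(toy, grp, !!sym(col))
)
```

`summaries` is a named list with one summary tibble per column, so the same function serves two columns or two hundred. `furrr` provides `future_map()` as a parallel counterpart to `map()`; whether it speeds up a given workload depends on the cost of each call.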
In Chapter 9 Exploratory Data Analysis at Scale, as we continue to analyze the data, we will
- discuss variants of normal `dplyr` functions with their `_at` brethren: `mutate_at`, `summarise_at`, `filter_at`, and others.
- discuss missing data analytics & mean value imputation.
- showcase a few different packages that look for correlated features and discuss why we want to look for correlated features.
- review Principal Component Analysis and k-means clustering as means of data reduction.
- give many examples of how to easily define useful functions to aid in your analysis
- showcase `DataExplorer`, `skimr`, and `ggplot2` as packages that assist with automating EDA tasks
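As a small preview of the `_at` variants and mean value imputation, here is a hedged sketch. The data frame `labs`, its columns `glucose` and `hdl`, and the helper `impute_mean()` are hypothetical and exist only for this example.

```r
library(dplyr)

# Hypothetical lab-style columns with missing values.
labs <- tibble(
  id      = 1:4,
  glucose = c(90, NA, 110, 100),
  hdl     = c(50, 60, NA, 40)
)

# Mean value imputation: replace each NA with the mean of the observed values.
impute_mean <- function(x) ifelse(is.na(x), mean(x, na.rm = TRUE), x)

# summarise_at() applies one summary to every column named in vars();
# here it counts the missing values per column.
missing_counts <- labs %>%
  summarise_at(vars(glucose, hdl), ~ sum(is.na(.)))

# mutate_at() applies one transformation to every column named in vars().
labs_imputed <- labs %>%
  mutate_at(vars(glucose, hdl), impute_mean)
```

In dplyr 1.0.0 and later the `_at` variants are superseded by `across()`, although they remain available; the same imputation could be written as `mutate(across(c(glucose, hdl), impute_mean))`.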