Data Science At Scale

In the last Part, we discussed statistical tests, including the t-test, chi-square test, KS test, and ANOVA, on a couple of features in our analytic data set. We compared and contrasted the resulting p-values and traced the downstream effect of each feature in a simple logistic regression. This illustrated the importance of Exploratory Data Analysis. However, today’s data scientist has to contend with data sets consisting of hundreds or thousands of features. In this Part we consider the scalability of our analytics: how do we perform Exploratory Data Analysis on hundreds of features in a data set?

We define three new classes of features arising from the Demographics (Section 16.1 Demographic Features), Labs (Section 16.2 Lab Features), and Examination (Section 16.3 Examination Features) domains. Most of this is a review of how to use case_when; however, the Labs do present some challenges, which we examine in Section 16.2.1 Mapping Column Issues.

Our primary discussion will be around learning how to functionalize some of our processes using R. We have over 100 columns to analyze, and copying and pasting code for each one is ineffective.
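As a rough illustration of what functionalizing buys us, the sketch below wraps a single two-sample t-test in a function and applies it to every feature column, rather than copying the test once per column. It is only a minimal sketch under stated assumptions: the built-in mtcars data stands in for our analytic data set, its 0/1 column am plays the role of a binary outcome, and the helper name t_test_p_value is hypothetical. It uses base R only, so the same pattern could be rewritten with purrr.

```r
# Hypothetical helper: p-value of a Welch two-sample t-test comparing one
# feature across the two levels of a binary outcome column.
t_test_p_value <- function(data, feature, outcome) {
  groups <- split(data[[feature]], data[[outcome]])
  stats::t.test(groups[[1]], groups[[2]])$p.value
}

# Assumption: mtcars is a stand-in data set and `am` (0/1) is the outcome.
features <- setdiff(names(mtcars), "am")

# One call per feature, with no copy-and-paste; returns a named numeric vector.
p_values <- vapply(
  features,
  function(f) t_test_p_value(mtcars, f, "am"),
  numeric(1)
)

# Features ordered by p-value, smallest first.
sort(p_values)
```

Adding another feature to the analysis then means adding a column to the data, not another block of code.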

In Chapter #sec-functional-dbplyr-purrr-and-furrr we will

In Chapter 9  Exploratory Data Analysis at Scale as we continue to analyze the data, we will