Feature engineering is the process of transforming raw data into features that better represent the underlying measurements to be fed into the predictive model.
Definition 4.1 A feature is an individual measurable property or characteristic of a phenomenon being observed that can be used for analysis; some examples might include patient attributes such as Height, Weight, Age, or Gender. Depending on the prediction task or analysis, the features you include in the analytic data-set can vary widely. Features might also be thought of as, or called:
independent variables
predictor variables
input variables
covariates
explanatory variables
risk factors
depending on the context.
Definition 4.2 A target is a feature of interest we wish to gain a deeper understanding of, analyze, or make predictions on. Also called the response or label, we often think of these as the dependent variable or outcome.
Targets in medical machine learning models typically include adverse events or clinical outcomes of various treatments or cohort-designed studies.
Definition 4.3 Data leakage occurs when information that would not normally be available to the model leaks into the training data-set.
For the majority of these exercises we will examine the process of data science through our primary example: we will consider Diabetes from the NHANES database as our primary outcome, and all other available data within the database should be examined for its potential to be transformed into features.
4.2 Example : Diabetes
The target will be a “Yes” response to the question DIQ010:
The next questions are about specific medical conditions. {Other than during pregnancy, {have you/has SP}/{Have you/Has SP}} ever been told by a doctor or health professional that {you have/{he/she/SP} has} diabetes or sugar diabetes?
Identification of the features for models is often an iterative and ongoing process.
Example 4.1 Hypothetically speaking, any respondent that has diabetes has a date at which they first learned that they had diabetes.
Another expectation we might have is that every respondent who reported a valid age at which they first learned of diabetes also probably responded “Yes” to question DIQ010.
Within the context of predicting Diabetes: if we already know the respondent’s age at which they first learned they had diabetes, then we expect the patient already has diabetes; this would be a good example of data leakage.
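As a rough check of that expectation, we could tabulate the DIQ010 responses among respondents with any recorded diagnosis-age value. This is only a sketch; it assumes DIQ is available as a lazy table pointing at the NHANES DIQ table, as it is used later in this chapter.
Code
# Sketch: respondents with any populated "age at diagnosis" column,
# counted by their DIQ010 response (1 = Yes, 2 = No)
DIQ %>%
  filter(!is.na(DIQ040Q) | !is.na(DID040Q) | !is.na(DID040)) %>%
  count(DIQ010)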
4.3 Features and Targets Vary by Prediction Task.
If we instead chose a prediction task on “depression”, then our features, targets, and potential data leakages would all change.
Some features will be easier to define than others, as we shall see here. Above, it was relatively easy to create a flag for DIABETES. However, if we consider a similar but related data point, the age at which the member first learned of their diagnosis, it is more complicated:
4.6.1 Analytic Notes on DIQ
Figure: Analytic Notes from the NHANES DIQ documentation
As per the Analytic Notes on the DIQ table, the “Age at Diabetes” has been mapped to different source variables over the years that the survey has been deployed. From the notes it appears that AGE_AT_DIAG_DM2 is recorded across DIQ040G, DIQ040Q, DID040G, DID040Q, DID040, depending on the yr_range.
We might want to review each of the ranges in the above table to check our understanding of the data:
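For instance, a quick peek at the rows where DIQ040Q (the column documented for the earliest years) is populated might look like the sketch below; restricting to the relevant survey cycles could be layered on via yr_range or the METADATA table introduced shortly.
Code
DIQ %>%
  select(SEQN, yr_range, DIQ040G, DIQ040Q, DID040G, DID040Q, DID040) %>%
  filter(!is.na(DIQ040Q)) %>%   # rows where the 1991-2000 age column is populated
  glimpse()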
So when yr_range falls within 1991-2000, AGE_AT_DIAG_DM2 will be equal to DIQ040Q. When DIQ040Q is between 1 and 84, that value is the age; when DIQ040Q is 77777, 99999, or missing, we do not know the member’s age, so we might as well classify all of these as missing.
4.6.3 DIQ : Age at Diabetes : 2001-2004
The second block of years spans multiple yr_range values. We might make use of the METADATA table to assist us in querying the DIQ table.
For instance, notice the METADATA table has numeric start_yr and end_yr columns on the same row as the yr_range value, which also appears on the DIQ table:
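A quick look at those METADATA columns for the valid DIQ tables (a sketch, assuming METADATA is available as a lazy table, as in the query that follows):
Code
METADATA %>%
  filter(Valid_Table == 1) %>%     # keep only validated tables
  filter(Table_Name == 'DIQ') %>%  # restrict to the DIQ survey table
  select(yr_range, start_yr, end_yr, Table_Name)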
So we can utilize the start_yr and end_yr of the METADATA table in relation to the DIQ table:
Code
METADATA %>%
  filter(Valid_Table == 1) %>%                   # where Valid_Table = 1
  filter(Table_Name == "DIQ") %>%                # where Table_Name = 'DIQ'
  filter(2001 <= start_yr & end_yr <= 2004) %>%  # where 2001 <= start_yr AND end_yr <= 2004
  select(yr_range) %>%                           # select yr_range
  inner_join(DIQ) %>%                            # join DIQ
  select(SEQN, yr_range, DIQ040G, DIQ040Q, DID040G, DID040Q, DID040) %>%  # columns of interest
  filter(!(is.na(DIQ040G) & is.na(DIQ040Q) & is.na(DID040G) & is.na(DID040Q) & is.na(DID040))) %>%  # at least one of these columns should be non-missing
  glimpse()
When 2001 <= start_yr & end_yr <= 2004 then DID040Q is the age number.
We have a couple more to check that are very similar to this.
We have two options as to how to proceed: 1. we can copy the above code twice and modify each copy, OR 2. we can make a helper function:
4.7 AGE_AT_DIAG_VIEW helper function
We might want to standardize the above view as an R function.
Below we take the inputs my_start_year and my_end_year and query the METADATA table for all valid DIQ year ranges between those values:
Code
AGE_AT_DIAG_VIEW <- function(my_start_year, my_end_year){

  if(is.null(my_start_year) | is.null(my_end_year)){
    print("ERROR : my_start_year and my_end_year must be numeric int \n")
  }

  tmp <- METADATA %>%
    filter(Valid_Table == 1) %>%
    filter(Table_Name == "DIQ") %>%
    filter(my_start_year <= start_yr & end_yr <= my_end_year) %>%
    select(yr_range) %>%
    inner_join(DIQ) %>%
    select(SEQN, yr_range, DIQ040G, DIQ040Q, DID040G, DID040Q, DID040) %>%
    filter(!(is.na(DIQ040G) & is.na(DIQ040Q) & is.na(DID040G) & is.na(DID040Q) & is.na(DID040)))

  return(tmp)
}
Notice that writing this function amounts to replicating the code one time with the following changes:
create a function called AGE_AT_DIAG_VIEW with AGE_AT_DIAG_VIEW <- function(my_start_year, my_end_year){
we added some error checks on the inputs my_start_year and my_end_year
we added tmp assignment to our dplyr string we found above
replace 2001 with my_start_year
replace 2004 with my_end_year
we return tmp and close the function definition.
First, let’s test it against some output we are already familiar with:
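A minimal check, using the function defined above: the first call should reproduce the 2001-2004 view we built by hand, and the second covers the next block of survey years.
Code
AGE_AT_DIAG_VIEW(2001, 2004) %>%   # should match the hand-written 2001-2004 query
  glimpse()

AGE_AT_DIAG_VIEW(2005, 2008) %>%   # the next block of survey years
  glimpse()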
When 2005 <= start_yr & end_yr <= 2008 then DID040 is the age number.
4.7.2 DIQ : Age at Diabetes : 2009-2018
Code
AGE_AT_DIAG_VIEW(2009 , 2018)
Joining with `by = join_by(yr_range)`
# Source: SQL [?? x 7]
# Database: sqlite 3.46.0 [C:\Users\jkyle\Documents\GitHub\Jeff_R_Data_Wrangling\DATA\sql_db\NHANES.sqlite]
SEQN yr_range DIQ040G DIQ040Q DID040G DID040Q DID040
<dbl> <chr> <dbl> <dbl> <dbl> <dbl> <dbl>
1 51628 2009-2010 NA NA NA NA 56
2 51635 2009-2010 NA NA NA NA 70
3 51643 2009-2010 NA NA NA NA 34
4 51668 2009-2010 NA NA NA NA 25
5 51690 2009-2010 NA NA NA NA 55
6 51702 2009-2010 NA NA NA NA 35
7 51707 2009-2010 NA NA NA NA 61
8 51711 2009-2010 NA NA NA NA 51
9 51728 2009-2010 NA NA NA NA 27
10 51730 2009-2010 NA NA NA NA 65
# ℹ more rows
When 2009 <= start_yr & end_yr <= 2018 then DID040 is the age number.
4.8 Define AGE_AT_DIAG_DM2
Now that we have confirmed each of the columns that contain the age at which the member reports first having diabetes, we are one step closer to defining the feature.
Figure: Age at Diabetes Diagnosis, Multiple Sources
We know each of the age columns has ranges of valid and invalid values that shift year-to-year. We believe the following code can account for many of the individual data issues, without great loss of data quality, until a more thorough review of each table can be performed:
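A sketch of that query is given below; it assumes the METADATA and DIQ lazy tables used above, encodes the year-to-column mapping we just confirmed, and stores the result under the (illustrative) name AGE_AT_DIAG_DM2_TBL.
Code
AGE_AT_DIAG_DM2_TBL <- METADATA %>%
  filter(Valid_Table == 1) %>%
  filter(Table_Name == 'DIQ') %>%
  select(yr_range, start_yr, end_yr) %>%
  left_join(DIQ) %>%
  select(SEQN, yr_range, start_yr, end_yr, DIQ040G, DIQ040Q, DID040G, DID040Q, DID040) %>%
  # map the year-specific source columns into one interim column
  mutate(AGE_AT_DIAG_DM2_1 = case_when(
    1991 <= start_yr & end_yr <= 2000 ~ DIQ040Q,
    2001 <= start_yr & end_yr <= 2004 ~ DID040Q,
    2005 <= start_yr & end_yr <= 2018 ~ DID040)) %>%
  # apply the validity rules described below
  mutate(AGE_AT_DIAG_DM2 = case_when(
    is.na(AGE_AT_DIAG_DM2_1) ~ NA,
    AGE_AT_DIAG_DM2_1 < 1 | 84 < AGE_AT_DIAG_DM2_1 ~ NA,
    79 < AGE_AT_DIAG_DM2_1 & AGE_AT_DIAG_DM2_1 < 85 ~ 82,
    0 < AGE_AT_DIAG_DM2_1 & AGE_AT_DIAG_DM2_1 < 80 ~ AGE_AT_DIAG_DM2_1)) %>%
  select(SEQN, AGE_AT_DIAG_DM2)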
In the first part of the query, from METADATA down to select(SEQN, yr_range, start_yr, end_yr, DIQ040G, DIQ040Q, DID040G, DID040Q, DID040) %>%, we are using the information we gathered from our helper function.
In the next mutate statement, we use that information to map DIQ040Q, DID040Q, and DID040 into AGE_AT_DIAG_DM2_1 according to the corresponding start_yr and end_yr.
In the next mutate statement we make some educated guesses about the valid ranges of “Age at diagnosis”:
If it’s missing, it’s missing.
If it’s less than 1 or greater than 84, it’s missing.
If the value is greater than 79 and less than 85, we will just average out the errors over time by using mean(79:85) = 82.
Lastly, if the value is greater than 0 but less than 80, we will assume it was entered as the age at diagnosis of diabetes.
Finally, we select only the information we need to avoid confusion.
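A quick sanity check on the result (again a sketch, using the AGE_AT_DIAG_DM2_TBL defined above):
Code
AGE_AT_DIAG_DM2_TBL %>%
  filter(!is.na(AGE_AT_DIAG_DM2)) %>%   # keep only rows with a usable age at diagnosis
  glimpse()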
An analytic data-set primarily consists of columns of targets and features to analyze the data or complete a prediction task.
Note there may be other useful columns that are assigned other roles in the predictive modeling tasks.
surrogate keys or other identifiers
example : SEQN is the surrogate key for a patient ID - each respondent has a unique SEQN
grouping or partitioning columns
example : yr_range the survey is taken in year ranges and there are some shifts year-to-year in column information
raw or source columns are also permitted
we might want to experiment with various definitions of a feature “on-the-fly”, so it might be helpful to keep some source information while we iterate on our analysis set.
Ultimately, the data science team is responsible for ensuring that all data inputs into predictive models will be read as valid features and will not heavily leak into the model prediction task. The goal of having analytic data-sets is to make the analysis easier for researchers to understand and to make the overall data science pipeline easier to iterate on.
4.12 Example
The Table A_DATA_TBL below is a simple example of an analytic data-set:
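The counts below can be produced with a grouped tally. This is a sketch; it assumes A_DATA_TBL is the lazy analytic table joining the Gender and Age features to the DIABETES and AGE_AT_DIAG_DM2 targets by SEQN.
Code
A_DATA_TBL %>%
  group_by(DIABETES, Gender) %>%   # one group per target/feature combination
  tally()                          # count respondents in each group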
# Source: SQL [6 x 3]
# Database: sqlite 3.46.0 [C:\Users\jkyle\Documents\GitHub\Jeff_R_Data_Wrangling\DATA\sql_db\NHANES.sqlite]
DIABETES Gender n
<dbl> <chr> <int>
1 NA Female 2864
2 NA Male 2905
3 0 Female 45188
4 0 Male 43552
5 1 Female 3371
6 1 Male 3436
We can also continue to use dplyr after a summary table. Here we use pivot_wider, which is in the tidyr library: pivot_wider creates new column names from the discrete values in the Gender column (names_from = Gender), and the values that populate those entries come from the n produced by the tally, specified with values_from = n:
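For example:
Code
A_DATA_TBL %>%
  group_by(DIABETES, Gender) %>%
  tally() %>%
  tidyr::pivot_wider(names_from = Gender,  # new columns named after the Gender values
                     values_from = n)      # populated with the tally counts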
Up until this point we have utilized dplyr and dbplyr to interface with a sqlite file.
Ultimately, R will be somewhat limited in what it can do using only SQL connections to databases. Different databases have different functionalities and slight variants of SQL, and what you are able to accomplish from a Teradata database might be different from what you can accomplish using a Spark or sqlite connection.
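The query that dbplyr has built up for A_DATA_TBL along the way can be printed with the show_query() function; the SQL shown below is that generated query:
Code
A_DATA_TBL %>%
  show_query()   # print the translated SQL without executing it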
<SQL>
SELECT `LHS`.*, `DIABETES`, `AGE_AT_DIAG_DM2`
FROM (
SELECT
`SEQN`,
`yr_range`,
CASE WHEN ((`RIDAGEYR` IS NULL)) THEN (`RIDAGEMN` / 12.0) WHEN NOT ((`RIDAGEYR` IS NULL)) THEN `RIDAGEYR` END AS `Age`,
`Gender`
FROM (
SELECT
`DEMO`.*,
CASE
WHEN (`RIAGENDR` = 2.0) THEN 'Female'
WHEN (`RIAGENDR` = 1.0) THEN 'Male'
WHEN (`RIAGENDR` != 2.0 OR `RIAGENDR` != 1.0) THEN NULL
END AS `Gender`
FROM `DEMO`
) AS `q01`
) AS `LHS`
LEFT JOIN (
SELECT `DEMO`.`SEQN` AS `SEQN`, `DIABETES`, `AGE_AT_DIAG_DM2`
FROM `DEMO`
LEFT JOIN (
SELECT
`SEQN`,
CASE
WHEN (`DIQ010` = 1.0) THEN 1.0
WHEN (`DIQ010` = 2.0) THEN 0.0
WHEN ((`DIQ010` != 1.0 OR `DIQ010` != 2.0)) THEN NULL
END AS `DIABETES`
FROM `DIQ`
) AS `...2`
ON (`DEMO`.`SEQN` = `...2`.`SEQN`)
LEFT JOIN (
SELECT
`SEQN`,
CASE
WHEN ((`AGE_AT_DIAG_DM2_1` IS NULL)) THEN NULL
WHEN (`AGE_AT_DIAG_DM2_1` < 1.0 OR 84.0 < `AGE_AT_DIAG_DM2_1`) THEN NULL
WHEN (79.0 < `AGE_AT_DIAG_DM2_1` AND `AGE_AT_DIAG_DM2_1` < 85.0) THEN 82.0
WHEN (0.0 < `AGE_AT_DIAG_DM2_1` AND `AGE_AT_DIAG_DM2_1` < 80.0) THEN `AGE_AT_DIAG_DM2_1`
END AS `AGE_AT_DIAG_DM2`
FROM (
SELECT
`q01`.*,
CASE
WHEN (1991.0 <= `start_yr` AND `end_yr` <= 2000.0) THEN `DIQ040Q`
WHEN (2001.0 <= `start_yr` AND `end_yr` <= 2004.0) THEN `DID040Q`
WHEN (2005.0 <= `start_yr` AND `end_yr` <= 2018.0) THEN `DID040`
END AS `AGE_AT_DIAG_DM2_1`
FROM (
SELECT
`SEQN`,
`LHS`.*,
`DIQ040G`,
`DIQ040Q`,
`DID040G`,
`DID040Q`,
`DID040`
FROM (
SELECT `yr_range`, `start_yr`, `end_yr`
FROM `METADATA`
WHERE (`Valid_Table` = 1.0) AND (`Table_Name` = 'DIQ')
) AS `LHS`
LEFT JOIN `DIQ`
ON (`LHS`.`yr_range` = `DIQ`.`yr_range`)
) AS `q01`
) AS `q01`
) AS `...3`
ON (`DEMO`.`SEQN` = `...3`.`SEQN`)
) AS `RHS`
ON (`LHS`.`SEQN` = `RHS`.`SEQN`)
4.14.1 collect what you need
Often, we need to collect data into the R environment to perform more detailed analysis than the SQL back-end is capable of. Unless you are working with sparklyr, many R modeling packages will require the input data to be an R data-frame.
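For example, collecting the analytic table into memory might look like this (a sketch; A_DATA is simply an illustrative name for the resulting local tibble):
Code
A_DATA <- A_DATA_TBL %>%
  collect()       # pull the query results from SQLite into an R tibble

glimpse(A_DATA)   # now a local data-frame/tibble, not a lazy table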
Remark: thus far my attempts to produce a bad window function example have been fruitless; however, I have found that this sqlite connection does not support all join types:
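For example, the sketch below may fail, depending on the SQLite and dbplyr versions, because FULL OUTER JOIN is not translated everywhere; DM2_TBL is a hypothetical lazy table holding the DIABETES flag keyed by SEQN.
Code
# DM2_TBL is hypothetical here: a lazy table of SEQN plus the DIABETES flag.
# full_join() may error if the back-end cannot translate a FULL OUTER JOIN.
ERROR <- DM2_TBL %>%
  full_join(AGE_AT_DIAG_DM2_TBL)

ERROR %>% glimpse()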
We can also build plots from the connection without having to download all of the data; for instance, we could do something like:
Code
A_DATA_TBL %>%
  filter(!is.na(DIABETES)) %>%
  select(DIABETES, Age) %>%                          # select needed variables
  collect() %>%                                      # collect only the data we want
  mutate(DIABETES_factor = as.factor(DIABETES)) %>%  # as.factor is an R function; SQLite has no "factor" data type
  ggplot(aes(x = Age, fill = DIABETES_factor)) +
  geom_density() +
  labs(title = "Density Plot - Diabetes by Age")