30 Principal Component Analysis

30.1 Take a quick look at the dataset

library(tidyverse)

── Attaching core tidyverse packages ──────────────────────── tidyverse 2.0.0 ──
✔ dplyr     1.1.4     ✔ readr     2.1.5
✔ forcats   1.0.0     ✔ stringr   1.5.1
✔ ggplot2   3.5.1     ✔ tibble    3.2.1
✔ lubridate 1.9.3     ✔ tidyr     1.3.1
✔ purrr     1.0.2     
── Conflicts ────────────────────────────────────────── tidyverse_conflicts() ──
✖ dplyr::filter() masks stats::filter()
✖ dplyr::lag()    masks stats::lag()
ℹ Use the conflicted package (<http://conflicted.r-lib.org/>) to force all conflicts to become errors

library(ggbiplot)
glimpse(mtcars)

Rows: 32
Columns: 11
$ mpg  <dbl> 21.0, 21.0, 22.8, 21.4, 18.7, 18.1, 14.3, 24.4, 22.8, 19.2, 17.8,…
$ cyl  <dbl> 6, 6, 4, 6, 8, 6, 8, 4, 4, 6, 6, 8, 8, 8, 8, 8, 8, 4, 4, 4, 4, 8,…
$ disp <dbl> 160.0, 160.0, 108.0, 258.0, 360.0, 225.0, 360.0, 146.7, 140.8, 16…
$ hp   <dbl> 110, 110, 93, 110, 175, 105, 245, 62, 95, 123, 123, 180, 180, 180…
$ drat <dbl> 3.90, 3.90, 3.85, 3.08, 3.15, 2.76, 3.21, 3.69, 3.92, 3.92, 3.92,…
$ wt   <dbl> 2.620, 2.875, 2.320, 3.215, 3.440, 3.460, 3.570, 3.190, 3.150, 3.…
$ qsec <dbl> 16.46, 17.02, 18.61, 19.44, 17.02, 20.22, 15.84, 20.00, 22.90, 18…
$ vs   <dbl> 0, 0, 1, 1, 0, 1, 0, 1, 1, 1, 1, 0, 0, 0, 0, 0, 0, 1, 1, 1, 1, 0,…
$ am   <dbl> 1, 1, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 1, 1, 0, 0,…
$ gear <dbl> 4, 4, 4, 3, 3, 3, 3, 4, 4, 4, 4, 3, 3, 3, 3, 3, 3, 4, 4, 4, 3, 3,…
$ carb <dbl> 4, 4, 1, 1, 2, 1, 4, 2, 2, 4, 4, 3, 3, 3, 4, 4, 4, 1, 2, 1, 1, 2,…

30.2 Compute the principal components

Here, we set center and scale to ‘TRUE’ so R will center and scale our data (i.e., standardize our data).

You can see there are 9 principal components, with each explaining a proportion of the variability in the data. For example, PC1 explains 62.8% of the total variance: that is, almost two-thirds of the variance in the entire dataset can be explained by the first principal component.

cars.pca <- prcomp(mtcars[,c(1:7,10:11)], 
                     center=TRUE,
                     scale=TRUE)
summary(cars.pca)

Importance of components:
                          PC1    PC2     PC3     PC4     PC5     PC6     PC7
Standard deviation     2.3782 1.4429 0.71008 0.51481 0.42797 0.35184 0.32413
Proportion of Variance 0.6284 0.2313 0.05602 0.02945 0.02035 0.01375 0.01167
Cumulative Proportion  0.6284 0.8598 0.91581 0.94525 0.96560 0.97936 0.99103
                          PC8     PC9
Standard deviation     0.2419 0.14896
Proportion of Variance 0.0065 0.00247
Cumulative Proportion  0.9975 1.00000

30.3 Plot Principal Components

The biplot is used to visualize principal components. This plots the first and second principal components. The closer the variable is the the center, the less contribution that variable has to either principal component.

You can see that hp, cyl, and disp contribute to the first principal component because they are farthest to the right. But, this plot isn’t informative because we just have the relationship of each variable to the axes.

ggbiplot(cars.pca)

30.3.1 Add labels to the points

Here, we added labels to the points to see the car to which each point corresponds.

ggbiplot(cars.pca, labels=rownames(mtcars))

30.3.2 Add country of origin groupings

We can also add a group for origin and you will now see that the American-made cars cluster to the right and are generally characterized by high values of cyl (# cylinders), disp (displacement), and wt (weight).

Japanese cars tend to be characterized by high mpg (miles per gallon).

origin <- c(rep("Japan", 3), rep("US",4), rep("Europe", 7),rep("US",3), "Europe", rep("Japan", 3), rep("US",4), rep("Europe", 3), "US", rep("Europe", 3))

ggbiplot(cars.pca,ellipse=TRUE,  labels=rownames(mtcars), groups=origin)

30.4 How can we determine the number of principal components?

We aim to find the components with the maximum variance so we can retain as much information about the original dataset as possible.

To determine the number of principal components to retain in our analysis we need to compute the proportion of variance explained.

We can plot the cumulative proportion of variance explained in a scree plot to determine the number of principal components to retain.

# Pull SD
sd.cars <- cars.pca$sdev

# Compute variance
var.cars <- sd.cars^2

# Proportion of variance explained
prop.cars <- var.cars/sum(var.cars)

# Plot the cumulative proportion of variance explained
plot(cumsum(prop.cars), 
     xlab='Principal components', 
     ylab='Proportion of variance explained', 
     type='b')