Standard disclaimer: I’m not a statistician, and I’m definitely not confident enough to go in-depth into the mathematics of the different algorithms. Instead, I’ll run three common latent variable modeling/clustering methods and show how the results differ when they’re applied to the same data. Wherever I feel confident, I will also elaborate on the underlying mathematical principles and concepts. It’s a short post, and I’m sure there are levels of nuance and complexity I’ve missed. Please let me know if you think I’ve committed a major oversight.
To show the methods I’ll use a small dataset on cervical cancer risk factors with only 72 entries and 20 variables. I downloaded this dataset from the UCI Machine Learning Repository; it was based on this paper by Sobar et al. The dataset consists of 19 risk factors for cervical cancer and 1 outcome variable (a binary yes/no variable for cervical cancer).
We’ll load a few packages first, {tidyverse}
for data wrangling and visualization, {fastICA}
to get access to an ICA algorithm, and {patchwork}
to combine different plots into one.
library(tidyverse)
library(fastICA)
library(patchwork)
set.seed(21)
theme_set(theme_minimal())
We’ll load the data into R, clean up the variable names, and convert the outcome variable (ca_cervix) to a factor. We’ll then have a look at the dataset using a summary function from {skimr}, which gives us summary statistics and basic histograms of the different variables.
data <- read_csv("./data/sobar-72.csv") |>
janitor::clean_names() |>
mutate(ca_cervix = as_factor(ca_cervix))
skim_summ <- skimr::skim_with(base = skimr::sfl())
skim_summ(data)
Data summary

Name | data |
Number of rows | 72 |
Number of columns | 20 |
Column type frequency: | |
factor | 1 |
numeric | 19 |
Group variables | None |
Variable type: factor
skim_variable | ordered | n_unique | top_counts |
---|---|---|---|
ca_cervix | FALSE | 2 | 0: 51, 1: 21 |
Variable type: numeric
skim_variable | mean | sd | p0 | p25 | p50 | p75 | p100 | hist |
---|---|---|---|---|---|---|---|---|
behavior_sexual_risk | 9.67 | 1.19 | 2 | 10.00 | 10.0 | 10.00 | 10 | ▁▁▁▁▇ |
behavior_eating | 12.79 | 2.36 | 3 | 11.00 | 13.0 | 15.00 | 15 | ▁▁▂▅▇ |
behavior_personal_hygine | 11.08 | 3.03 | 3 | 9.00 | 11.0 | 14.00 | 15 | ▁▂▆▆▇ |
intention_aggregation | 7.90 | 2.74 | 2 | 6.00 | 10.0 | 10.00 | 10 | ▂▁▂▁▇ |
intention_commitment | 13.35 | 2.37 | 6 | 11.00 | 15.0 | 15.00 | 15 | ▁▁▂▁▇ |
attitude_consistency | 7.18 | 1.52 | 2 | 6.00 | 7.0 | 8.00 | 10 | ▁▁▃▇▂ |
attitude_spontaneity | 8.61 | 1.52 | 4 | 8.00 | 9.0 | 10.00 | 10 | ▁▂▁▅▇ |
norm_significant_person | 3.12 | 1.85 | 1 | 1.00 | 3.0 | 5.00 | 5 | ▇▁▂▁▇ |
norm_fulfillment | 8.49 | 4.91 | 3 | 3.00 | 7.0 | 14.00 | 15 | ▇▁▁▂▆ |
perception_vulnerability | 8.51 | 4.28 | 3 | 5.00 | 8.0 | 13.00 | 15 | ▇▅▅▂▇ |
perception_severity | 5.39 | 3.40 | 2 | 2.00 | 4.0 | 9.00 | 10 | ▇▂▁▂▅ |
motivation_strength | 12.65 | 3.21 | 3 | 11.00 | 14.0 | 15.00 | 15 | ▁▁▁▂▇ |
motivation_willingness | 9.69 | 4.13 | 3 | 7.00 | 11.0 | 13.00 | 15 | ▅▃▃▃▇ |
social_support_emotionality | 8.10 | 4.24 | 3 | 3.00 | 9.0 | 11.25 | 15 | ▇▂▃▃▅ |
social_support_appreciation | 6.17 | 2.90 | 2 | 3.75 | 6.5 | 9.00 | 10 | ▇▃▅▇▇ |
social_support_instrumental | 10.38 | 4.32 | 3 | 6.75 | 12.0 | 14.25 | 15 | ▅▂▂▃▇ |
empowerment_knowledge | 10.54 | 4.37 | 3 | 7.00 | 12.0 | 15.00 | 15 | ▃▁▂▂▇ |
empowerment_abilities | 9.32 | 4.18 | 3 | 5.00 | 10.0 | 13.00 | 15 | ▇▃▅▃▇ |
empowerment_desires | 10.28 | 4.48 | 3 | 6.75 | 11.0 | 15.00 | 15 | ▅▁▃▃▇ |
Now let’s select only the variables containing the risk factors (called features from here on). We’ll also scale all the features to have a mean of 0 and a standard deviation of 1 using the scale() function. We can check what one of the scaled variables looks like using the summary() function.
data_features <- data |>
select(-ca_cervix) |>
mutate(across(everything(), ~ scale(.x)))
summary(data_features$attitude_consistency)
V1
Min. :-3.4019
1st Qu.:-0.7752
Median :-0.1186
Mean : 0.0000
3rd Qu.: 0.5381
Max. : 1.8514
So now we have a data frame with 72 entries and 19 normalized columns, each representing a feature that may or may not predict cervical cancer. We can create a correlation matrix to visualize the degree of correlation between the different features. For this we simply run cor()
on the data frame with the features, transform the output to a data frame in long format and then visualize it using ggplot()
and geom_tile()
.
cor(data_features) |>
as_tibble() |>
mutate(feature_y = names(data_features)) |>
pivot_longer(
cols = -feature_y,
names_to = "feature_x",
values_to = "correlation"
) |>
mutate(feature_y = fct_rev(feature_y)) |>
ggplot(aes(x = feature_x, y = feature_y, fill = correlation)) +
geom_tile() +
labs(
x = NULL,
y = NULL
) +
scico::scale_fill_scico(palette = "berlin", limits = c(-1, 1)) +
coord_fixed() +
theme(axis.text.x = element_text(hjust = 1, angle = 30))
What we’re looking for here are little blocks along the diagonal. This would indicate variables that correlate with each other. We can easily spot three blocks, and we can quietly expect these variables to cluster together later, although perhaps not necessarily in the three blocks visible here.
What we showed here was the correlation matrix; if we replaced the cor() function with the cov() function we’d get the covariance matrix, which in this case would look practically identical because we scaled every feature to have a standard deviation of 1.
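As a quick check of that claim (just a sketch, not part of the original analysis), we can ask R whether the two matrices are numerically equal on our standardized features:

# Because every feature was scaled to sd = 1, covariance and correlation coincide.
all.equal(cov(data_features), cor(data_features))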
Let’s now dive into the approaches we’ll use. We’ll discuss three here, and if there’s space, I might briefly add a fourth one. We’ll definitely discuss factor analysis, PCA, and ICA, and we’ll start with the first.
For factor analysis and ICA we need to provide the number of factors we want to extract. We’ll use the same number of factors/components throughout this tutorial (also for PCA). Selecting the optimal number of factors/components is a fairly arbitrary process which I won’t go into now. In short, before writing this I ran PCA and a tool called icasso. Icasso runs an ICA algorithm a number of times and provides several measures of the stability of the clusters at different thresholds, so this approach is very data-driven. A more common and easier way to get some data-driven insight into the optimal number of clusters is the “elbow” method, using PCA, the eigenvalues of the components, and the cumulative variance explained by the components (we’ll show those later). However, in the end you should also look at the weight matrix at the different cluster thresholds, and then it becomes a fairly arbitrary process. In this case, icasso showed that 7 components was good (but not the best), the weight matrix looked okay, and 7 components explained more than 80% of the variance in the PCA. I went for 7 components in the end also because it served the purpose of this blog post quite well, but I think different thresholds are valid and you could make a case for others based on the data, the interpretation of the weight matrix, and so on. This process is a bit of an art and a science combined.
n_comps <- 7
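For what it’s worth, here’s a minimal sketch of the cumulative-variance check mentioned above (pca_check and cum_var are just throwaway names for this illustration, and this is no substitute for the icasso procedure; we’ll look at the scree plot properly in the PCA section):

pca_check <- prcomp(data_features)
# Proportion of total variance captured by the first 1, 2, 3, ... components
cum_var <- cumsum(pca_check$sdev^2) / sum(pca_check$sdev^2)
# Smallest number of components that captures at least 80% of the variance
which(cum_var >= 0.80)[1]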
So, let’s run the factor analysis. Technically speaking, factor analysis isn’t a clustering method but rather a latent variable modeling method (source). The primary aim of a factor analysis is the reconstruction of correlations/covariances between variables. Maximizing explained variance of the factors is only a secondary aim and we’ll get to why that is relevant in the PCA section.
We’ve established we want 7 factors. Factor analysis is implemented in R through the factanal() function (see here). This function fits a common factor model using the maximum likelihood method. We’ll simply provide it with our data frame and the number of factors we want to extract, and we’ll ask it to return Bartlett’s weighted least-squares scores as well. We can apply a “rotation” to the factors to reduce the complexity of the factor loadings and make them easier to interpret; here we’ll use the varimax option, which maximizes the sum of the variances of the squared loadings. Then we’ll print the model to see the results (sorted so the loadings are ordered). The output may be a bit long.
fa_model <- factanal(
data_features,
factors = n_comps,
scores = "Bartlett",
rotation = "varimax"
)
print(fa_model, sort = TRUE)
Call:
factanal(x = data_features, factors = n_comps, scores = "Bartlett", rotation = "varimax")
Uniquenesses:
behavior_sexual_risk behavior_eating
0.749 0.005
behavior_personal_hygine intention_aggregation
0.287 0.504
intention_commitment attitude_consistency
0.797 0.805
attitude_spontaneity norm_significant_person
0.758 0.392
norm_fulfillment perception_vulnerability
0.129 0.170
perception_severity motivation_strength
0.143 0.643
motivation_willingness social_support_emotionality
0.354 0.005
social_support_appreciation social_support_instrumental
0.005 0.251
empowerment_knowledge empowerment_abilities
0.005 0.166
empowerment_desires
0.207
Loadings:
Factor1 Factor2 Factor3 Factor4 Factor5 Factor6
motivation_willingness 0.655 0.387 0.115 -0.135
social_support_emotionality 0.843 0.258 -0.196
social_support_appreciation 0.849 -0.138 -0.487
social_support_instrumental 0.816 -0.211 0.140
empowerment_knowledge 0.798 0.347 0.141 0.433
empowerment_abilities 0.861 0.118 0.188 0.201
empowerment_desires 0.814 0.345
norm_significant_person -0.167 0.724 0.169
norm_fulfillment 0.904 0.122
perception_vulnerability 0.124 0.879 0.116
perception_severity 0.903
behavior_personal_hygine 0.289 0.196 0.714 0.147 -0.154 -0.102
intention_aggregation 0.664 0.208
behavior_eating 0.987
behavior_sexual_risk 0.159 0.135 -0.150 0.427
intention_commitment 0.174 0.111 0.394
attitude_consistency 0.181 -0.103
attitude_spontaneity -0.149 0.302 0.335
motivation_strength 0.244 0.111 0.477 -0.195 0.101
Factor7
motivation_willingness -0.135
social_support_emotionality -0.413
social_support_appreciation
social_support_instrumental
empowerment_knowledge 0.172
empowerment_abilities
empowerment_desires
norm_significant_person -0.146
norm_fulfillment 0.153
perception_vulnerability 0.136
perception_severity 0.153
behavior_personal_hygine 0.162
intention_aggregation
behavior_eating
behavior_sexual_risk
intention_commitment
attitude_consistency 0.371
attitude_spontaneity
motivation_strength
Factor1 Factor2 Factor3 Factor4 Factor5 Factor6 Factor7
SS loadings 4.798 3.128 1.689 1.208 0.652 0.631 0.520
Proportion Var 0.253 0.165 0.089 0.064 0.034 0.033 0.027
Cumulative Var 0.253 0.417 0.506 0.570 0.604 0.637 0.665
Test of the hypothesis that 7 factors are sufficient.
The chi square statistic is 62.23 on 59 degrees of freedom.
The p-value is 0.362
We may be tempted to immediately look at the p-value at the end of the output. This p-value denotes whether the hypothesis of perfect fit can be rejected. If it is below 0.05 or 0.01, we reject that hypothesis, meaning we should probably try a different number of factors or a different method. In this case the p-value is larger than 0.05, so we cannot reject the hypothesis of perfect fit.
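As a small aside, the same chi-square statistic, degrees of freedom, and p-value can also be pulled straight from the fitted object, which is convenient if you want to compare the fit for several numbers of factors programmatically:

# These fields mirror the test printed at the bottom of the factanal() output.
fa_model$STATISTIC
fa_model$dof
fa_model$PVAL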
The “Loadings” section of the results shows a makeshift weight matrix, but to interpret these results more easily, let’s create a plot of the weight matrix. We’ll get the results from the factor analysis model we created earlier using the tidy() function from the {broom} package and convert them to long format. We’ll then create a weight matrix plot much in the same way we did earlier.
fa_weight_matrix <- broom::tidy(fa_model) |>
pivot_longer(starts_with("fl"),
names_to = "factor", values_to = "loading"
)
fa_loading_plot <- ggplot(
fa_weight_matrix,
aes(x = factor, y = variable, fill = loading)
) +
geom_tile() +
labs(
title = "FA loadings",
x = NULL,
y = NULL
) +
scico::scale_fill_scico(palette = "cork", limits = c(-1, 1)) +
coord_fixed(ratio = 1 / 2)
print(fa_loading_plot)
Here we can more easily see that there are two strong clusters in Factor 1 and Factor 2. Factor 3 captures three or four variables to a lesser degree, and Factor 4 mostly captures just a single variable ("behavior_eating"). Interpretation of these factors is subjective, so it is perhaps best done in collaboration with others.
Lastly, I think it would be interesting to see how the different factors relate to each other. We’ll take the Bartlett’s scores, correlate them with each other, and create a correlation matrix like before.
fa_model$scores |>
cor() |>
data.frame() |>
rownames_to_column("factor_x") |>
pivot_longer(
cols = -factor_x,
names_to = "factor_y", values_to = "correlation"
) |>
ggplot(aes(x = factor_x, y = factor_y, fill = correlation)) +
geom_tile() +
geom_text(aes(label = round(correlation, 4)), color = "white") +
labs(
title = "Correlation between FA scores",
x = NULL,
y = NULL
) +
scico::scale_fill_scico(palette = "berlin", limits = c(-1, 1)) +
coord_equal()
We see little correlation overall, but certainly a few non-zero correlations. In particular, Factor 5 and Factor 7 seem to correlate to some extent, and there are a few others with minor correlations, e.g. Factor 1 and Factor 3, as well as Factor 5 and Factor 6. We’ll compare this later with the other algorithms.
Principal component analysis can reliably be classified as a clustering method (as opposed to factor analysis) and it is a very common dimensionality reduction approach. The primary aim of PCA is to maximize the variance captured while creating uncorrelated (orthogonal) components. Some methods associated with PCA have been borrowed from factor analysis (e.g. the scree plot and Kaiser’s rule, source). A paper by Ian Jolliffe, professor of mathematics at the University of Essex, published in the Encyclopedia of Statistics in Behavioral Science uses fairly strong terms to make the distinction clear:
This is partially caused by a number of widely used software packages treating PCA as a special case of factor analysis, which it most certainly is not. There are several technical differences between PCA and factor analysis, but the most fundamental difference is that factor analysis explicitly specifies a model relating the observed variables to a smaller set of underlying unobservable factors. Although some authors express PCA in the framework of a model, its main application is as a descriptive, exploratory technique, with no thought of an underlying model. [source]
I can recommend this paper as a great primer to PCA methods. It goes over a few concepts very relevant for PCA methods as well as clustering methods in general.
For the difference between prcomp() and princomp(), see this thread on CrossValidated.
Now, let’s run the PCA. In R there are two built-in functions to run a principal component analysis; here we’ll use the prcomp() function (the other being princomp(), but prcomp() is preferred for reasons beyond the scope of this post). The prcomp() function has a few options we can play with, but in my experience it works fine out of the box if you’ve normalized the data beforehand (link to documentation). So we’ll simply provide it with the data frame containing the features. In addition, we’ll also calculate the variance explained by each component. We do that by simply taking the standard deviation calculated by the PCA and squaring it, and we’ll save the result in a new field called variance.
pc_model <- prcomp(data_features)
pc_model$variance <- pc_model$sdev^2
Next we can make a simple scree plot using the variance we calculated above. We’ll create a plot with the principal components on the x-axis and the eigenvalues on the y-axis. The scree plot is a very popular way to visualize features of a PCA. In this plot the elbow is quite clearly at principal component 3, but as discussed, the scree plot is neither the best nor the only way to determine the optimal number of components. In the code below I also added a calculation of the cumulative variance (cum_variance), which shows that a little more than 80% of the variance is captured by the first 7 components, while the first 3 components combined capture only 56%.
pc_model$variance |>
as_tibble() |>
rename(eigenvalue = value) |>
rownames_to_column("comp") |>
mutate(
comp = parse_number(comp),
cum_variance = cumsum(eigenvalue) / sum(eigenvalue)
) |>
ggplot(aes(x = comp, y = eigenvalue)) +
geom_hline(yintercept = 1) +
geom_line(linewidth = 1) +
geom_point(size = 3)
We’ll also create a weight matrix again, based on the rotation matrix from the PCA, much in the same way as before. By its very nature a PCA returns as many components as there are variables put in; however, we’re interested in just the first 7 components, so we’ll select those using the filter() function.
pc_weight_matrix <- pc_model$rotation |>
data.frame() |>
rownames_to_column("variable") |>
pivot_longer(starts_with("PC"),
names_to = "prin_comp", values_to = "loading"
)
pca_loading_plot <- pc_weight_matrix |>
filter(parse_number(prin_comp) <= n_comps) |>
ggplot(aes(
x = reorder(prin_comp, parse_number(prin_comp)),
y = variable, fill = loading
)) +
geom_tile() +
labs(
title = "PCA loadings",
x = NULL,
y = NULL
) +
scico::scale_fill_scico(palette = "cork", limits = c(-1, 1)) +
coord_fixed(ratio = 1 / 2)
print(pca_loading_plot)
One thing that immediately jumps out is that PC1 and PC2 are nearly identical to Factor 1 and Factor 2 from the factor analysis. The sign might be reversed, but we can fix this visually by multiplying the loadings by -1. The more important part, though, is the other components, which look nothing like the factors from the factor analysis. With some effort you could see some overlap between Factor 3 and PC4, as well as Factor 5 and PC5, but beyond the first two factors/components, factor analysis and PCA had different ideas about how to group the variables together.
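If you wanted to line the plots up visually, a sketch of that sign flip could look like the snippet below (pc_weight_flipped is just an illustrative name, and whether PC1 and PC2 actually need flipping depends on your run):

# Flip the sign of the loadings for PC1 and PC2; the sign of a principal
# component is arbitrary, so this changes nothing about the solution itself.
pc_weight_flipped <- pc_weight_matrix |>
  mutate(loading = ifelse(prin_comp %in% c("PC1", "PC2"), -loading, loading))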
We can also make a correlation matrix for the different principal components. We’ll use the x
field and otherwise create the correlation matrix the same way as before:
pc_model$x |>
cor() |>
data.frame() |>
rownames_to_column("comp_x") |>
pivot_longer(
cols = starts_with("PC"),
names_to = "comp_y", values_to = "correlation"
) |>
filter(
parse_number(comp_x) <= n_comps,
parse_number(comp_y) <= n_comps
) |>
ggplot(aes(x = comp_x, y = comp_y, fill = correlation)) +
geom_tile() +
geom_text(aes(label = round(correlation, 4)), color = "white") +
labs(
title = "Correlation between PCs",
x = NULL,
y = NULL
) +
scico::scale_fill_scico(palette = "berlin", limits = c(-1, 1)) +
coord_equal()
What jumps out here is the apparent absence of correlation. By design the principal components are orthogonal to each other. The PCA works by finding a direction (a vector) along which the variance of the features is maximized. When it’s done that, it moves to the next direction, orthogonal to the previous one, and again captures as much of the remaining variance as possible, and so on. Since each component accounts for part of the structure that is left, the amount of variance still unaccounted for shrinks as you move along the components. Because our features are standardized, you can replace the word “variance” in these past few sentences with “correlation” or “covariance” and it would still hold up to a large degree, apart from the diagonal (again, caveats apply).
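We can also check the orthogonality claim directly (a quick sketch): the rotation matrix returned by prcomp() should be orthonormal, so its crossproduct should be, numerically, the identity matrix.

# t(rotation) %*% rotation should be the identity; showing the top-left corner.
round(crossprod(pc_model$rotation), 4)[1:5, 1:5]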
Next, we can also visualize the PCA using a plot called a “biplot”. It plots the scores on the first two components (PC1 on the x-axis and PC2 on the y-axis) and overlays arrows for the variable loadings, showing the directions along which the variance was maximized. It’s simply the biplot() function already implemented in R.
biplot(pc_model)
It isn’t very pretty, but we can roughly see the directions across the first two components, one pointing straight down and one pointing to the left.
A nice feature of PCA is that the underlying principles are rather simple, and you could calculate the eigenvectors and eigenvalues using just two functions: cov() and eigen(), which calculate the covariance matrix and the eigenvalues and -vectors of that matrix respectively. If the terms “matrix manipulations”, eigenvectors, and dot product sound scary to you, you can skip to the ICA section. We can then calculate the individual component scores (stored in the prcomp() model in the x field) by taking the matrix product of the data features and the eigenvectors. In practice that means each score is the dot product of a row of the data with an eigenvector: multiply every element of the row by the corresponding element of the eigenvector and sum the results.
cov_matrix <- cov(data_features)
eigen_model <- eigen(cov_matrix)
eigenvalues <- eigen_model$values
eigenvector <- eigen_model$vectors
pc_manual <- as.matrix(data_features) %*% eigenvector
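A couple of quick checks (again just a sketch) that this manual route agrees with prcomp(): the eigenvalues should match the squared standard deviations from earlier, and the manually computed scores should match pc_model$x up to an arbitrary sign flip per component.

all.equal(eigenvalues, pc_model$sdev^2)
# abs() because each eigenvector (and thus each column of scores) is only
# determined up to its sign.
all.equal(abs(unname(pc_manual)), abs(unname(pc_model$x)))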
Let’s look at the scree plot:
tibble(eigenvalues) |>
ggplot(aes(x = seq_along(eigenvalues), y = eigenvalues)) +
geom_hline(yintercept = 1) +
geom_line(linewidth = 1) +
geom_point(size = 3)
Looks identical to the previous one. Let’s also look at the correlation matrix between the principal components.
data.frame(pc_manual) |>
cor() |>
round(x = _, digits = 4)
X1 X2 X3 X4 X5 X6 X7 X8 X9 X10 X11 X12 X13 X14 X15 X16 X17 X18 X19
X1 1 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0
X2 0 1 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0
X3 0 0 1 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0
X4 0 0 0 1 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0
X5 0 0 0 0 1 0 0 0 0 0 0 0 0 0 0 0 0 0 0
X6 0 0 0 0 0 1 0 0 0 0 0 0 0 0 0 0 0 0 0
X7 0 0 0 0 0 0 1 0 0 0 0 0 0 0 0 0 0 0 0
X8 0 0 0 0 0 0 0 1 0 0 0 0 0 0 0 0 0 0 0
X9 0 0 0 0 0 0 0 0 1 0 0 0 0 0 0 0 0 0 0
X10 0 0 0 0 0 0 0 0 0 1 0 0 0 0 0 0 0 0 0
X11 0 0 0 0 0 0 0 0 0 0 1 0 0 0 0 0 0 0 0
X12 0 0 0 0 0 0 0 0 0 0 0 1 0 0 0 0 0 0 0
X13 0 0 0 0 0 0 0 0 0 0 0 0 1 0 0 0 0 0 0
X14 0 0 0 0 0 0 0 0 0 0 0 0 0 1 0 0 0 0 0
X15 0 0 0 0 0 0 0 0 0 0 0 0 0 0 1 0 0 0 0
X16 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 1 0 0 0
X17 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 1 0 0
X18 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 1 0
X19 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 1
Yup, also still 0 across the board. Calculating a PCA using just matrix manipulations isn’t too complicated, and it’s a fun exercise, but I’d recommend sticking to the prcomp() function, since it’s a lot simpler and offers better functionality. Now, let’s move on to the independent component analysis.
While PCA attempts to find components explaining the maximum degree of covariance or correlation, an ICA attempts to find components with maximum statistical independence. There’s a subtle distinction here: PCA components are constrained to be orthogonal and uncorrelated with each other, while ICA components are required to be statistically independent, which is a stronger condition than being merely uncorrelated. In practice, it means that ICA components will also have practically zero correlation; the main difference lies in how the components are obtained. Like with factor analysis, most ICA algorithms require you to provide the number of components up front. The FastICA algorithm we’ll use here is one ICA implementation; InfoMax and JADE are two others, and I couldn’t tell you the differences between them. FastICA is implemented in R through the fastICA() function in the eponymous {fastICA} package (link to documentation).
ica_model <- fastICA(data_features, n.comp = n_comps)
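Under the hood, fastICA() models the (centered) data as a product of sources and a mixing matrix, roughly X ≈ S %*% A. With only 7 components the reconstruction is approximate, so as a rough sketch we can look at the average absolute reconstruction error rather than expecting an exact match (ica_reconstruction is just an illustrative name):

# S (72 x 7) holds the estimated sources, A (7 x 19) the estimated mixing
# matrix, and X the pre-processed data stored in the model object.
ica_reconstruction <- ica_model$S %*% ica_model$A
mean(abs(ica_model$X - ica_reconstruction))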
Let’s create a weight matrix again. The output from fastICA() doesn’t provide useful names, but the documentation provides sufficient information. To create the weight matrix we take the A field (the mixing matrix), transpose it, get the right names in the right places, and then create the plot like we’ve done several times now.
ica_weight_matrix <- data.frame(t(ica_model$A)) |>
rename_with(~ str_glue("IC{seq(.)}")) |>
mutate(variable = names(data_features)) |>
pivot_longer(
cols = starts_with("IC"),
names_to = "ic", values_to = "loading"
)
ica_loading_plot <- ggplot(
ica_weight_matrix,
aes(x = ic, y = variable, fill = loading)
) +
geom_tile() +
labs(
title = "ICA loadings",
x = NULL,
y = NULL
) +
scico::scale_fill_scico(palette = "cork", limits = c(-1, 1)) +
coord_fixed(ratio = 1 / 2)
print(ica_loading_plot)
The FastICA method doesn’t rank the components by variance like factor analysis and PCA do; the degree of variance explained is not part of the FastICA algorithm. The ordering of the components is therefore not meaningful: we could label IC1 as IC6 and vice versa and the result would be no less valid. With that in mind, we can see that two of the components (IC6 and IC4) are again very similar to the first two components we saw in the factor analysis and the PCA. These appear to be very robust components that cluster together regardless of the method we use. The direction of the components is also arbitrary; we could multiply the loadings by -1 and they would be no less valid. One of the main advantages of ICA, in my opinion, is that due to the statistical independence the components are easier to interpret and (often) have clearer constructs than, for instance, principal components. For instance, IC7 consists of just a single item (“behavior sexual risk”), which the algorithm determined was statistically independent from the other variables. Interestingly, if we look back at the output from the factor analysis, “behavior sexual risk” scores high on “uniqueness” there as well. Also, IC3 is most strongly driven by the “intention commitment” variable, while also including some other variables about intention and empowerment. This makes labeling independent components (in my experience) a bit easier than principal components.
Again, we can also visualize the correlation between the different components.
ica_model$S |>
cor() |>
data.frame() |>
rownames_to_column("comp_x") |>
pivot_longer(
cols = starts_with("X"),
names_to = "comp_y", values_to = "correlation"
) |>
ggplot(aes(x = comp_x, y = comp_y, fill = correlation)) +
geom_tile() +
geom_text(aes(label = round(correlation, 4)), color = "white") +
labs(
title = "Correlation between ICs",
x = NULL,
y = NULL
) +
scico::scale_fill_scico(palette = "berlin", limits = c(-1, 1)) +
coord_equal()
Again, the correlations between the components here are zero throughout, which fits with the statistical independence: independent components should also be (practically) uncorrelated. If the correlation matrix weren’t zero throughout, that would be a sign that you’d need to adapt your ICA approach.
Okay, let’s now compare the three approaches and put the loading matrices side by side so we can look at the differences a bit more closely. We’ll take the three weight matrices from factor analysis, PCA, and ICA, and bind them together. We’ll then create a plot much in the same way as before, adding a facet per method so the weight matrices appear side by side and are easier to compare.
all_weight_matrices <- bind_rows(
fa_weight_matrix |>
rename(comp = factor) |>
mutate(alg = "FA"),
pc_weight_matrix |>
rename(comp = prin_comp) |>
mutate(alg = "PCA"),
ica_weight_matrix |>
rename(comp = ic) |>
mutate(alg = "ICA")
)
all_weight_matrices |>
filter(parse_number(comp) <= n_comps) |>
mutate(
alg = str_glue("{alg} loadings"),
alg = as_factor(alg)
) |>
ggplot(aes(x = comp, y = variable, fill = loading)) +
geom_tile() +
labs(
x = NULL,
y = NULL
) +
scico::scale_fill_scico(palette = "cork", limits = c(-1, 1)) +
facet_wrap(~alg, scales = "free_x")
Here we can most clearly see the overlap between the three methods: Factor 1, PC1, and IC6 capture essentially the same information, and the same goes for Factor 2, PC2, and IC4. Other than that, the components vary quite markedly; I wouldn’t be comfortable calling any of the other components “fairly similar” to one another. You can see how some variables load together across multiple methods, but then the components they’re captured in also contain other information or miss information. We already discussed IC7 consisting mostly of a single variable, and a similar thing happens with Factor 4, but for a different variable.
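To put a rough number on that visual overlap (a sketch, and the exact values will depend on your run), we can correlate the scores of the overlapping factors and components; up to a possible sign flip these should be close to 1 or -1:

cor(fa_model$scores[, "Factor1"], pc_model$x[, "PC1"])
cor(fa_model$scores[, "Factor2"], pc_model$x[, "PC2"])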
I’ll quickly go over hierarchical clustering too; it’s simple and easy to interpret. Hierarchical clustering (as implemented in hclust()) builds a tree of your variables: it starts with every variable in its own cluster and repeatedly merges the most similar ones, based on a distance measure such as the Euclidean distance (other measures are available), until everything ends up in one cluster. Reading the resulting tree from the top down splits the variables first into two groups, then three, then four, and so on. It is also rather strict: with the previous methods it’s possible for a variable to load on several clusters, while with hierarchical clustering each variable belongs to a single cluster only. This makes it an easy way to see how variables cluster together at different thresholds.
First we convert our data frame with the features to a matrix and then transpose it. If we didn’t transpose, we’d be clustering participants together, and here we’re interested in clustering the variables (clustering individuals can be interesting too, but this post is already long enough). Next we create a distance matrix using the dist() function; this matrix contains the distances between each pair of variables. You can think of the distance matrix as a sort of weighted adjacency matrix. Then we can supply this distance matrix to the hierarchical clustering function (hclust(), documentation). We don’t need to supply the number of clusters at this stage.
dist_data_features <- dist(t(as.matrix(data_features)))
hclust_model <- hclust(dist_data_features)
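To make the distance matrix a bit less abstract (a sketch): the entry for a pair of variables is simply the Euclidean distance between their standardized values, which we can reproduce by hand for, say, the two empowerment variables that end up close together in the dendrogram below.

# Euclidean distance computed by hand...
sqrt(sum((data_features$empowerment_knowledge - data_features$empowerment_abilities)^2))
# ...should match the corresponding entry in the distance matrix.
as.matrix(dist_data_features)["empowerment_knowledge", "empowerment_abilities"]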
We can visualize the hierarchical clustering with a dendrogram simply by supplying the hclust_model to the plot() function (like plot(hclust_model)), but in this case I’d prefer to use the ggdendrogram() function from the {ggdendro} package.
ggdendro::ggdendrogram(hclust_model)
Here we can see the different “branches” of the tree. You can see that, for instance, “empowerment knowledge” and “empowerment abilities” are very close to each other, but it’s a long way from “perception severity” to “motivation willingness”. You may also notice the values along the y-axis. You could for instance “cut” the tree at y = 13 and you’d separate it into 2 “branches”; cut it at, say, y = 8, and you’d get 11 clusters. Since we already determined that we want 7 clusters, we can ask another function to determine where to cut the tree to get the number of clusters we’d like. This is implemented in the cutree() function.
hclust_weight_matrix <- cutree(hclust_model, k = n_comps)
The cutree() function assigns a cluster to each of the variables. It looks at the tree, determines where to cut it to get the desired number of branches, and then tells you the composition of those branches. We can recreate this ourselves by simply adding a geom_hline() to the dendrogram. With some trial and error, it seems that cutting the tree at y = 10.5 gives us 7 clusters.
ggdendro::ggdendrogram(hclust_model) +
geom_hline(yintercept = 10.5, color = "firebrick")
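As a side note, cutree() can also cut by height directly through its h argument; assuming 10.5 indeed yields 7 branches, this sketch should reproduce the same assignment as cutting by k above (cluster_by_height is just an illustrative name):

cluster_by_height <- cutree(hclust_model, h = 10.5)
# Should give the same 7-way split as cutree(hclust_model, k = n_comps)
all(cluster_by_height == hclust_weight_matrix)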
Let’s look at how the clusters are made up according to the hierarchical clustering model. This isn’t really a weight matrix, but rather a definition of the clusters: instead of a numerical loading, each variable either belongs to a cluster or it doesn’t.
hclust_weight_matrix |>
data.frame() |>
rename(cluster = hclust_weight_matrix) |>
rownames_to_column("variable") |>
ggplot(aes(
x = as_factor(cluster), y = variable,
fill = as_factor(cluster)
)) +
geom_tile() +
labs(
x = NULL,
y = NULL,
fill = "cluster"
) +
scico::scale_fill_scico_d(palette = "batlow")
We can again see that the same two clusters show up: clusters 6 and 7 resemble Factor 2 and Factor 1, PC2 and PC1, and IC4 and IC6 respectively. If you want to learn more about what you can do with hclust() and associated functions, you can check out this webpage.
Okay, I’d like to leave it at that; this blog post is long enough. Again, I had to simplify and take some shortcuts, so if you think I made mistakes in that effort, please let me know and I’ll fix them as well as I can!