Appendix B – Further on LCA and model selection

Latent Class Analysis (LCA) is a statistical technique used to identify unobserved or latent subgroups within a population based on observed categorical variables. It is a type of finite mixture modelling that assumes the population consists of several distinct groups, each characterized by a unique pattern of responses or probabilities on the observed variables. It provides a valuable tool for understanding heterogeneity within populations and can help researchers gain insights into the characteristics and behaviour of different subgroups.

The goal of LCA is to assign individuals to the most appropriate latent class based on their patterns of responses to a set of categorical variables, which in this setting are the employment barriers. It allows researchers to understand the underlying structure or typology of a population by identifying groups of individuals who share similar response patterns. In other words, it allows us to identify groups of individuals who face similar employment barriers.

LCA assumes that the observed categorical variables (i.e., the employment barriers) are indicators of the latent classes and that the relationship between the latent classes and the observed variables can be captured by probabilities.
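In symbols, this assumption can be sketched as follows. The notation is ours and serves only as an illustration: we assume that each of the J barriers is coded as a binary indicator y_ij (1 if individual i faces barrier j, 0 otherwise) and that barriers are independent of each other within a class, which is the standard local independence assumption in LCA. Here pi_c denotes the share of class c in the population and rho_jc the probability that a member of class c faces barrier j. The first expression is the probability of observing a response pattern, and the second is the probability that an individual with that pattern belongs to class c.

```latex
P(y_i) = \sum_{c=1}^{C} \pi_c \prod_{j=1}^{J} \rho_{jc}^{\,y_{ij}} (1 - \rho_{jc})^{1 - y_{ij}},
\qquad
P(c \mid y_i) = \frac{\pi_c \prod_{j=1}^{J} \rho_{jc}^{\,y_{ij}} (1 - \rho_{jc})^{1 - y_{ij}}}{P(y_i)}
```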

The process of conducting LCA involves several steps. First, the number of latent classes needs to be specified based on theoretical considerations and/or model fit criteria. This process is described further below. Then, the model estimates the latent class probabilities and item-response probabilities using maximum likelihood estimation. Once the model is estimated, individuals can be assigned to the most likely latent class based on their response patterns.
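The estimation and assignment steps can be illustrated with a minimal sketch of the expectation-maximisation (EM) algorithm, a common way of obtaining maximum likelihood estimates in latent class models. This is not the software used for the results in this report. The function names, the binary coding of the indicators, the random starting values, and the simulated data are all assumptions made for the example.

```python
import numpy as np


def e_step(Y, pi, rho):
    """Posterior class probabilities and log-likelihood contribution per person."""
    # log P(y_i, c) = log pi_c + sum_j [y_ij log rho_cj + (1 - y_ij) log(1 - rho_cj)]
    log_joint = np.log(pi) + Y @ np.log(rho).T + (1 - Y) @ np.log(1 - rho).T
    log_marg = np.logaddexp.reduce(log_joint, axis=1)   # log P(y_i)
    post = np.exp(log_joint - log_marg[:, None])        # P(c | y_i)
    return post, log_marg


def fit_lca(Y, n_classes, weights=None, n_iter=500, tol=1e-8, seed=0):
    """Fit a latent class model to binary indicators Y (n x J) by EM.

    Returns class shares pi (C,), item probabilities rho (C, J),
    posterior probabilities post (n, C) and the weighted log-likelihood.
    """
    Y = np.asarray(Y, dtype=float)
    n, J = Y.shape
    w = np.ones(n) if weights is None else np.asarray(weights, dtype=float)
    rng = np.random.default_rng(seed)
    pi = np.full(n_classes, 1.0 / n_classes)
    rho = rng.uniform(0.25, 0.75, size=(n_classes, J))  # random starting values
    post, log_marg = e_step(Y, pi, rho)
    ll = float(w @ log_marg)
    for _ in range(n_iter):
        # M-step: weighted class shares and weighted item means within each class
        wp = post * w[:, None]
        class_w = np.maximum(wp.sum(axis=0), 1e-12)     # guard against empty classes
        pi = class_w / class_w.sum()
        rho = np.clip((wp.T @ Y) / class_w[:, None], 1e-6, 1 - 1e-6)
        # E-step with the updated parameters
        post, log_marg = e_step(Y, pi, rho)
        new_ll = float(w @ log_marg)
        converged = new_ll - ll < tol
        ll = new_ll
        if converged:
            break
    return pi, rho, post, ll


# Usage on simulated data (not the EU-SILC sample): two well-separated classes,
# ten binary indicators.
rng = np.random.default_rng(1)
z = rng.random(2000) < 0.4                              # simulated true class
high_low = np.array([0.9] * 5 + [0.1] * 5)
p_true = np.where(z[:, None], high_low, 1 - high_low)
Y = (rng.random((2000, 10)) < p_true).astype(int)

pi, rho, post, ll = fit_lca(Y, n_classes=2)
assigned = post.argmax(axis=1)                          # most likely class per person
```

The last line is the assignment step: each individual is placed in the class with the highest posterior probability. In practice, EM is usually run from several random starting points, because the likelihood can have local maxima.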

As mentioned above, the number of latent classes needs to be specified based on theoretical considerations and/or model fit criteria. We use both strategies to determine the optimal number of groups. The point of departure is a baseline model with the ten identified barriers as the only inputs. We then estimate 20 models, with the number of groups varying from 1 to 20. Similar to Fernandez et al. (2016), we calculate the Akaike Information Criterion (AIC), the Bayesian Information Criterion (BIC) and the classification error for each model.

In simple terms, the AIC and BIC balance the goodness of fit of a model against its complexity (they penalize models for being too complex). The goal is to find a model that fits the data well using as few parameters as possible. The difference between the two is that the BIC places a stronger penalty on model complexity than the AIC (which is also evident in figure 6.1). A researcher who relies only on model fit criteria would select the model with the lowest AIC or BIC, since a lower value indicates a better trade-off between model fit and complexity.
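The two criteria have standard definitions: AIC = -2 ln L + 2k and BIC = -2 ln L + k ln n, where ln L is the maximised log-likelihood, k the number of free parameters and n the number of observations. The sketch below computes both. The parameter count assumes binary indicators, which is our assumption for the illustration: a model with C classes and J barriers then has (C - 1) free class shares and C x J item probabilities. The function names are hypothetical.

```python
import math


def n_free_parameters(n_classes, n_items):
    """Free parameters of an LCA with binary items: class shares plus item probabilities."""
    return (n_classes - 1) + n_classes * n_items


def aic(log_lik, k):
    """Akaike Information Criterion: penalty of 2 per parameter."""
    return -2.0 * log_lik + 2.0 * k


def bic(log_lik, k, n_obs):
    """Bayesian Information Criterion: penalty of ln(n_obs) per parameter."""
    return -2.0 * log_lik + k * math.log(n_obs)


# With ten binary barriers, moving from 5 to 8 classes adds 33 parameters.
k5 = n_free_parameters(5, 10)   # 54
k8 = n_free_parameters(8, 10)   # 87
```

Because ln n exceeds 2 as soon as n is 8 or larger, the BIC penalty per parameter is larger than the AIC penalty in any realistic sample, which is why the BIC favours smaller models. With weighted survey data, the appropriate value of n is itself a judgment call.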

The classification error provides further information for the choice of the optimal number of latent classes. This measure summarises how well the model is able to classify individuals into classes. When an individual's class-membership probabilities are far from 0 or 1, the assignment to a class can seem arbitrary, which is the intuition behind the classification error. In general, a lower value signals a better classification of individuals into specific latent classes. Although a certain amount of classification error is natural in latent class analysis, values above 30 percent signal that the model is not able to discriminate between classes when allocating a significant number of cases.
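One common way of computing this measure is sketched below, assuming that each individual is assigned to the class with the highest posterior membership probability. The expected share of misclassified individuals is then one minus the (weighted) average of that highest probability. This definition is an assumption on our part and may differ in detail from the one behind figure 6.1. The function name and the example numbers are hypothetical.

```python
import numpy as np


def classification_error(post, weights=None):
    """Expected share of misclassified cases under modal assignment.

    post: (n, C) array of posterior class-membership probabilities.
    weights: optional (n,) array of survey weights.
    """
    post = np.asarray(post, dtype=float)
    w = np.ones(post.shape[0]) if weights is None else np.asarray(weights, dtype=float)
    # 1 - max_c P(c | y_i) is the probability that person i's modal assignment is wrong
    return float(1.0 - np.sum(w * post.max(axis=1)) / np.sum(w))


# Three hypothetical individuals: one clear case, one fairly clear, one coin flip.
example = [[0.9, 0.1], [0.6, 0.4], [0.5, 0.5]]
err = classification_error(example)
```

If every individual had a membership probability of exactly 1 for some class, the measure would be 0. The closer the probabilities are to an even split across classes, the higher it becomes.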

These statistics are presented in figure 6.1 for the 20 different models. First, the figure shows that the BIC penalizes model complexity to a larger extent than the AIC, which is evident from the U-shape of the BIC curve: as model complexity increases (i.e., as we introduce more latent classes), the BIC starts increasing. The BIC suggests a model with 5 groups, since the BIC curve has its minimum at 5 classes. The AIC curve, by contrast, is almost flat from a model with 8 classes up to a model with 20 classes, and hence the AIC is not a helpful measure to guide model selection in this case. The model with 5 groups has a slightly higher classification error than the model with 8 classes, but it is still in the adequate range. Nonetheless, compared to previous work in the OECD’s Faces of Joblessness project, a model with 5 classes seems unusual, and we therefore adopt a model with 8 latent classes, since it has a relatively low BIC, AIC, and classification error.

Figure 6.1 AIC, BIC, and classification error in LCA models with varying latent classes

Source: Own calculations based on EU-SILC from the Nordic countries.
Note: In all calculations, we use the weighting from the selected respondent.

It is important to note that model selection in LCA is not an exact science and involves a combination of statistical criteria and substantive judgment. Researchers should consider multiple methods and rely on a combination of statistical fit indices, theoretical considerations, and replication to make an informed decision about the number of latent classes to include in the analysis.