Vectors of Mind III
Assume you have a subtitle

Contents
• Random variables
• Fundamental Theorem of FA
• Rotational indeterminacy
• Estimation
• Principal factors
• Iterative principal factors (i.e., OLS / ULS / minres)
• Maximum likelihood
• We're gonna see many easy, approachable, and friendly-looking equations
• And lots of simplifications that make things even clearer

Random variables

Random variables
• A variable whose values are samples from a probability distribution
• It doesn't have one true value; the probability distribution is the variable, as it represents the random process behind the variable
• Example: X ~ N(μ, σ²) — X is normally distributed with mean μ and variance σ²

Expected value
• A long-run average of a random variable
X ~ N(μ, σ²)
E(X) = μ

Expected values – sidenote
• By the way! You know what is defined as an expected value?
• The CTT true score!
X = τ + ε
τ = E(X)
In other words, the true score is defined as the long-run average of the raw score.

Expected value – constants
• The expectation of a constant is the constant itself (because it's not a random variable)
E(C) = C

Expected values – variance
• Now, consider the (scalar) formula for the variance of a random variable:
σ² = ∑(x − μ)² / N
…which is the "mean squared deviation from the mean", right?
• As an expected value: σ² = E[(X − μ)²]

Expected values – variance/covariance matrix
C_x = E[(x − μ)(x − μ)']
• Expanding, we get the expectation of the outer product of the p × 1 column vector of deviations with its transpose (a 1 × p row vector):
[(x₁ − μ₁), (x₂ − μ₂), …, (x_p − μ_p)]' [(x₁ − μ₁), (x₂ − μ₂), …, (x_p − μ_p)]
• …which gives us the variance/covariance matrix of the manifest variables

Expected values – centered variables
• For centered variables (means = 0), we can omit the μs:
C_x = E[xx']

Expected values – summary
For normally distributed random variables: X ~ N(μ, σ²), E(X) = μ
For non-random variables: E(C) = C
To get a variance-covariance matrix: C_x = E[(x − μ)(x − μ)']
For centered variables: C_x = E[xx']
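A small numpy illustration of the expectation-as-long-run-average idea and the outer-product form of the covariance matrix; the mean vector and covariance used for the simulation are made up for this sketch.

```python
import numpy as np

rng = np.random.default_rng(0)
n, p = 200_000, 3

mu    = np.array([1., 2., 3.])          # made-up mean vector
A     = rng.normal(size=(p, p))
Sigma = A @ A.T                          # made-up (positive definite) covariance matrix
X     = rng.multivariate_normal(mu, Sigma, size=n)

# E[x] as a long-run average:
print(np.round(X.mean(axis=0), 2))       # ~ [1, 2, 3]

# C_x = E[(x - mu)(x - mu)']: average the outer products of the deviation vectors
dev   = X - mu
C_hat = dev.T @ dev / n                  # long-run average of (x - mu)(x - mu)'
print(np.round(C_hat - Sigma, 2))        # ~ 0 up to sampling error (this is what np.cov estimates)
```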
Deriving the fundamental theorem of FA

The data model in factor analysis
• Recall the way we formulated the Common Factor Model earlier – we expressed the MVs as a linear function of the common factors and the unique factors:
x_ij = μ_j + λ_j1 z_i1 + λ_j2 z_i2 + ⋯ + λ_jm z_im + 1·u_ij
(mean + common factor part + unique factor part)
x_ij = μ_j + ∑_{k=1}^{m} λ_jk z_ik + u_ij

The data model in factor analysis
x_ij = μ_j + ∑_{k=1}^{m} λ_jk z_ik + u_ij
Where:
x_ij is the score of person i on manifest variable j
μ_j is the mean of manifest variable j
z_ik is the common factor score of person i on factor k
λ_jk is the factor loading of manifest variable j on factor k
u_ij is the unique factor score of person i on unique factor j; and u_ij = s_ij + e_ij
s_ij is the factor score of person i on specific factor j
e_ij is the error term for person i on manifest variable j

The data model in factor analysis
• We will consider the model as operating in a population, and thus we will consider the data model for a random individual by omitting the subscript i:
x_j = μ_j + λ_j1 z_1 + λ_j2 z_2 + ⋯ + λ_jm z_m + 1·u_j
x_j = μ_j + ∑_{k=1}^{m} λ_jk z_k + u_j
• Here we actually have p equations, one for each manifest variable x_1, …, x_p, but we can express it all as a single equation using matrix notation:
x = μ + Λz + u

The data model in factor analysis
x = μ + Λz + u
Where:
x is a p × 1 vector of a random person's scores on the p manifest variables
μ is a p × 1 vector of population means of the p manifest variables
Λ is a p × m matrix of factor loadings, where p > m (a rectangular matrix)
z is an m × 1 vector of (unobservable) common factor scores
u is a p × 1 vector of (unobservable) unique factor scores

The data model in factor analysis
x = μ + Λz + u
• For illustration, let's extract the equation for the third manifest variable. Let's assume that m = 3 (there are three common factors):
x₃ = μ₃ + [λ₃₁ λ₃₂ λ₃₃] [z₁ z₂ z₃]' + u₃
x₃ = μ₃ + λ₃₁z₁ + λ₃₂z₂ + λ₃₃z₃ + u₃

The data model in factor analysis
• The data model represents a random observation in the population. It is intended to explain the structure of the raw data (i.e., the scores on manifest variables).
• However, it contains a LOT of unknowns.
• While we observe the manifest variables x and we can at least estimate the population means μ, the remaining terms in the equation are unknown to us.
• We do not know the latent scores z and u; in fact, we cannot know them, since latent variables are unobservable.
• Similarly, we do not know Λ, the matrix of factor loadings – we are unaware of how the unobservable latent variables affect the (observable) manifest variables.

The data model in factor analysis
• Well, that's kind of a pickle.
• So, do we just, like, go home now?
• Maybe. Or we can help ourselves with some tricks.
• We have already established that the latent variable scores are unobservable, so we might want to give up on trying to solve for them in the data model equation.
• Maybe if we turn the problem around, we can get rid of z and u completely and focus on Λ.

The data model in factor analysis
• We could use the data model, along with some assumptions, to derive a covariance structure model.
• The data model, accompanied by assumptions about the joint distribution of the elements in z and u, implies a model for the population covariance matrix.
• The model for the covariance matrix is known as the covariance structure and is intended to explain the variances and covariances of the manifest variables, not the raw data.
• Before we proceed to derive the covariance structure model, we'll talk about the important distributional assumptions and lay down some notational rules.

Assumptions
• We will make the following assumptions about the common factors z and unique factors u:
1. The common factors and the unique factors are independently distributed. As such, the common factors are uncorrelated with the unique factors. In other words, Σ_zu = 0 = Σ'_uz.
2. The unique factors are mutually independent. As such, the unique factors for different MVs are uncorrelated with each other. This implies that the covariance matrix Σ_uu is diagonal.
3. The common factors and the unique factors are standardized to have means of zero.
4. The common factors are also standardized to have unit variances (variances of 1).
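A tiny sketch of the data model in code, with made-up μ, Λ, z and u, just to show that the matrix equation is nothing more than the p scalar equations stacked on top of each other (here checking the third MV, as on the slide above).

```python
import numpy as np

rng = np.random.default_rng(0)
p, m = 4, 3                                     # 4 manifest variables, 3 common factors

mu = np.array([10., 12., 8., 15.])              # made-up MV means
L  = rng.uniform(-1, 1, size=(p, m))            # made-up p x m loading matrix Lambda
z  = rng.normal(size=m)                         # one person's common factor scores
u  = rng.normal(size=p)                         # one person's unique factor scores

x = mu + L @ z + u                              # the data model: x = mu + Lambda z + u

# The matrix equation is just p scalar equations stacked; e.g. the third MV:
x3 = mu[2] + L[2, :] @ z + u[2]                 # x_3 = mu_3 + lambda_31 z_1 + ... + u_3
print(np.isclose(x[2], x3))                     # True
```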
Assumptions – recap
Common factors:
Since E(z) = 0, then E(zz') = Σ_zz.
Also, because the common factors have unit variances, E(zz') = Σ_zz = R_zz.

Assumptions – recap
Unique factors:
Since E(u) = 0, then E(uu') = Σ_uu.
Also, for standardized unique factors, E(uu') = Σ_uu = R_uu = I.

Assumptions – recap
Relationship between unique and common factors: R_zu = 0 = R'_uz

Deriving the mean structure
• The mean and covariance structures are derived from the data model:
x = μ + Λz + u
• Let's derive the mean structure first. We want an equation that represents the mean vector μ of the manifest variables.

Deriving the mean structure
• If we take the expectation of both sides of the equation above, we get:
E(x) = E(μ) + E(Λz) + E(u)
E(x) = μ + ΛE(z) + E(u)
• Given the assumptions we previously talked about, this follows:
μ_x = μ + Λ0 + 0
μ_x = μ
Therefore, the means do not depend on the model!

Deriving the covariance structure
• Alright. Let's consider the derivation of the covariance structure.
• We know that the scores on the MVs are supposed to be weighted linear combinations of common factors and unique factors:
X = ΛZ + ΨU
• Right? Regression style. (Here Z and U hold standardized factor scores and Ψ is a diagonal matrix of unique-factor weights, so ΨU plays the role of u from the data model.)

Deriving the covariance structure
• First, let's standardize the vector of MVs to make our lives easier.
• This way, Σ_xx becomes R_xx.
X = ΛZ + ΨU
• We also know that in this case, R_xx = E(XX')
• So:
R_xx = E[(ΛZ + ΨU)(ΛZ + ΨU)']

Deriving the covariance structure
R_xx = E[(ΛZ + ΨU)(ΛZ + ΨU)']
• Now, let's transpose the second bracket (it simply transposes all the elements while switching the positions of products):
R_xx = E[(ΛZ + ΨU)(Z'Λ' + U'Ψ')]
• Now, let's multiply the contents of the expectation:
R_xx = E[ΛZZ'Λ' + ΛZU'Ψ' + ΨUZ'Λ' + ΨUU'Ψ']

Deriving the covariance structure
R_xx = E[ΛZZ'Λ' + ΛZU'Ψ' + ΨUZ'Λ' + ΨUU'Ψ']
• Now, we just get rid of the expectation (just like I got rid of expecting to finish my Ph.D. on time) by distributing it over the terms:
R_xx = E(Λ)E(ZZ')E(Λ') + E(Λ)E(ZU')E(Ψ') + E(Ψ)E(UZ')E(Λ') + E(Ψ)E(UU')E(Ψ')
• All the factor weights (loadings) Λ and Ψ are constants:
R_xx = ΛE(ZZ')Λ' + ΛE(ZU')Ψ' + ΨE(UZ')Λ' + ΨE(UU')Ψ'

Deriving the covariance structure
• E(ZZ') is the variance-covariance matrix of the common factors, Σ_zz
• E(UU') is the variance-covariance matrix of the unique factors, Σ_uu
• E(ZU') and E(UZ') are the covariance matrices between the common and unique factors, Σ_zu and Σ_uz
So, this:
R_xx = ΛE(ZZ')Λ' + ΛE(ZU')Ψ' + ΨE(UZ')Λ' + ΨE(UU')Ψ'
• Becomes this:
R_xx = ΛΣ_zzΛ' + ΛΣ_zuΨ' + ΨΣ_uzΛ' + ΨΣ_uuΨ'

Deriving the covariance structure
R_xx = ΛΣ_zzΛ' + ΛΣ_zuΨ' + ΨΣ_uzΛ' + ΨΣ_uuΨ'
• Phew! Now what?
• We get help from our favourite superhero – Mr. Assumptionman – of course! (He's bold because he's a matrix. Wait what, that doesn't make sense)

Deriving the covariance structure
R_xx = ΛΣ_zzΛ' + ΛΣ_zuΨ' + ΨΣ_uzΛ' + ΨΣ_uuΨ'
• We know some stuff about those sigmas:
• Σ_zu = 0 = Σ_uz
• Σ_uu = I
R_xx = ΛΣ_zzΛ' + Λ0Ψ' + Ψ0Λ' + ΨIΨ'
R_xx = ΛΣ_zzΛ' + Ψ²

Notation
• We will use the following notation from now on:
The manifest variable covariance matrix: Σ = Σ_xx
The common factor covariance matrix: Φ = Σ_zz
The unique factor covariance matrix: D_ψ = Σ_uu
• Note that (because of the assumptions we made) the diagonal elements of Φ are required to be equal to 1. Thus, Φ is a factor correlation matrix.
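Before stating the result formally, here is a quick Monte Carlo sanity check of the covariance structure we just derived: simulate standardized, uncorrelated Z and U, build X = ΛZ + ΨU, and compare the empirical correlation matrix with ΛΣ_zzΛ' + Ψ². The loading and uniqueness values are the ones from the running example; any Λ with communality + uniqueness = 1 per row would do.

```python
import numpy as np

rng = np.random.default_rng(42)
n, p, m = 200_000, 4, 2

L   = np.array([[.70, .10],                     # factor loadings (orthogonal factors)
                [.70, .00],
                [.10, .70],
                [.60, .60]])
Psi = np.diag(np.sqrt([.50, .51, .50, .28]))    # unique-factor weights (SDs)

Z = rng.normal(size=(n, m))                     # standardized, uncorrelated common factors
U = rng.normal(size=(n, p))                     # standardized, mutually independent unique factors
X = Z @ L.T + U @ Psi.T                         # X = Lambda Z + Psi U, observation by observation

R_emp   = np.corrcoef(X, rowvar=False)          # "observed" correlation matrix
R_model = L @ L.T + Psi @ Psi.T                 # Lambda Sigma_zz Lambda' + Psi^2 (Sigma_zz = I)
print(np.round(R_emp - R_model, 3))             # ~ 0 up to sampling error
```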
Fundamental Theorem of FA
R_xx = ΛΦΛ' + Ψ²
• For Σ: Σ = ΛΦΛ' + D_ψ
Ψ² is a diagonal matrix with uniquenesses on the diagonal.
D_ψ is a diagonal matrix with unique variances on the diagonal (to scale the resulting correlation matrix into a variance-covariance matrix).

Implications of this
1. We don't need to know the latent scores (common or unique) to reproduce the variance-covariance matrix of MVs! Whaaaat!

Implications of this
2. The model assumes that all correlations between MVs are due only to the effect of the common factors:
E(x̃x̃') = Σ̃ = ΛΦΛ'
or E(x̃x̃') = Σ̃ = ΛΛ' for uncorrelated factors
(x̃ is the common part of the MVs and Σ̃ is the reduced covariance matrix it implies.)
• How well this assumption corresponds with reality is what model fit testing is all about. It's sometimes called the local independence assumption.
• This also implies that λ is a regression coefficient. It tells us how the value of x changes with a unit change in the factor score. Which is what we started with, so it's probably not a surprise.

Implications of this
3. We can use the information from Ψ² to create the so-called reduced correlation matrix, which has communalities (1 − uniqueness) on the diagonal and MV correlations off the diagonal:
R_xx = ΛΦΛ' + Ψ²
R_xx − Ψ² = ΛΦΛ'
• This will be useful later on.

Implications of this
4. Λ has different elements for R_xx and Σ.
• As we already covered, Ψ² contains uniquenesses while D_ψ contains (unscaled) unique variances.
• Similarly, we distinguish Λ and Λ*, the latter containing standardized factor loadings λ*:
λ*_jk = λ_jk / σ_j
Or in matrix-speak: Λ* = D^{−1/2} Λ, where D^{−1/2} is a diagonal matrix with reciprocal SDs (1/SD) of the MVs on the diagonal.

An example
• Consider the covariance structure for uncorrelated (orthogonal) factors to better understand the relationship between the elements of Σ and the elements of Λ and D_ψ.

An example
Σ (4 × 4, lower triangle: σ₁₁; σ₂₁ σ₂₂; σ₃₁ σ₃₂ σ₃₃; σ₄₁ σ₄₂ σ₄₃ σ₄₄) = ΛΛ' + D_ψ, with
Λ =
λ₁₁ λ₁₂
λ₂₁ λ₂₂
λ₃₁ λ₃₂
λ₄₁ λ₄₂
and D_ψ = diag(ψ₁₁, ψ₂₂, ψ₃₃, ψ₄₄).
• This shows us that σ₁₁ = λ₁₁² + λ₁₂² + ψ₁₁
• Also, σ₂₁ = λ₂₁λ₁₁ + λ₂₂λ₁₂
• The covariance between two MVs is the sum of the products of their loadings on the common factors.
• The variance of an MV is the sum of its squared loadings and its unique factor variance.

An example
• The factor loading matrix is the one from the example data (two factors):
Λ =
.70  .10
.70  .00
.10  .70
.60  .60
• The covariance between PC and VO:
σ₂₁ = λ₂₁λ₁₁ + λ₂₂λ₁₂ = 0.7 · 0.7 + 0.0 · 0.1 = 0.49
• Let's compute the communality and the unique variance of PC by hand.

Communality
• The j-th diagonal element ψ_jj of D_ψ is the j-th unique variance. The j-th communality (the proportion of variance of MV j due to the common factors) can be written as:
h²_j = [ΛΦΛ']_jj / σ_jj = 1 − ψ_jj / σ_jj
• If the factors are uncorrelated, then:
h²_j = [ΛΛ']_jj / σ_jj = 1 − ψ_jj / σ_jj
…that is, the sum of squares of row j of Λ divided by the variance of the j-th MV.

Recap
• We wanted to predict the MVs as a weighted combination of common and unique factor scores:
X = ΛZ + ΨU
• But we don't know the scores, so, instead, we look at their covariance structure.
• That's how we got here:
R_xx = ΛΦΛ' + Ψ²

Recap
R_xx = ΛΦΛ' + Ψ²
• Λ = matrix of factor loadings (p × m)
• Φ = matrix of factor correlations (m × m)
• Ψ² = diagonal matrix of uniquenesses (p × p)
• You can also scale this into a covariance equation, as shown before.
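A quick numpy check of the example above, assuming the standardized orthogonal case (so the uniquenesses are simply 1 minus the communalities):

```python
import numpy as np

# Example loading matrix (two orthogonal factors); uniquenesses = 1 - communalities
L    = np.array([[.70, .10],
                 [.70, .00],
                 [.10, .70],
                 [.60, .60]])
h2   = np.sum(L**2, axis=1)                    # communalities = row sums of squared loadings
psi2 = 1 - h2                                  # uniquenesses

Sigma = L @ L.T + np.diag(psi2)                # Lambda Lambda' + Psi^2 (orthogonal factors)
print(np.round(Sigma, 2))
print(round(Sigma[1, 0], 2))                   # sigma_21 = .7*.7 + .0*.1 = 0.49

print(np.round(h2, 2))                         # communalities: .50 .49 .50 .72
print(np.round(psi2, 2))                       # unique variances: .50 .51 .50 .28
```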
Estimation

Estimation of FA parameters
• Now we need to think about how to find the values to place into the model matrices Λ, Φ, Ψ².
• Let's start with an ideal scenario where the factors are uncorrelated (so Φ = I) and our observed correlation matrix is the population correlation matrix P (i.e., there is no sampling error involved).

Rotational indeterminacy
P = ΛΛ' + D_ψ
• But wait! Even in this scenario, things are weird:
P = Λ₁Λ₁' + D_ψ = Λ₂Λ₂' + D_ψ = Λ₃Λ₃' + D_ψ = ⋯
What the hell! Are you telling me there are infinite possible Λs?

Rotational indeterminacy
• Hell yeah! If you have more than one factor, there is no unique solution to be found.
• Suppose I'm not wrong and it indeed holds that:
P = Λ₁Λ₁' + D_ψ = Λ₂Λ₂' + D_ψ
(we're just considering two solutions now, but there are infinitely many)
• In that case, one solution (Λ₂) has to be linked in some way with the other (Λ₁). To be precise, Λ₂ = Λ₁T, where T is an m × m orthogonal matrix (TT' = I).

Rotational indeterminacy
Λ₂Λ₂' = Λ₁T(Λ₁T)'
Λ₂Λ₂' = Λ₁T(T'Λ₁')
Λ₂Λ₂' = Λ₁TT'Λ₁'
Λ₂Λ₂' = Λ₁IΛ₁'
Λ₂Λ₂' = Λ₁Λ₁'
• See? Λ₁ and Λ₂ are equally fine solutions.

Rotational indeterminacy
• In other words, if we can find one solution, we can find other alternative solutions. We simply choose any matrix T such that TT' = I and we define Λ₂ = Λ₁T.
• We've just seen that Λ₁ and Λ₂ are equally good solutions, since Λ₂Λ₂' = Λ₁Λ₁'.
• This is called rotational indeterminacy. (There is a short numeric demo of this just before the worked example below.)

Rotational indeterminacy
• We must resolve this problem somehow if we want to find a single, unique solution for Λ every time we perform a factor analysis.
• In other words, we need to find a criterion for defining this unique solution.
• Luckily, we can arrive at a solution with the help of eigenvalues and eigenvectors.

Eigenvalues, eigenvectors, and Λ
• Recall that the eigenstructure of a symmetric matrix S is the following:
S = U D_l U'
…where the columns of U are eigenvectors and the diagonal elements of D_l are eigenvalues (D_l is a diagonal matrix).

I: We know P, Ψ², and the model holds perfectly
• Now, let's take a look at Hypothetical Scenario 1:
• You know the true P
• You know the unique factor variances (Ψ²), and thus you also know the communalities (the diagonal of P − Ψ²)
• The model holds perfectly in the population data

I: We know P, Ψ², and the model holds perfectly
• In this scenario, obtaining Λ is actually quite easy.
• You take the reduced correlation matrix (P − Ψ²).
• Because Ψ² is a diagonal matrix containing uniquenesses, the diagonal of (P − Ψ²) will contain (1 − uniquenesses); thus, it will contain the true communalities.

I: We know P, Ψ², and the model holds perfectly
• Perform the eigenvalue-eigenvector decomposition of (P − Ψ²), which will yield some eigenvectors U and some eigenvalues D_l.
• Order the eigenvalues by size from largest to smallest.
• The first m eigenvalues will be non-zero, the rest will be zero (why?)
• Keep these non-zero eigenvalues and their associated eigenvectors.
• Note: This is the same as PCA, only done on P − Ψ² instead of P.

I: We know P, Ψ², and the model holds perfectly
• Keep only the nonzero eigenvalues in D_l, take their square roots and put them back into a matrix we will call D_l^{1/2}.
• Then, calculate Λ = U D_l^{1/2}.
• Magic.
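Before the worked example, a tiny numeric illustration of the rotational indeterminacy above: any orthogonal T (here generated via a QR decomposition) turns one solution into another that reproduces exactly the same correlations. The 4 × 2 loading matrix is the one from the running example.

```python
import numpy as np

rng = np.random.default_rng(1)

L1 = np.array([[.70, .10],                    # one perfectly fine solution
               [.70, .00],
               [.10, .70],
               [.60, .60]])

# Build a random 2 x 2 orthogonal matrix T (via QR decomposition), so T @ T.T = I
T, _ = np.linalg.qr(rng.normal(size=(2, 2)))
print(np.allclose(T @ T.T, np.eye(2)))        # True

L2 = L1 @ T                                   # an "alternative solution"
print(np.allclose(L1 @ L1.T, L2 @ L2.T))      # True: both reproduce the same correlations
print(np.round(L2, 3))                        # ...but the loadings themselves look different
```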
I: We know P, Ψ², and the model holds perfectly
• Let's look at an example, using the example data we have seen before.
• The matrix P (the correlations among the four MVs) is given as follows:
P =
1.00  .49  .14  .48
 .49 1.00  .07  .42
 .14  .07 1.00  .48
 .48  .42  .48 1.00

I: We know P, Ψ², and the model holds perfectly
• Assume the unique variances are known:
Ψ² = diag(.50, .51, .50, .28)
• So the matrix P with communalities on the diagonal is given by:
P − Ψ² =
 .50  .49  .14  .48
 .49  .49  .07  .42
 .14  .07  .50  .48
 .48  .42  .48  .72

I: We know P, Ψ², and the model holds perfectly
• We can obtain the eigenvalues and eigenvectors of P − Ψ².
• The non-zero eigenvalues are:
D_l = diag(1.662, .548)
• And the corresponding eigenvectors:
U =
 .502  −.386
 .461  −.500
 .353   .731
 .641   .259

I: We know P, Ψ², and the model holds perfectly
• The factor loading matrix can be obtained as Λ = U D_l^{1/2}:
Λ =
 .647  −.285
 .594  −.370
 .455   .541
 .826   .192
• Wait… that's not the loading matrix I showed you last time for the example data, is it?

I: We know P, Ψ², and the model holds perfectly
• It's a transformation of the matrix I showed you earlier, in the rotational indeterminacy sense, Λ₂ = Λ₁T:
 .647  −.285      .70  .10
 .594  −.370   =  .70  .00   ×   .848  −.529
 .455   .541      .10  .70       .529   .848
 .826   .192      .60  .60
• Both Λ matrices provide an exact solution to the model. The procedure involving eigen-stuff allowed us to identify a unique solution, though.

I: We know P, Ψ², and the model holds perfectly
• Okay, so, I have just shown you how to obtain the solution (Λ) if:
• You know the population correlation matrix, P
• You know the contents of Ψ², so you know the unique variances or (conversely) the communalities
• The model holds exactly in the population
• Huh. Putting the "model holds exactly" thing aside, you will never know P and you will never know Ψ², so this is a theoretical scenario.
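A quick numpy reproduction of this worked example. The sign of each eigenvector is arbitrary, so the code flips signs to match the slides; the reproduced matrix ΛΛ' is unaffected by those signs.

```python
import numpy as np

# Reduced correlation matrix P - Psi^2 from the slides (communalities on the diagonal)
P_red = np.array([[.50, .49, .14, .48],
                  [.49, .49, .07, .42],
                  [.14, .07, .50, .48],
                  [.48, .42, .48, .72]])

evals, evecs = np.linalg.eigh(P_red)            # eigenvalues in ascending order
order = np.argsort(evals)[::-1]                 # sort descending
evals, evecs = evals[order], evecs[:, order]
print(np.round(evals, 3))                       # ~ [1.662, 0.548, 0, 0]

m = 2
L = evecs[:, :m] * np.sqrt(evals[:m])           # Lambda = U D_l^(1/2)
L *= np.sign(L[3, :])                           # fix arbitrary eigenvector signs (row 4 positive on the slide)
print(np.round(L, 3))                           # ~ .647/-.285, .594/-.370, .455/.541, .826/.192
print(np.allclose(L @ L.T, P_red))              # True: exact reconstruction of P - Psi^2

# The "nice" loadings from earlier are a rotation of this solution: Lambda_1 = Lambda T'
T = np.array([[.848, -.529],
              [.529,  .848]])
print(np.round(L @ T.T, 2))                     # ~ the original .70/.10, .70/.00, .10/.70, .60/.60
```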
II: We know P and the model holds perfectly. We don't know Ψ²
• As I said, the solution obtained by doing the eigen-decomposition of P − Ψ² requires that you know either the unique variances or the communalities (once you know one, you know the other, right?).
• But we don't know these, since finding out what they are is a part of the problem we face.
• When factor analysis was young, this was called the "communality problem".
• Many solutions were suggested to the communality problem.
• The one that "won" (was and is the most widely used) was suggested by Louis Guttman in 1940.
• Guttman suggested squared multiple correlations (SMCs) as the initial approximations to communalities.

II: We know P and the model holds perfectly. We don't know Ψ²
• Just what is a squared multiple correlation (SMC)?
• Imagine you have p manifest variables. You can try to predict the j-th manifest variable from the other (p − 1) manifest variables, linear regression-style.
• This prediction will be imperfect. You can correlate the predicted values of the j-th manifest variable with the actual values of the variable. What you get is a correlation coefficient, the multiple correlation coefficient. Square it and you get the SMC.

II: We know P and the model holds perfectly. We don't know Ψ²
• Guttman showed that if the factor model applies to the population correlation matrix P, then the squared multiple correlation of the j-th manifest variable on the other (p − 1) manifest variables is a lower bound for the communality of the j-th manifest variable.
• So, not knowing the contents of D_ψ, one might approximate the manifest variable communalities with manifest variable SMCs, computed from P. These approximations can then be substituted into the diagonal of P, and one can, again, use the eigenvalue-eigenvector approach on this modified P matrix to obtain Λ.

II: We know P and the model holds perfectly. We don't know Ψ²
• However, in order to obtain the population SMCs, we need to know P in the first place. Most often, we don't.
• In practice, we can apply the same procedure to a sample correlation matrix, R, in order to obtain sample SMCs.

III: The model does not hold perfectly. We don't know Ψ² or P
• So far, we have studied factor analysis limiting ourselves to the ideal scenario in which we know the population correlation matrix, P. Moreover, we only considered the case where the model holds exactly in the population.
• Now, let's consider the real world, in which we do not have access to P but we do have access to R. In this scenario, we are not even sure the sample correlation matrix R is drawn from a population with a correlation matrix P for which the model holds.
• As before, let's just consider the uncorrelated / orthogonal model for now.

III: The model does not hold perfectly. We don't know Ψ² or P
• First of all, we should tone down the optimism. In our hypothetical scenarios, we could select Λ and Ψ² to reconstruct P perfectly:
P = ΛΛ' + Ψ²
• In reality, our estimates of Λ and Ψ², namely Λ̂ and Ψ̂², will generally not be able to exactly reproduce our sample correlation matrix R:
R ≈ Λ̂Λ̂' + Ψ̂²

III: The model does not hold perfectly. We don't know Ψ² or P
• So, what we want is a parsimonious model (m << p) that provides a relatively good approximation to the data we have observed.
• This degree of approximation (how well the model fits the data) is reflected in the residual matrix, defined as
R − (Λ̂Λ̂' + Ψ̂²)
• The residual matrix tells us how far the correlation matrix R we have observed is from the correlation matrix the model predicts. In other words, how far the observed correlation matrix is from the model-implied correlation matrix (which is simply Λ̂Λ̂' + Ψ̂²).

III: The model does not hold perfectly. We don't know Ψ² or P
• Every element in the residual matrix tells us how far the model-implied (predicted) value of that element is from its observed value.
• Alright, so – again, we don't have the population correlation matrix P, which we used for all the computations and methods covered before. What are we going to do?
• Of course, we're going to pretend like the problem isn't there and we'll start by doing things in the exact same way.

III: The model does not hold perfectly. We don't know Ψ² or P
• Again, we will obtain some eigenvalues and some eigenvectors. However, in this case (not having a population correlation matrix, not being sure the model holds exactly in the population), we will generally not obtain an eigen-solution where the (p − m) smallest eigenvalues are zero.
• Thus, we cannot rely on the number of non-zero eigenvalues to show us the "true" number of factors (m). We will have to choose m ourselves beforehand, based on our best judgement.

III: The model does not hold perfectly. We don't know Ψ² or P
• Thus, having chosen the number m beforehand, we will take the m largest eigenvalues and their corresponding m eigenvectors.
• Just like before, we will take the square roots of the eigenvalues, sort them by size and place them in a diagonal matrix D̂_m^{1/2}.
• And, just like before, we will create a matrix Û_m with the corresponding eigenvectors as columns.

III: The model does not hold perfectly. We don't know Ψ² or P
• Then, we can use the eigenvalue and eigenvector matrices to compute our estimate of the factor loadings:
Λ̂ = Û_m D̂_m^{1/2}
• The Λ̂ obtained in this way minimizes the residual sum of squares (RSS):
RSS = ½ ∑_{j=1}^{p} ∑_{k=1}^{p} ([R − Ψ̂² − Λ̂Λ̂']_jk)²

III: The model does not hold perfectly. We don't know Ψ² or P
• This Λ̂ results in the minimum sum of squared residuals, conditional on the given set of prior communality estimates.
• This method is known as the principal factor method using prior communality estimates.

III: The model does not hold perfectly. We don't know Ψ² or P – Short review
• So, what was the principle behind the principal factor method using prior communality estimates? Let's do a short recap:
1) First, we obtain some communality estimates (like SMCs) and plug them into the diagonal of R. Thus, we get our estimate of R − Ψ².
2) Then, we obtain the eigen-solution of R − Ψ².
3) We use the eigen-solution to obtain Λ̂.
4) What we just got is a solution that minimizes the residual sum of squares (RSS), given our initial Ψ̂².
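A minimal numpy sketch of the short review above: compute SMCs from R via the standard identity SMC_j = 1 − 1/[R⁻¹]_jj, plug them into the diagonal, and take the m leading eigenpairs. The `smc` and `principal_factors` names are just made up for this sketch, and the example correlation matrix from earlier is treated here as if it were an observed R.

```python
import numpy as np

def smc(R):
    """Squared multiple correlations, via the identity SMC_j = 1 - 1 / [R^-1]_jj."""
    return 1.0 - 1.0 / np.diag(np.linalg.inv(R))

def principal_factors(R, m, communalities=None):
    """Principal factor method with prior communality estimates (non-iterative):
    put the communalities on the diagonal of R and take the m leading eigenpairs."""
    h2 = smc(R) if communalities is None else communalities   # Guttman's lower bound as default
    R_reduced = R.copy()
    np.fill_diagonal(R_reduced, h2)
    evals, evecs = np.linalg.eigh(R_reduced)       # eigenvalues in ascending order
    idx = np.argsort(evals)[::-1][:m]              # indices of the m largest eigenvalues
    return evecs[:, idx] * np.sqrt(np.clip(evals[idx], 0, None))   # Lambda = U_m D_m^(1/2)

# The example correlation matrix, treated as if it were an observed R
R = np.array([[1.00, .49, .14, .48],
              [ .49, 1.00, .07, .42],
              [ .14, .07, 1.00, .48],
              [ .48, .42, .48, 1.00]])

L = principal_factors(R, m=2)
print(np.round(smc(R), 3))                         # prior communality estimates
print(np.round(L, 3))                              # loading estimates

resid = R - L @ L.T                                # off-diagonal residuals
np.fill_diagonal(resid, 0)                         # (the diagonal holds uniquenesses, not residuals)
print(np.round(resid, 3))
```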
Iterative procedure
• We will start by doing things the same way we did previously, using the principal factors method:
1) First, we obtain some communality estimates (like SMCs) and plug them into the diagonal of R. Thus, we get our estimate of R − Ψ².
2) Then, we obtain the eigen-solution of R − Ψ².
3) We use the eigen-solution to obtain Λ̂.
• …but we won't end here. We will use the computed Λ̂ to obtain new communality estimates by summing the squared elements in each row of Λ̂ (the diagonal elements of Λ̂Λ̂').

Iterative procedure
• We take the new communality estimates and plug them into the diagonal of R. Thus, we get a new R − Ψ².
• Again, we obtain the eigen-solution of this new R − Ψ² and use it to compute a new Λ̂.
• …and repeat (use the newly computed Λ̂ to again obtain new communality estimates). We continue this process until the communalities obtained in successive iterations do not differ by more than some pre-set criterion (the convergence criterion).

Iterative procedure
• That's really all there is (in principle) to OLS.
• By the way, the RSS function (the formula we saw before) is a discrepancy function – it quantifies the distance between the observed and model-implied correlation matrices. In other words, it expresses the degree of model misfit.
• Being a discrepancy function, it is always greater than or equal to zero, and it is zero only when the observed and model-implied correlation matrices are the same.

Heywood cases
• One nasty thing can happen when using OLS estimation.
• Some communalities can, in the course of the iterations, become greater than one. Conversely, the unique variances can become less than zero (because in a standardized solution, the communality and the unique variance of an MV add up to one).
• But there's no such thing as negative variance. Thus, such a solution would be nonsensical and unacceptable. We call these occurrences Heywood cases.
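A self-contained sketch of the iterative version (iterated principal factors, OLS/ULS-style), including a crude check for Heywood cases. This is an illustration under the simplifications used in these slides, not production code; real software (e.g., minres implementations) handles the details more carefully.

```python
import numpy as np

def iterative_principal_factors(R, m, tol=1e-6, max_iter=1000):
    """Iterated principal factors: start from SMC communalities, then refit the
    communalities from the row sums of squared loadings until convergence."""
    h2 = 1.0 - 1.0 / np.diag(np.linalg.inv(R))    # SMC starting values
    L = None
    for _ in range(max_iter):
        Rr = R.copy()
        np.fill_diagonal(Rr, h2)                  # reduced correlation matrix
        evals, evecs = np.linalg.eigh(Rr)
        idx = np.argsort(evals)[::-1][:m]         # m largest eigenvalues
        L = evecs[:, idx] * np.sqrt(np.clip(evals[idx], 0, None))
        h2_new = np.sum(L**2, axis=1)             # updated communalities = diag(L L')
        if np.any(h2_new >= 1):                   # Heywood case: impossible communality
            print("Warning: Heywood case (communality >= 1)")
            h2_new = np.clip(h2_new, 0, 0.999)
        converged = np.max(np.abs(h2_new - h2)) < tol
        h2 = h2_new
        if converged:
            break
    return L, h2

R = np.array([[1.00, .49, .14, .48],
              [ .49, 1.00, .07, .42],
              [ .14, .07, 1.00, .48],
              [ .48, .42, .48, 1.00]])
L, h2 = iterative_principal_factors(R, m=2)
print(np.round(h2, 3))   # should end up roughly at .50, .49, .50, .72 for this matrix
```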
Summary
• We considered multiple scenarios of fitting the model to data. Let's do a quick review.
• 1) You know P and you know Ψ². You can obtain the eigen-solution of (P − Ψ²) to compute Λ.
….however, this will never be the case in practice.

Summary
• 2) You know P but you do not know Ψ². You can estimate communalities using SMCs and plug them into the diagonal of P to obtain (P − Ψ²). Afterwards, you obtain the eigen-solution of (P − Ψ²) to obtain Λ.
….however, this will also never be the case in practice.

Summary
• 3) You do not know P and you do not know Ψ². All you have is R. You can estimate communalities using SMCs and plug them into the diagonal of R to obtain (R − Ψ̂²). Obtain the eigen-solution of (R − Ψ̂²) to get Λ̂.
….the solution minimizes RSS given your original Ψ̂². This can happen very often in practice, although we would normally use the better option coming up next.

Summary
• 4) You do not know P and you do not know Ψ². All you have is R. You can estimate communalities using SMCs and plug them into the diagonal of R to obtain (R − Ψ̂²). Obtain the eigen-solution of (R − Ψ̂²) to get Λ̂. Use the computed Λ̂ to obtain new communality estimates from the diagonal of Λ̂Λ̂'. Return to the beginning with the fresh communality estimates; repeat until convergence.

Intermezzo
• Phew! We've covered a lot of ground, right?
• And that was still just the foundation of the unrestricted FA.
• Now, let's look at stuff directly applicable to the restricted FA, which is the main variation of FA one should learn / use.

Hints
1/ He was 23 when the photo was taken
2/ The year before that, he created the most potent estimation tool in statistics
3/ He was a professor of eugenics
4/ He opposed Bayesian stats
5/ He coined the "sexy son hypothesis"
6/ He popularized the t-distribution
7/ He created the ANOVA
It's Ronald Fisher!

Maximum Likelihood upsides
• ML is an ingenious method of estimating parameters.
• It has beautiful properties:
• Consistency = with increasing N, ML estimates (MLEs) converge to the true values
• Efficiency = MLE is the "best" way to get these estimates
• Asymptotic normality = with increasing N, MLEs are normally distributed
• But this only holds when the assumptions are met.

Maximum Likelihood downsides
• Assumptions of ML:
• Large samples! (e.g., Gorsuch claims that for FA, N = 200 is probably enough if you have few variables, while N = 50 is not)
• Multivariate normality (i.e., the variables together are distributed normally)
• But you can assume any kind of distribution
• So it's a good idea to check N and the distributions, but, luckily, MLE seems to be pretty robust against violations of the distributional assumptions at least.
• However, MLE is very computationally intensive!
Multivariate normality
• The n-dimensional generalization of univariate normality.
• The variables are normal together.
• When they are MVN, they are also normal each on their own.
• But all of them being normal on their own does not imply they are MVN.
• Weird stuff, right?

Probability density
• Coming back to the 1-dimensional world for simplicity, we can describe any normal distribution using its mean and variance.
• After choosing these values, we can calculate the probability density of x having a certain value, given a normal distribution with the preset mean and variance:
f(x) = 1/√(2πσ²) · exp(−(x − μ)² / (2σ²)) = normalizing constant · exp(−z²/2)

Probability density
• Well, and then we simply plug in the values.
• Let's say we are interested in the density at x = 5 given mean = 3 and variance = 10:
f(5 | μ = 3, σ² = 10) = 1/√(2π·10) · exp(−(5 − 3)² / (2·10)) ≈ .10
See? e-z (pun intended)

The concept of likelihood
• Likelihood is about turning this problem around.
• You usually don't know the parameters of the distribution, right? But you know the values. So you can ask: What set of parameters makes my observations the likeliest (most probable)?

Maximizing the likelihood
• We have a set of observations x and are interested in guessing what normal distribution they came from.
• This requires a slight change in the formula:
L(x | μ, σ²) = ∏_{i=1}^{N} 1/√(2πσ²) · exp(−(x_i − μ)² / (2σ²))
and, taking the log,
ln L = N ln(1/√(2πσ²)) − ∑_{i=1}^{N} (x_i − μ)² / (2σ²)

Maximizing the likelihood
• This badass formula creates a function for which we can find a maximum.
• The X axes are the mean and the variance.
• The Y axis is the likelihood value.
• The top of the mountain is the maximum likelihood = where we find the likeliest combination of parameters.

Computing maximum likelihood
• If you now imagine extending this idea into n dimensions, you probably see that this forms an n-dimensional likelihood monster onto which we need to climb to find the top.
• Computing it directly (using derivatives, as shown in Mulaik) is possible only in simple cases.
• Otherwise, one needs to use an iterative algorithm.

MLE iterations
• These algorithms are mostly concerned with navigating the parameter space efficiently.
• You see, the mountain is all the possible values our parameters can take.
• And there are rules to this madness – if you stumble upon a steep climb, you are probably on a good track to finding the top!
• So the algorithm evaluates the steepness and curvature of the climb (the first and second derivatives) at a given point and then jumps in a certain direction according to the results.

MLE iterations
• It's basically a game of hide and seek.
• The goal is to find the top, and the steepness of the n-dimensional mountain tells you whether you are getting hotter or colder.
• Isn't that beautiful?
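A short scipy sketch that (a) reproduces the worked density value above and (b) finds the MLE of μ and σ for a simulated sample by literally climbing the log-likelihood with a general-purpose optimizer; `norm` and `minimize` are real scipy functions, everything else (sample, starting values) is made up for illustration.

```python
import numpy as np
from scipy.stats import norm
from scipy.optimize import minimize

# The worked density value from the slide: f(5 | mu = 3, sigma^2 = 10)
print(round(norm.pdf(5, loc=3, scale=np.sqrt(10)), 3))   # ~ 0.103

# Maximum likelihood "by climbing the mountain": minimize the negative log-likelihood
rng = np.random.default_rng(7)
x = rng.normal(loc=3, scale=np.sqrt(10), size=500)       # a sample from N(3, 10)

def negloglik(theta):
    mu, log_sigma = theta                                # optimize log(sigma) to keep sigma > 0
    return -np.sum(norm.logpdf(x, loc=mu, scale=np.exp(log_sigma)))

res = minimize(negloglik, x0=[0.0, 0.0], method="Nelder-Mead")
mu_hat, sigma_hat = res.x[0], np.exp(res.x[1])
print(round(mu_hat, 2), round(sigma_hat**2, 2))          # close to the sample mean and (biased) variance
```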
MLE algorithms
• As a sidenote, this means that you should not confuse MLE as an estimation technique with the algorithms for computing the result.
• It's always the same ML whether you choose:
• Newton-Raphson
• Expectation-Maximization
• Metropolis-Hastings Robbins-Monro (MCMC)
• They only differ in how they navigate the n-dimensional mountain.

Local maxima / minima
• The technical term for the top of the mountain (the MLE) is the global maximum.
• However, in some cases, the shape of the mountain can be strange, having multiple smaller peaks along with the highest peak.
• These smaller peaks are called local maxima.
• When the algorithm mistakes such a peak for the top (thinking that you can only descend, not ascend further from that point), you get a wrong result.

Algorithms – sidenote
• By the way, if you ever heard the term Hessian matrix, that's sort of a map of this n-dimensional monster we are exploring.
• It is a matrix containing information about how steeply the monster we are finding the top of rises / descends at various points.
• Mathematically, it is a matrix of second partial derivatives at critical points.

MLE in FA
• So, the goal of ML estimation is to find Λ̂ and D̂_ψ such that the value of the −2 × log-likelihood function is minimized (the −2 × log transformation eases the computations).
• This function is again a discrepancy function – it is always larger than or equal to 0, and it is zero if and only if the model-implied correlation matrix equals the sample correlation matrix.

Maximum likelihood estimation
• The logic of ML estimation is very similar to that of (iterative) OLS:
1) Initial estimates of D̂_ψ are obtained (by SMCs or other means)
2) A maximum likelihood estimate of Λ is obtained, conditional on the estimated D̂_ψ
3) A model-implied reduced correlation matrix is obtained, which completes the first iteration
4) A new iteration is initiated with the most recent estimates of D̂_ψ
5) Iterations continue until convergence is achieved

Summary
• We have described three different methods for fitting the common factor model to sample data:
• Principal factors with prior communality estimates (non-iterative)
• Ordinary least squares (iterative principal factors)
• Maximum likelihood
• These are methods for fitting the model to data. Many more methods exist, but OLS and ML are the ones commonly available.

Summary
• But it's all about some type of discrepancy function that represents the distance between the observed and the expected.
• These functions take different forms, as they all define the best solution differently:
• OLS wants to minimize the residual sum of squares
• MLE wants to maximize the likelihood of the parameters given the observed data

Which is best?
• Minres / PA work when distributional assumptions are broken.
• ML works well with large samples and allows model comparison.
• However, there are simulation studies that really hype minres, stating that it performs the best under most conditions.
• I'd recommend trying both minres and ML and checking whether the results meaningfully differ (i.e., whether you would write a different Discussion section if you were to use the other one).
• Usually, you might not see any real difference, so no need to worry.
• This approach is called a sensitivity analysis. (A small sketch of minimizing the ML discrepancy directly – handy for exactly this kind of comparison – closes these notes below.)

What comes next?
• You've learned a LOT (and I have as well, alongside you).
• In 2 weeks, we meet for our last Vectors of Mind session (sniff).
• We will talk about applied topics like:
• Model fit assessment
• CFA vs. EFA
• Extremely (and I mean extremely) cool approaches to hypothesis testing in FA
• Good research practices concerning FA
• Reporting standards for CFA
• Bifactor models / ordinal FA
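Finally, to close the estimation story, a bare-bones sketch of what "minimizing the ML discrepancy" can look like in code: F_ML = log|Σ̂| + tr(RΣ̂⁻¹) − log|R| − p, with Σ̂ = Λ̂Λ̂' + Ψ̂², handed to a general-purpose optimizer. This ignores identification constraints (so the returned loadings are just one member of the rotationally equivalent set), uses made-up starting values, and is not how real FA software is implemented; comparing its solution with the OLS sketch shown earlier is exactly the kind of sensitivity analysis recommended above.

```python
import numpy as np
from scipy.optimize import minimize

def ml_factor_analysis(R, m, seed=0):
    """Minimize the ML discrepancy F = log|Sigma| + tr(R Sigma^-1) - log|R| - p
    over Lambda (p x m) and the log unique variances. Illustration only."""
    p = R.shape[0]
    _, logdetR = np.linalg.slogdet(R)
    rng = np.random.default_rng(seed)

    def discrepancy(theta):
        L = theta[:p * m].reshape(p, m)
        psi2 = np.exp(theta[p * m:])                   # keep unique variances positive
        Sigma = L @ L.T + np.diag(psi2)                # model-implied correlation matrix
        sign, logdetS = np.linalg.slogdet(Sigma)
        if sign <= 0:
            return np.inf
        return logdetS + np.trace(R @ np.linalg.inv(Sigma)) - logdetR - p

    # Small random starting loadings (to break the symmetry between factors)
    theta0 = np.concatenate([rng.uniform(0.1, 0.6, p * m), np.log(np.full(p, 0.5))])
    res = minimize(discrepancy, theta0, method="L-BFGS-B")
    return res.x[:p * m].reshape(p, m), np.exp(res.x[p * m:]), res.fun

R = np.array([[1.00, .49, .14, .48],
              [ .49, 1.00, .07, .42],
              [ .14, .07, 1.00, .48],
              [ .48, .42, .48, 1.00]])
L, psi2, F = ml_factor_analysis(R, m=2)
print(np.round(psi2, 2))     # ~ .50, .51, .50, .28 (the model fits this matrix exactly)
print(round(F, 4))           # ~ 0: zero discrepancy at the solution
```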