Vectors of Mind III
Assume you have a subtitle

Contents
• Random variables
• Fundamental Theorem of FA
• Rotational indeterminacy
• Estimation
• Principal factors
• Iterative principal factors (i.e., OLS / ULS / minres)
• Maximum likelihood
• We're gonna see many easy, approachable, and friendly-looking equations
• And lots of simplifications that make things even clearer

Random variables

Random variables
• A variable whose values are samples from a probability distribution
• It doesn't have one true value; the probability distribution is the variable, as it represents the random process behind the variable
• Example: X ~ N(μ, σ²) — X is normally distributed with mean μ and variance σ²

Expected value
• A long-run average of a random variable
X ~ N(μ, σ²)
E(X) = μ

Expected values – sidenote
• By the way! You know what is defined as an expected value?
• The CTT true score!
X = τ + ε
τ = E(X)
In other words, the true score is defined as the long-run average of the raw score.

Expected value – constants
• The expectation of a constant is the constant itself (because it's not a random variable)
E(C) = C

Expected values – variance
• Now, consider the (scalar) formula for the variance of a random variable:
σ² = ∑(x − μ)² / N
…which is the "mean squared deviation from the mean", right?
• As an expected value: σ² = E[(X − μ)²]

Expected values – variance/covariance matrix
C_x = E[(x − μ)(x − μ)']
• Expanding, we get the expectation of the outer product of the p × 1 column vector of deviations with its transpose (a 1 × p row vector):
[(x₁ − μ₁), (x₂ − μ₂), …, (x_p − μ_p)]' [(x₁ − μ₁), (x₂ − μ₂), …, (x_p − μ_p)]
• …which gives us the variance/covariance matrix of the manifest variables

Expected values – centered variables
• For centered variables (means = 0), we can omit the μs:
C_x = E[xx']

Expected values – summary
For normally distributed random variables: X ~ N(μ, σ²), E(X) = μ
For non-random variables: E(C) = C
To get a variance-covariance matrix: C_x = E[(x − μ)(x − μ)']
For centered variables: C_x = E[xx']
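A small numpy illustration of the expectation-as-long-run-average idea and the outer-product form of the covariance matrix; the mean vector and covariance used for the simulation are made up for this sketch.

```python
import numpy as np

rng = np.random.default_rng(0)
n, p = 200_000, 3

mu    = np.array([1., 2., 3.])          # made-up mean vector
A     = rng.normal(size=(p, p))
Sigma = A @ A.T                          # made-up (positive definite) covariance matrix
X     = rng.multivariate_normal(mu, Sigma, size=n)

# E[x] as a long-run average:
print(np.round(X.mean(axis=0), 2))       # ~ [1, 2, 3]

# C_x = E[(x - mu)(x - mu)']: average the outer products of the deviation vectors
dev   = X - mu
C_hat = dev.T @ dev / n                  # long-run average of (x - mu)(x - mu)'
print(np.round(C_hat - Sigma, 2))        # ~ 0 up to sampling error (this is what np.cov estimates)
```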
Deriving the fundamental theorem of FA

The data model in factor analysis
• Recall the way we formulated the Common Factor Model earlier – we expressed the MVs as a linear function of the common factors and the unique factors:
x_ij = μ_j + λ_j1 z_i1 + λ_j2 z_i2 + ⋯ + λ_jm z_im + 1·u_ij
(mean + common factor part + unique factor part)
x_ij = μ_j + ∑_{k=1}^{m} λ_jk z_ik + u_ij

The data model in factor analysis
x_ij = μ_j + ∑_{k=1}^{m} λ_jk z_ik + u_ij
Where:
x_ij is the score of person i on manifest variable j
μ_j is the mean of manifest variable j
z_ik is the common factor score of person i on factor k
λ_jk is the factor loading of manifest variable j on factor k
u_ij is the unique factor score of person i on unique factor j; and u_ij = s_ij + e_ij
s_ij is the factor score of person i on specific factor j
e_ij is the error term for person i on manifest variable j

The data model in factor analysis
• We will consider the model as operating in a population, and thus we will consider the data model for a random individual by omitting the subscript i:
x_j = μ_j + λ_j1 z_1 + λ_j2 z_2 + ⋯ + λ_jm z_m + 1·u_j
x_j = μ_j + ∑_{k=1}^{m} λ_jk z_k + u_j
• Here we actually have p equations, one for each manifest variable x_1, …, x_p, but we can express it all as a single equation using matrix notation:
x = μ + Λz + u

The data model in factor analysis
x = μ + Λz + u
Where:
x is a p × 1 vector of a random person's scores on the p manifest variables
μ is a p × 1 vector of population means of the p manifest variables
Λ is a p × m matrix of factor loadings, where p > m (a rectangular matrix)
z is an m × 1 vector of (unobservable) common factor scores
u is a p × 1 vector of (unobservable) unique factor scores

The data model in factor analysis
x = μ + Λz + u
• For illustration, let's extract the equation for the third manifest variable. Let's assume that m = 3 (there are three common factors):
x₃ = μ₃ + [λ₃₁ λ₃₂ λ₃₃] [z₁ z₂ z₃]' + u₃
x₃ = μ₃ + λ₃₁z₁ + λ₃₂z₂ + λ₃₃z₃ + u₃

The data model in factor analysis
• The data model represents a random observation in the population. It is intended to explain the structure of the raw data (i.e., the scores on manifest variables).
• However, it contains a LOT of unknowns.
• While we observe the manifest variables x and we can at least estimate the population means μ, the remaining terms in the equation are unknown to us.
• We do not know the latent scores z and u; in fact, we cannot know them, since latent variables are unobservable.
• Similarly, we do not know Λ, the matrix of factor loadings – we are unaware of how the unobservable latent variables affect the (observable) manifest variables.

The data model in factor analysis
• Well, that's kind of a pickle.
• So, do we just, like, go home now?
• Maybe. Or we can help ourselves with some tricks.
• We have already established that the latent variable scores are unobservable, so we might want to give up on trying to solve for them in the data model equation.
• Maybe if we turn the problem around, we can get rid of z and u completely and focus on Λ.

The data model in factor analysis
• We could use the data model, along with some assumptions, to derive a covariance structure model.
• The data model, accompanied by assumptions about the joint distribution of the elements in z and u, implies a model for the population covariance matrix.
• The model for the covariance matrix is known as the covariance structure and is intended to explain the variances and covariances of the manifest variables, not the raw data.
• Before we proceed to derive the covariance structure model, we'll talk about the important distributional assumptions and lay down some notational rules.

Assumptions
• We will make the following assumptions about the common factors z and unique factors u:
1. The common factors and the unique factors are independently distributed. As such, the common factors are uncorrelated with the unique factors. In other words, Σ_zu = 0 = Σ'_uz.
2. The unique factors are mutually independent. As such, the unique factors for different MVs are uncorrelated with each other. This implies that the covariance matrix Σ_uu is diagonal.
3. The common factors and the unique factors are standardized to have means of zero.
4. The common factors are also standardized to have unit variances (variances of 1).
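A tiny sketch of the data model in code, with made-up μ, Λ, z and u, just to show that the matrix equation is nothing more than the p scalar equations stacked on top of each other (here checking the third MV, as on the slide above).

```python
import numpy as np

rng = np.random.default_rng(0)
p, m = 4, 3                                     # 4 manifest variables, 3 common factors

mu = np.array([10., 12., 8., 15.])              # made-up MV means
L  = rng.uniform(-1, 1, size=(p, m))            # made-up p x m loading matrix Lambda
z  = rng.normal(size=m)                         # one person's common factor scores
u  = rng.normal(size=p)                         # one person's unique factor scores

x = mu + L @ z + u                              # the data model: x = mu + Lambda z + u

# The matrix equation is just p scalar equations stacked; e.g. the third MV:
x3 = mu[2] + L[2, :] @ z + u[2]                 # x_3 = mu_3 + lambda_31 z_1 + ... + u_3
print(np.isclose(x[2], x3))                     # True
```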
Assumptions – recap
Common factors:
Since E(z) = 0, then E(zz') = Σ_zz.
Also, because the common factors have unit variances, E(zz') = Σ_zz = R_zz.

Assumptions – recap
Unique factors:
Since E(u) = 0, then E(uu') = Σ_uu.
Also, for standardized unique factors, E(uu') = Σ_uu = R_uu = I.

Assumptions – recap
Relationship between unique and common factors: R_zu = 0 = R'_uz

Deriving the mean structure
• The mean and covariance structures are derived from the data model:
x = μ + Λz + u
• Let's derive the mean structure first. We want an equation that represents the mean vector μ of the manifest variables.

Deriving the mean structure
• If we take the expectation of both sides of the equation above, we get:
E(x) = E(μ) + E(Λz) + E(u)
E(x) = μ + ΛE(z) + E(u)
• Given the assumptions we previously talked about, this follows:
μ_x = μ + Λ0 + 0
μ_x = μ
Therefore, the means do not depend on the model!

Deriving the covariance structure
• Alright. Let's consider the derivation of the covariance structure.
• We know that the scores on the MVs are supposed to be weighted linear combinations of common factors and unique factors:
X = ΛZ + ΨU
• Right? Regression style. (Here Z and U hold standardized factor scores and Ψ is a diagonal matrix of unique-factor weights, so ΨU plays the role of u from the data model.)

Deriving the covariance structure
• First, let's standardize the vector of MVs to make our lives easier.
• This way, Σ_xx becomes R_xx.
X = ΛZ + ΨU
• We also know that in this case, R_xx = E(XX')
• So:
R_xx = E[(ΛZ + ΨU)(ΛZ + ΨU)']

Deriving the covariance structure
R_xx = E[(ΛZ + ΨU)(ΛZ + ΨU)']
• Now, let's transpose the second bracket (it simply transposes all the elements while switching the positions of products):
R_xx = E[(ΛZ + ΨU)(Z'Λ' + U'Ψ')]
• Now, let's multiply the contents of the expectation:
R_xx = E[ΛZZ'Λ' + ΛZU'Ψ' + ΨUZ'Λ' + ΨUU'Ψ']

Deriving the covariance structure
R_xx = E[ΛZZ'Λ' + ΛZU'Ψ' + ΨUZ'Λ' + ΨUU'Ψ']
• Now, we just get rid of the expectation (just like I got rid of expecting to finish my Ph.D. on time) by distributing it over the terms:
R_xx = E(Λ)E(ZZ')E(Λ') + E(Λ)E(ZU')E(Ψ') + E(Ψ)E(UZ')E(Λ') + E(Ψ)E(UU')E(Ψ')
• All the factor weights (loadings) Λ and Ψ are constants:
R_xx = ΛE(ZZ')Λ' + ΛE(ZU')Ψ' + ΨE(UZ')Λ' + ΨE(UU')Ψ'

Deriving the covariance structure
• E(ZZ') is the variance-covariance matrix of the common factors, Σ_zz
• E(UU') is the variance-covariance matrix of the unique factors, Σ_uu
• E(ZU') and E(UZ') are the covariance matrices between the common and unique factors, Σ_zu and Σ_uz
So, this:
R_xx = ΛE(ZZ')Λ' + ΛE(ZU')Ψ' + ΨE(UZ')Λ' + ΨE(UU')Ψ'
• Becomes this:
R_xx = ΛΣ_zzΛ' + ΛΣ_zuΨ' + ΨΣ_uzΛ' + ΨΣ_uuΨ'

Deriving the covariance structure
R_xx = ΛΣ_zzΛ' + ΛΣ_zuΨ' + ΨΣ_uzΛ' + ΨΣ_uuΨ'
• Phew! Now what?
• We get help from our favourite superhero – Mr. Assumptionman – of course! (He's bold because he's a matrix. Wait what, that doesn't make sense)

Deriving the covariance structure
R_xx = ΛΣ_zzΛ' + ΛΣ_zuΨ' + ΨΣ_uzΛ' + ΨΣ_uuΨ'
• We know some stuff about those sigmas:
• Σ_zu = 0 = Σ_uz
• Σ_uu = I
R_xx = ΛΣ_zzΛ' + Λ0Ψ' + Ψ0Λ' + ΨIΨ'
R_xx = ΛΣ_zzΛ' + Ψ²

Notation
• We will use the following notation from now on:
The manifest variable covariance matrix: Σ = Σ_xx
The common factor covariance matrix: Φ = Σ_zz
The unique factor covariance matrix: D_ψ = Σ_uu
• Note that (because of the assumptions we made) the diagonal elements of Φ are required to be equal to 1. Thus, Φ is a factor correlation matrix.
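Before stating the result formally, here is a quick Monte Carlo sanity check of the covariance structure we just derived: simulate standardized, uncorrelated Z and U, build X = ΛZ + ΨU, and compare the empirical correlation matrix with ΛΣ_zzΛ' + Ψ². The loading and uniqueness values are the ones from the running example; any Λ with communality + uniqueness = 1 per row would do.

```python
import numpy as np

rng = np.random.default_rng(42)
n, p, m = 200_000, 4, 2

L   = np.array([[.70, .10],                     # factor loadings (orthogonal factors)
                [.70, .00],
                [.10, .70],
                [.60, .60]])
Psi = np.diag(np.sqrt([.50, .51, .50, .28]))    # unique-factor weights (SDs)

Z = rng.normal(size=(n, m))                     # standardized, uncorrelated common factors
U = rng.normal(size=(n, p))                     # standardized, mutually independent unique factors
X = Z @ L.T + U @ Psi.T                         # X = Lambda Z + Psi U, observation by observation

R_emp   = np.corrcoef(X, rowvar=False)          # "observed" correlation matrix
R_model = L @ L.T + Psi @ Psi.T                 # Lambda Sigma_zz Lambda' + Psi^2 (Sigma_zz = I)
print(np.round(R_emp - R_model, 3))             # ~ 0 up to sampling error
```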
Fundamental Theorem of FA
R_xx = ΛΦΛ' + Ψ²
• For Σ: Σ = ΛΦΛ' + D_ψ
Ψ² is a diagonal matrix with uniquenesses on the diagonal.
D_ψ is a diagonal matrix with unique variances on the diagonal (to scale the resulting correlation matrix into a variance-covariance matrix).

Implications of this
1. We don't need to know the latent scores (common or unique) to reproduce the variance-covariance matrix of MVs! Whaaaat!

Implications of this
2. The model assumes that all correlations between MVs are due only to the effect of the common factors:
E(x̃x̃') = Σ̃ = ΛΦΛ'
or E(x̃x̃') = Σ̃ = ΛΛ' for uncorrelated factors
(x̃ is the common part of the MVs and Σ̃ is the reduced covariance matrix it implies.)
• How well this assumption corresponds with reality is what model fit testing is all about. It's sometimes called the local independence assumption.
• This also implies that λ is a regression coefficient. It tells us how the value of x changes with a unit change in the factor score. Which is what we started with, so it's probably not a surprise.

Implications of this
3. We can use the information from Ψ² to create the so-called reduced correlation matrix, which has communalities (1 − uniqueness) on the diagonal and MV correlations off the diagonal:
R_xx = ΛΦΛ' + Ψ²
R_xx − Ψ² = ΛΦΛ'
• This will be useful later on.

Implications of this
4. Λ has different elements for R_xx and Σ.
• As we already covered, Ψ² contains uniquenesses while D_ψ contains (unscaled) unique variances.
• Similarly, we distinguish Λ and Λ*, the latter containing standardized factor loadings λ*:
λ*_jk = λ_jk / σ_j
Or in matrix-speak: Λ* = D^{−1/2} Λ, where D^{−1/2} is a diagonal matrix with reciprocal SDs (1/SD) of the MVs on the diagonal.

An example
• Consider the covariance structure for uncorrelated (orthogonal) factors to better understand the relationship between the elements of Σ and the elements of Λ and D_ψ.

An example
Σ (4 × 4, lower triangle: σ₁₁; σ₂₁ σ₂₂; σ₃₁ σ₃₂ σ₃₃; σ₄₁ σ₄₂ σ₄₃ σ₄₄) = ΛΛ' + D_ψ, with
Λ =
λ₁₁ λ₁₂
λ₂₁ λ₂₂
λ₃₁ λ₃₂
λ₄₁ λ₄₂
and D_ψ = diag(ψ₁₁, ψ₂₂, ψ₃₃, ψ₄₄).
• This shows us that σ₁₁ = λ₁₁² + λ₁₂² + ψ₁₁
• Also, σ₂₁ = λ₂₁λ₁₁ + λ₂₂λ₁₂
• The covariance between two MVs is the sum of the products of their loadings on the common factors.
• The variance of an MV is the sum of its squared loadings and its unique factor variance.

An example
• The factor loading matrix is the one from the example data (two factors):
Λ =
.70  .10
.70  .00
.10  .70
.60  .60
• The covariance between PC and VO:
σ₂₁ = λ₂₁λ₁₁ + λ₂₂λ₁₂ = 0.7 · 0.7 + 0.0 · 0.1 = 0.49
• Let's compute the communality and the unique variance of PC by hand.

Communality
• The j-th diagonal element ψ_jj of D_ψ is the j-th unique variance. The j-th communality (the proportion of variance of MV j due to the common factors) can be written as:
h²_j = [ΛΦΛ']_jj / σ_jj = 1 − ψ_jj / σ_jj
• If the factors are uncorrelated, then:
h²_j = [ΛΛ']_jj / σ_jj = 1 − ψ_jj / σ_jj
…that is, the sum of squares of row j of Λ divided by the variance of the j-th MV.

Recap
• We wanted to predict the MVs as a weighted combination of common and unique factor scores:
X = ΛZ + ΨU
• But we don't know the scores, so, instead, we look at their covariance structure.
• That's how we got here:
R_xx = ΛΦΛ' + Ψ²

Recap
R_xx = ΛΦΛ' + Ψ²
• Λ = matrix of factor loadings (p × m)
• Φ = matrix of factor correlations (m × m)
• Ψ² = diagonal matrix of uniquenesses (p × p)
• You can also scale this into a covariance equation, as shown before.
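A quick numpy check of the example above, assuming the standardized orthogonal case (so the uniquenesses are simply 1 minus the communalities):

```python
import numpy as np

# Example loading matrix (two orthogonal factors); uniquenesses = 1 - communalities
L    = np.array([[.70, .10],
                 [.70, .00],
                 [.10, .70],
                 [.60, .60]])
h2   = np.sum(L**2, axis=1)                    # communalities = row sums of squared loadings
psi2 = 1 - h2                                  # uniquenesses

Sigma = L @ L.T + np.diag(psi2)                # Lambda Lambda' + Psi^2 (orthogonal factors)
print(np.round(Sigma, 2))
print(round(Sigma[1, 0], 2))                   # sigma_21 = .7*.7 + .0*.1 = 0.49

print(np.round(h2, 2))                         # communalities: .50 .49 .50 .72
print(np.round(psi2, 2))                       # unique variances: .50 .51 .50 .28
```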
Estimation

Estimation of FA parameters
• Now we need to think about how to find the values to place into the model matrices Λ, Φ, Ψ².
• Let's start with an ideal scenario where the factors are uncorrelated (so Φ = I) and our observed correlation matrix is the population correlation matrix P (i.e., there is no sampling error involved).

Rotational indeterminacy
P = ΛΛ' + D_ψ
• But wait! Even in this scenario, things are weird:
P = Λ₁Λ₁' + D_ψ = Λ₂Λ₂' + D_ψ = Λ₃Λ₃' + D_ψ = ⋯
What the hell! Are you telling me there are infinite possible Λs?

Rotational indeterminacy
• Hell yeah! If you have more than one factor, there is no unique solution to be found.
• Suppose I'm not wrong and it indeed holds that:
P = Λ₁Λ₁' + D_ψ = Λ₂Λ₂' + D_ψ
(we're just considering two solutions now, but there are infinitely many)
• In that case, one solution (Λ₂) has to be linked in some way with the other (Λ₁). To be precise, Λ₂ = Λ₁T, where T is an m × m orthogonal matrix (TT' = I).

Rotational indeterminacy
Λ₂Λ₂' = Λ₁T(Λ₁T)'
Λ₂Λ₂' = Λ₁T(T'Λ₁')
Λ₂Λ₂' = Λ₁TT'Λ₁'
Λ₂Λ₂' = Λ₁IΛ₁'
Λ₂Λ₂' = Λ₁Λ₁'
• See? Λ₁ and Λ₂ are equally fine solutions.

Rotational indeterminacy
• In other words, if we can find one solution, we can find other alternative solutions. We simply choose any matrix T such that TT' = I and we define Λ₂ = Λ₁T.
• We've just seen that Λ₁ and Λ₂ are equally good solutions, since Λ₂Λ₂' = Λ₁Λ₁'.
• This is called rotational indeterminacy. (There is a short numeric demo of this just before the worked example below.)

Rotational indeterminacy
• We must resolve this problem somehow if we want to find a single, unique solution for Λ every time we perform a factor analysis.
• In other words, we need to find a criterion for defining this unique solution.
• Luckily, we can arrive at a solution with the help of eigenvalues and eigenvectors.

Eigenvalues, eigenvectors, and Λ
• Recall that the eigenstructure of a symmetric matrix S is the following:
S = U D_l U'
…where the columns of U are eigenvectors and the diagonal elements of D_l are eigenvalues (D_l is a diagonal matrix).

I: We know P, Ψ², and the model holds perfectly
• Now, let's take a look at Hypothetical Scenario 1:
• You know the true P
• You know the unique factor variances (Ψ²), and thus you also know the communalities (the diagonal of P − Ψ²)
• The model holds perfectly in the population data

I: We know P, Ψ², and the model holds perfectly
• In this scenario, obtaining Λ is actually quite easy.
• You take the reduced correlation matrix (P − Ψ²).
• Because Ψ² is a diagonal matrix containing uniquenesses, the diagonal of (P − Ψ²) will contain (1 − uniquenesses); thus, it will contain the true communalities.

I: We know P, Ψ², and the model holds perfectly
• Perform the eigenvalue-eigenvector decomposition of (P − Ψ²), which will yield some eigenvectors U and some eigenvalues D_l.
• Order the eigenvalues by size from largest to smallest.
• The first m eigenvalues will be non-zero, the rest will be zero (why?)
• Keep these non-zero eigenvalues and their associated eigenvectors.
• Note: This is the same as PCA, only done on P − Ψ² instead of P.

I: We know P, Ψ², and the model holds perfectly
• Keep only the nonzero eigenvalues in D_l, take their square roots and put them back into a matrix we will call D_l^{1/2}.
• Then, calculate Λ = U D_l^{1/2}.
• Magic.
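Before the worked example, a tiny numeric illustration of the rotational indeterminacy above: any orthogonal T (here generated via a QR decomposition) turns one solution into another that reproduces exactly the same correlations. The 4 × 2 loading matrix is the one from the running example.

```python
import numpy as np

rng = np.random.default_rng(1)

L1 = np.array([[.70, .10],                    # one perfectly fine solution
               [.70, .00],
               [.10, .70],
               [.60, .60]])

# Build a random 2 x 2 orthogonal matrix T (via QR decomposition), so T @ T.T = I
T, _ = np.linalg.qr(rng.normal(size=(2, 2)))
print(np.allclose(T @ T.T, np.eye(2)))        # True

L2 = L1 @ T                                   # an "alternative solution"
print(np.allclose(L1 @ L1.T, L2 @ L2.T))      # True: both reproduce the same correlations
print(np.round(L2, 3))                        # ...but the loadings themselves look different
```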
I: We know P, Ψ², and the model holds perfectly
• Let's look at an example, using the example data we have seen before.
• The matrix P (the correlations among the four MVs) is given as follows:
P =
1.00  .49  .14  .48
 .49 1.00  .07  .42
 .14  .07 1.00  .48
 .48  .42  .48 1.00

I: We know P, Ψ², and the model holds perfectly
• Assume the unique variances are known:
Ψ² = diag(.50, .51, .50, .28)
• So the matrix P with communalities on the diagonal is given by:
P − Ψ² =
 .50  .49  .14  .48
 .49  .49  .07  .42
 .14  .07  .50  .48
 .48  .42  .48  .72

I: We know P, Ψ², and the model holds perfectly
• We can obtain the eigenvalues and eigenvectors of P − Ψ².
• The non-zero eigenvalues are:
D_l = diag(1.662, .548)
• And the corresponding eigenvectors:
U =
 .502  −.386
 .461  −.500
 .353   .731
 .641   .259

I: We know P, Ψ², and the model holds perfectly
• The factor loading matrix can be obtained as Λ = U D_l^{1/2}:
Λ =
 .647  −.285
 .594  −.370
 .455   .541
 .826   .192
• Wait… that's not the loading matrix I showed you last time for the example data, is it?

I: We know P, Ψ², and the model holds perfectly
• It's a transformation of the matrix I showed you earlier, in the rotational indeterminacy sense, Λ₂ = Λ₁T:
 .647  −.285      .70  .10
 .594  −.370   =  .70  .00   ×   .848  −.529
 .455   .541      .10  .70       .529   .848
 .826   .192      .60  .60
• Both Λ matrices provide an exact solution to the model. The procedure involving eigen-stuff allowed us to identify a unique solution, though.

I: We know P, Ψ², and the model holds perfectly
• Okay, so, I have just shown you how to obtain the solution (Λ) if:
• You know the population correlation matrix, P
• You know the contents of Ψ², so you know the unique variances or (conversely) the communalities
• The model holds exactly in the population
• Huh. Putting the "model holds exactly" thing aside, you will never know P and you will never know Ψ², so this is a theoretical scenario.
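A quick numpy reproduction of this worked example. The sign of each eigenvector is arbitrary, so the code flips signs to match the slides; the reproduced matrix ΛΛ' is unaffected by those signs.

```python
import numpy as np

# Reduced correlation matrix P - Psi^2 from the slides (communalities on the diagonal)
P_red = np.array([[.50, .49, .14, .48],
                  [.49, .49, .07, .42],
                  [.14, .07, .50, .48],
                  [.48, .42, .48, .72]])

evals, evecs = np.linalg.eigh(P_red)            # eigenvalues in ascending order
order = np.argsort(evals)[::-1]                 # sort descending
evals, evecs = evals[order], evecs[:, order]
print(np.round(evals, 3))                       # ~ [1.662, 0.548, 0, 0]

m = 2
L = evecs[:, :m] * np.sqrt(evals[:m])           # Lambda = U D_l^(1/2)
L *= np.sign(L[3, :])                           # fix arbitrary eigenvector signs (row 4 positive on the slide)
print(np.round(L, 3))                           # ~ .647/-.285, .594/-.370, .455/.541, .826/.192
print(np.allclose(L @ L.T, P_red))              # True: exact reconstruction of P - Psi^2

# The "nice" loadings from earlier are a rotation of this solution: Lambda_1 = Lambda T'
T = np.array([[.848, -.529],
              [.529,  .848]])
print(np.round(L @ T.T, 2))                     # ~ the original .70/.10, .70/.00, .10/.70, .60/.60
```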
II: We know P and the model holds perfectly. We don't know Ψ²
• As I said, the solution obtained by doing the eigen-decomposition of P − Ψ² requires that you know either the unique variances or the communalities (once you know one, you know the other, right?).
• But we don't know these, since finding out what they are is a part of the problem we face.
• When factor analysis was young, this was called the "communality problem".
• Many solutions were suggested to the communality problem.
• The one that "won" (was and is the most widely used) was suggested by Louis Guttman in 1940.
• Guttman suggested squared multiple correlations (SMCs) as the initial approximations to communalities.

II: We know P and the model holds perfectly. We don't know Ψ²
• Just what is a squared multiple correlation (SMC)?
• Imagine you have p manifest variables. You can try to predict the j-th manifest variable from the other (p − 1) manifest variables, linear regression-style.
• This prediction will be imperfect. You can correlate the predicted values of the j-th manifest variable with the actual values of the variable. What you get is a correlation coefficient, the multiple correlation coefficient. Square it and you get the SMC.

II: We know P and the model holds perfectly. We don't know Ψ²
• Guttman showed that if the factor model applies to the population correlation matrix P, then the squared multiple correlation of the j-th manifest variable on the other (p − 1) manifest variables is a lower bound for the communality of the j-th manifest variable.
• So, not knowing the contents of D_ψ, one might approximate the manifest variable communalities with manifest variable SMCs, computed from P. These approximations can then be substituted into the diagonal of P, and one can, again, use the eigenvalue-eigenvector approach on this modified P matrix to obtain Λ.

II: We know P and the model holds perfectly. We don't know Ψ²
• However, in order to obtain the population SMCs, we need to know P in the first place. Most often, we don't.
• In practice, we can apply the same procedure to a sample correlation matrix, R, in order to obtain sample SMCs.

III: The model does not hold perfectly. We don't know Ψ² or P
• So far, we have studied factor analysis limiting ourselves to the ideal scenario in which we know the population correlation matrix, P. Moreover, we only considered the case where the model holds exactly in the population.
• Now, let's consider the real world, in which we do not have access to P but we do have access to R. In this scenario, we are not even sure the sample correlation matrix R is drawn from a population with a correlation matrix P for which the model holds.
• As before, let's just consider the uncorrelated / orthogonal model for now.

III: The model does not hold perfectly. We don't know Ψ² or P
• First of all, we should tone down the optimism. In our hypothetical scenarios, we could select Λ and Ψ² to reconstruct P perfectly:
P = ΛΛ' + Ψ²
• In reality, our estimates of Λ and Ψ², namely Λ̂ and Ψ̂², will generally not be able to exactly reproduce our sample correlation matrix R:
R ≈ Λ̂Λ̂' + Ψ̂²

III: The model does not hold perfectly. We don't know Ψ² or P
• So, what we want is a parsimonious model (m << p) that provides a relatively good approximation to the data we have observed.
• This degree of approximation (how well the model fits the data) is reflected in the residual matrix, defined as
R − (Λ̂Λ̂' + Ψ̂²)
• The residual matrix tells us how far the correlation matrix R we have observed is from the correlation matrix the model predicts. In other words, how far the observed correlation matrix is from the model-implied correlation matrix (which is simply Λ̂Λ̂' + Ψ̂²).

III: The model does not hold perfectly. We don't know Ψ² or P
• Every element in the residual matrix tells us how far the model-implied (predicted) value of that element is from its observed value.
• Alright, so – again, we don't have the population correlation matrix P, which we used for all the computations and methods covered before. What are we going to do?
• Of course, we're going to pretend like the problem isn't there and we'll start by doing things in the exact same way.

III: The model does not hold perfectly. We don't know Ψ² or P
• Again, we will obtain some eigenvalues and some eigenvectors. However, in this case (not having a population correlation matrix, not being sure the model holds exactly in the population), we will generally not obtain an eigen-solution where the (p − m) smallest eigenvalues are zero.
• Thus, we cannot rely on the number of non-zero eigenvalues to show us the "true" number of factors (m). We will have to choose m ourselves beforehand, based on our best judgement.

III: The model does not hold perfectly. We don't know Ψ² or P
• Thus, having chosen the number m beforehand, we will take the m largest eigenvalues and their corresponding m eigenvectors.
• Just like before, we will take the square roots of the eigenvalues, sort them by size and place them in a diagonal matrix D̂_m^{1/2}.
• And, just like before, we will create a matrix Û_m with the corresponding eigenvectors as columns.

III: The model does not hold perfectly. We don't know Ψ² or P
• Then, we can use the eigenvalue and eigenvector matrices to compute our estimate of the factor loadings:
Λ̂ = Û_m D̂_m^{1/2}
• The Λ̂ obtained in this way minimizes the residual sum of squares (RSS):
RSS = ½ ∑_{j=1}^{p} ∑_{k=1}^{p} ([R − Ψ̂² − Λ̂Λ̂']_jk)²

III: The model does not hold perfectly. We don't know Ψ² or P
• This Λ̂ results in the minimum sum of squared residuals, conditional on the given set of prior communality estimates.
• This method is known as the principal factor method using prior communality estimates.

III: The model does not hold perfectly. We don't know Ψ² or P – Short review
• So, what was the principle behind the principal factor method using prior communality estimates? Let's do a short recap:
1) First, we obtain some communality estimates (like SMCs) and plug them into the diagonal of R. Thus, we get our estimate of R − Ψ².
2) Then, we obtain the eigen-solution of R − Ψ².
3) We use the eigen-solution to obtain Λ̂.
4) What we just got is a solution that minimizes the residual sum of squares (RSS), given our initial Ψ̂².
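A minimal numpy sketch of the short review above: compute SMCs from R via the standard identity SMC_j = 1 − 1/[R⁻¹]_jj, plug them into the diagonal, and take the m leading eigenpairs. The `smc` and `principal_factors` names are just made up for this sketch, and the example correlation matrix from earlier is treated here as if it were an observed R.

```python
import numpy as np

def smc(R):
    """Squared multiple correlations, via the identity SMC_j = 1 - 1 / [R^-1]_jj."""
    return 1.0 - 1.0 / np.diag(np.linalg.inv(R))

def principal_factors(R, m, communalities=None):
    """Principal factor method with prior communality estimates (non-iterative):
    put the communalities on the diagonal of R and take the m leading eigenpairs."""
    h2 = smc(R) if communalities is None else communalities   # Guttman's lower bound as default
    R_reduced = R.copy()
    np.fill_diagonal(R_reduced, h2)
    evals, evecs = np.linalg.eigh(R_reduced)       # eigenvalues in ascending order
    idx = np.argsort(evals)[::-1][:m]              # indices of the m largest eigenvalues
    return evecs[:, idx] * np.sqrt(np.clip(evals[idx], 0, None))   # Lambda = U_m D_m^(1/2)

# The example correlation matrix, treated as if it were an observed R
R = np.array([[1.00, .49, .14, .48],
              [ .49, 1.00, .07, .42],
              [ .14, .07, 1.00, .48],
              [ .48, .42, .48, 1.00]])

L = principal_factors(R, m=2)
print(np.round(smc(R), 3))                         # prior communality estimates
print(np.round(L, 3))                              # loading estimates

resid = R - L @ L.T                                # off-diagonal residuals
np.fill_diagonal(resid, 0)                         # (the diagonal holds uniquenesses, not residuals)
print(np.round(resid, 3))
```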
Iterative procedure
• We will start by doing things the same way we did previously, using the principal factors method:
1) First, we obtain some communality estimates (like SMCs) and plug them into the diagonal of R. Thus, we get our estimate of R − Ψ².
2) Then, we obtain the eigen-solution of R − Ψ².
3) We use the eigen-solution to obtain Λ̂.
• …but we won't end here. We will use the computed Λ̂ to obtain new communality estimates by summing the squared elements in each row of Λ̂ (the diagonal elements of Λ̂Λ̂').

Iterative procedure
• We take the new communality estimates and plug them into the diagonal of R. Thus, we get a new R − Ψ².
• Again, we obtain the eigen-solution of this new R − Ψ² and use it to compute a new Λ̂.
• …and repeat (use the newly computed Λ̂ to again obtain new communality estimates). We continue this process until the communalities obtained in successive iterations do not differ by more than some pre-set criterion (the convergence criterion).

Iterative procedure
• That's really all there is (in principle) to OLS.
• By the way, the RSS function (the formula we saw before) is a discrepancy function – it quantifies the distance between the observed and model-implied correlation matrices. In other words, it expresses the degree of model misfit.
• Being a discrepancy function, it is always greater than or equal to zero, and it is zero only when the observed and model-implied correlation matrices are the same.

Heywood cases
• One nasty thing can happen when using OLS estimation.
• Some communalities can, in the course of the iterations, become greater than one. Conversely, the unique variances can become less than zero (because in a standardized solution, the communality and the unique variance of an MV add up to one).
• But there's no such thing as negative variance. Thus, such a solution would be nonsensical and unacceptable. We call these occurrences Heywood cases.
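A self-contained sketch of the iterative version (iterated principal factors, OLS/ULS-style), including a crude check for Heywood cases. This is an illustration under the simplifications used in these slides, not production code; real software (e.g., minres implementations) handles the details more carefully.

```python
import numpy as np

def iterative_principal_factors(R, m, tol=1e-6, max_iter=1000):
    """Iterated principal factors: start from SMC communalities, then refit the
    communalities from the row sums of squared loadings until convergence."""
    h2 = 1.0 - 1.0 / np.diag(np.linalg.inv(R))    # SMC starting values
    L = None
    for _ in range(max_iter):
        Rr = R.copy()
        np.fill_diagonal(Rr, h2)                  # reduced correlation matrix
        evals, evecs = np.linalg.eigh(Rr)
        idx = np.argsort(evals)[::-1][:m]         # m largest eigenvalues
        L = evecs[:, idx] * np.sqrt(np.clip(evals[idx], 0, None))
        h2_new = np.sum(L**2, axis=1)             # updated communalities = diag(L L')
        if np.any(h2_new >= 1):                   # Heywood case: impossible communality
            print("Warning: Heywood case (communality >= 1)")
            h2_new = np.clip(h2_new, 0, 0.999)
        converged = np.max(np.abs(h2_new - h2)) < tol
        h2 = h2_new
        if converged:
            break
    return L, h2

R = np.array([[1.00, .49, .14, .48],
              [ .49, 1.00, .07, .42],
              [ .14, .07, 1.00, .48],
              [ .48, .42, .48, 1.00]])
L, h2 = iterative_principal_factors(R, m=2)
print(np.round(h2, 3))   # should end up roughly at .50, .49, .50, .72 for this matrix
```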
Summary
• We considered multiple scenarios of fitting the model to data. Let's do a quick review.
• 1) You know P and you know Ψ². You can obtain the eigen-solution of (P − Ψ²) to compute Λ.
….however, this will never be the case in practice.

Summary
• 2) You know P but you do not know Ψ². You can estimate communalities using SMCs and plug them into the diagonal of P to obtain (P − Ψ²). Afterwards, you obtain the eigen-solution of (P − Ψ²) to obtain Λ.
….however, this will also never be the case in practice.

Summary
• 3) You do not know P and you do not know Ψ². All you have is R. You can estimate communalities using SMCs and plug them into the diagonal of R to obtain (R − Ψ̂²). Obtain the eigen-solution of (R − Ψ̂²) to get Λ̂.
….the solution minimizes RSS given your original Ψ̂². This can happen very often in practice, although we would normally use the better option coming up next.

Summary
• 4) You do not know P and you do not know Ψ². All you have is R. You can estimate communalities using SMCs and plug them into the diagonal of R to obtain (R − Ψ̂²). Obtain the eigen-solution of (R − Ψ̂²) to get Λ̂. Use the computed Λ̂ to obtain new communality estimates from the diagonal of Λ̂Λ̂'. Return to the beginning with the fresh communality estimates; repeat until convergence.

Intermezzo
• Phew! We've covered a lot of ground, right?
• And that was still just the foundation of the unrestricted FA.
• Now, let's look at stuff directly applicable to the restricted FA, which is the main variation of FA one should learn / use.

Hints
1/ He was 23 when the photo was taken
2/ The year before that, he created the most potent estimation tool in statistics
3/ He was a professor of eugenics
4/ He opposed Bayesian stats
5/ He coined the "sexy son hypothesis"
6/ He popularized the t-distribution
7/ He created the ANOVA
It's Ronald Fisher!

Maximum Likelihood upsides
• ML is an ingenious method of estimating parameters.
• It has beautiful properties:
• Consistency = with increasing N, ML estimates (MLEs) converge to the true values
• Efficiency = MLE is the "best" way to get these estimates
• Asymptotic normality = with increasing N, MLEs are normally distributed
• But this only holds when the assumptions are met.

Maximum Likelihood downsides
• Assumptions of ML:
• Large samples! (e.g., Gorsuch claims that for FA, N = 200 is probably enough if you have few variables, while N = 50 is not)
• Multivariate normality (i.e., the variables together are distributed normally)
• But you can assume any kind of distribution
• So it's a good idea to check N and the distributions, but, luckily, MLE seems to be pretty robust against violations of the distributional assumptions at least.
• However, MLE is very computationally intensive!
Multivariate normality
• The n-dimensional generalization of univariate normality.
• The variables are normal together.
• When they are MVN, they are also normal each on their own.
• But all of them being normal on their own does not imply they are MVN.
• Weird stuff, right?

Probability density
• Coming back to the 1-dimensional world for simplicity, we can describe any normal distribution using its mean and variance.
• After choosing these values, we can calculate the probability density of x having a certain value, given a normal distribution with the preset mean and variance:
f(x) = 1/√(2πσ²) · exp(−(x − μ)² / (2σ²)) = normalizing constant · exp(−z²/2)

Probability density
• Well, and then we simply plug in the values.
• Let's say we are interested in the density at x = 5 given mean = 3 and variance = 10:
f(5 | μ = 3, σ² = 10) = 1/√(2π·10) · exp(−(5 − 3)² / (2·10)) ≈ .10
See? e-z (pun intended)

The concept of likelihood
• Likelihood is about turning this problem around.
• You usually don't know the parameters of the distribution, right? But you know the values. So you can ask: What set of parameters makes my observations the likeliest (most probable)?

Maximizing the likelihood
• We have a set of observations x and are interested in guessing what normal distribution they came from.
• This requires a slight change in the formula:
L(x | μ, σ²) = ∏_{i=1}^{N} 1/√(2πσ²) · exp(−(x_i − μ)² / (2σ²))
and, taking the log,
ln L = N ln(1/√(2πσ²)) − ∑_{i=1}^{N} (x_i − μ)² / (2σ²)

Maximizing the likelihood
• This badass formula creates a function for which we can find a maximum.
• The X axes are the mean and the variance.
• The Y axis is the likelihood value.
• The top of the mountain is the maximum likelihood = where we find the likeliest combination of parameters.

Computing maximum likelihood
• If you now imagine extending this idea into n dimensions, you probably see that this forms an n-dimensional likelihood monster onto which we need to climb to find the top.
• Computing it directly (using derivatives, as shown in Mulaik) is possible only in simple cases.
• Otherwise, one needs to use an iterative algorithm.

MLE iterations
• These algorithms are mostly concerned with navigating the parameter space efficiently.
• You see, the mountain is all the possible values our parameters can take.
• And there are rules to this madness – if you stumble upon a steep climb, you are probably on a good track to finding the top!
• So the algorithm evaluates the steepness and curvature of the climb (the first and second derivatives) at a given point and then jumps in a certain direction according to the results.

MLE iterations
• It's basically a game of hide and seek.
• The goal is to find the top, and the steepness of the n-dimensional mountain tells you whether you are getting hotter or colder.
• Isn't that beautiful?
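A short scipy sketch that (a) reproduces the worked density value above and (b) finds the MLE of μ and σ for a simulated sample by literally climbing the log-likelihood with a general-purpose optimizer; `norm` and `minimize` are real scipy functions, everything else (sample, starting values) is made up for illustration.

```python
import numpy as np
from scipy.stats import norm
from scipy.optimize import minimize

# The worked density value from the slide: f(5 | mu = 3, sigma^2 = 10)
print(round(norm.pdf(5, loc=3, scale=np.sqrt(10)), 3))   # ~ 0.103

# Maximum likelihood "by climbing the mountain": minimize the negative log-likelihood
rng = np.random.default_rng(7)
x = rng.normal(loc=3, scale=np.sqrt(10), size=500)       # a sample from N(3, 10)

def negloglik(theta):
    mu, log_sigma = theta                                # optimize log(sigma) to keep sigma > 0
    return -np.sum(norm.logpdf(x, loc=mu, scale=np.exp(log_sigma)))

res = minimize(negloglik, x0=[0.0, 0.0], method="Nelder-Mead")
mu_hat, sigma_hat = res.x[0], np.exp(res.x[1])
print(round(mu_hat, 2), round(sigma_hat**2, 2))          # close to the sample mean and (biased) variance
```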
MLE algorithms
• As a sidenote, this means that you should not confuse MLE as an estimation technique with the algorithms for computing the result.
• It's always the same ML whether you choose:
• Newton-Raphson
• Expectation-Maximization
• Metropolis-Hastings Robbins-Monro (MCMC)
• They only differ in how they navigate the n-dimensional mountain.

Local maxima / minima
• The technical term for the top of the mountain (the MLE) is the global maximum.
• However, in some cases, the shape of the mountain can be strange, having multiple smaller peaks along with the highest peak.
• These smaller peaks are called local maxima.
• When the algorithm mistakes such a peak for the top (thinking that you can only descend, not ascend further from that point), you get a wrong result.

Algorithms – sidenote
• By the way, if you ever heard the term Hessian matrix, that's sort of a map of this n-dimensional monster we are exploring.
• It is a matrix containing information about how steeply the monster we are finding the top of rises / descends at various points.
• Mathematically, it is a matrix of second partial derivatives at critical points.

MLE in FA
• So, the goal of ML estimation is to find Λ̂ and D̂_ψ such that the value of the −2 × log-likelihood function is minimized (the −2 × log transformation eases the computations).
• This function is again a discrepancy function – it is always larger than or equal to 0, and it is zero if and only if the model-implied correlation matrix equals the sample correlation matrix.

Maximum likelihood estimation
• The logic of ML estimation is very similar to that of (iterative) OLS:
1) Initial estimates of D̂_ψ are obtained (by SMCs or other means)
2) A maximum likelihood estimate of Λ is obtained, conditional on the estimated D̂_ψ
3) A model-implied reduced correlation matrix is obtained, which completes the first iteration
4) A new iteration is initiated with the most recent estimates of D̂_ψ
5) Iterations continue until convergence is achieved

Summary
• We have described three different methods for fitting the common factor model to sample data:
• Principal factors with prior communality estimates (non-iterative)
• Ordinary least squares (iterative principal factors)
• Maximum likelihood
• These are methods for fitting the model to data. Many more methods exist, but OLS and ML are the ones commonly available.

Summary
• But it's all about some type of discrepancy function that represents the distance between the observed and the expected.
• These functions take different forms, as they all define the best solution differently:
• OLS wants to minimize the residual sum of squares
• MLE wants to maximize the likelihood of the parameters given the observed data

Which is best?
• Minres / PA work when distributional assumptions are broken.
• ML works well with large samples and allows model comparison.
• However, there are simulation studies that really hype minres, stating that it performs the best under most conditions.
• I'd recommend trying both minres and ML and checking whether the results meaningfully differ (i.e., whether you would write a different Discussion section if you were to use the other one).
• Usually, you might not see any real difference, so no need to worry.
• This approach is called a sensitivity analysis. (A small sketch of minimizing the ML discrepancy directly – handy for exactly this kind of comparison – closes these notes below.)

What comes next?
• You've learned a LOT (and I have as well, alongside you).
• In 2 weeks, we meet for our last Vectors of Mind session (sniff).
• We will talk about applied topics like:
• Model fit assessment
• CFA vs. EFA
• Extremely (and I mean extremely) cool approaches to hypothesis testing in FA
• Good research practices concerning FA
• Reporting standards for CFA
• Bifactor models / ordinal FA
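Finally, to close the estimation story, a bare-bones sketch of what "minimizing the ML discrepancy" can look like in code: F_ML = log|Σ̂| + tr(RΣ̂⁻¹) − log|R| − p, with Σ̂ = Λ̂Λ̂' + Ψ̂², handed to a general-purpose optimizer. This ignores identification constraints (so the returned loadings are just one member of the rotationally equivalent set), uses made-up starting values, and is not how real FA software is implemented; comparing its solution with the OLS sketch shown earlier is exactly the kind of sensitivity analysis recommended above.

```python
import numpy as np
from scipy.optimize import minimize

def ml_factor_analysis(R, m, seed=0):
    """Minimize the ML discrepancy F = log|Sigma| + tr(R Sigma^-1) - log|R| - p
    over Lambda (p x m) and the log unique variances. Illustration only."""
    p = R.shape[0]
    _, logdetR = np.linalg.slogdet(R)
    rng = np.random.default_rng(seed)

    def discrepancy(theta):
        L = theta[:p * m].reshape(p, m)
        psi2 = np.exp(theta[p * m:])                   # keep unique variances positive
        Sigma = L @ L.T + np.diag(psi2)                # model-implied correlation matrix
        sign, logdetS = np.linalg.slogdet(Sigma)
        if sign <= 0:
            return np.inf
        return logdetS + np.trace(R @ np.linalg.inv(Sigma)) - logdetR - p

    # Small random starting loadings (to break the symmetry between factors)
    theta0 = np.concatenate([rng.uniform(0.1, 0.6, p * m), np.log(np.full(p, 0.5))])
    res = minimize(discrepancy, theta0, method="L-BFGS-B")
    return res.x[:p * m].reshape(p, m), np.exp(res.x[p * m:]), res.fun

R = np.array([[1.00, .49, .14, .48],
              [ .49, 1.00, .07, .42],
              [ .14, .07, 1.00, .48],
              [ .48, .42, .48, 1.00]])
L, psi2, F = ml_factor_analysis(R, m=2)
print(np.round(psi2, 2))     # ~ .50, .51, .50, .28 (the model fits this matrix exactly)
print(round(F, 4))           # ~ 0: zero discrepancy at the solution
```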