Note: This document was prepared some years ago, and covers material
for which there are many more recent contributions.
See the Further Reading section
for some more recent pointers.
So, you want to do a factor analysis? Apart from understanding a modest
amount of theory, there are a number of practical questions that arise
in any factor analytic study:
- What sample size do I need?
- How many factors should I extract?
- What's a "significant" loading?
- What kind of rotation should I do?
There are many ways of answering these questions in the factor
analysis literature and in research "lore", but it is important to
understand that there is much art (as well as science) to carrying
out a factor analytic study, and judgment is often required.
The material below began as notes I copied from the blackboard in Karl
Jöreskog's factor analysis course at Princeton in 1970/71. Over the
years I've added topics that appeared as Frequently Asked Questions
in consulting and teaching.
This document outlines the phases of a factor analytic study and a
number of the practical questions and issues that need to be
addressed. As an outline, it does not go into much detail. Instead,
you should consult one or more of these sources:
An excellent introductory source for practical information on exploratory
and confirmatory factor analysis is
A step-by-step approach to using the SAS System for factor analysis
and structural equation modeling.
The step-by-step approach uses a set of concrete, substantive social science
research examples to lead you through the steps of questionnaire design,
data input using SAS, interpreting printed output, and model revision.
An older, but in some ways the best, source for concrete ideas on the
practical implementation of exploratory factor analytic studies is
A first course in factor analysis, which includes a description of the
design and analysis of a battery of personality scales using factor analytic
methods.
Analyzing Multivariate Data gives a brief description of factor
analytic techniques (3 chapters) which covers most of the ground and
discusses some practical issues and the applicable features of SAS, SPSS,
BMDP, and LISREL.
Descriptions of some of the ideas behind confirmatory model fitting below
borrow from the LISREL 7 User's Guide
(Jöreskog & Sörbom, 1988)
and from Bollen's (1989) book,
Structural equations with latent variables.
Byrne (1990) is a good source for
LISREL analysis of complex CFA models.
Do you really want to do a factor analysis? (Theory construction and
testing vs. data summarization; account for common variance or all variance).
Definition of domain: what kinds of tests to study?
e.g., Guilford's "Structure of Intellect" model presented a
theory which cross-classified any test of intellective performance along
a number of dimensions.
For factor-analytic study of a single construct (e.g., "anxiety")
it is important to have a sufficiently detailed theoretical description
to determine the relevant dimensions of the construct
(e.g., trait-anxiety, state-anxiety, etc.)
Examination of earlier literature: What variables used in previous studies,
factors found, etc.
Formulation of hypotheses:
- How many factors expected?
- What kind of factors? (Orthogonal, oblique, general, group)
- Alternative hypotheses?
Construction & selection of tests
How many variables to use? (p)
Overdetermine the hypothesized factors: Need at least p = 2 variables to
extract a common factor (by definition).
It is better to have at least 3-5 variables believed to measure each factor.
Use p = 5 × k for safety (where k is the number of factors).
Factor analytic principles and empirical studies suggest it is better to
have more than the minimum number of variables/factor.
As the number of salient variables / factor increases, the communalities,
rotational positions, and factor scores all become better determined.
It appears to be generally more difficult to replicate factors with fewer
than 5 or 6 salient variables for each factor.
- Include pure-factor variables ("markers") wherever possible--
variables expected to load only on that factor.
- Avoid variables which are experimentally dependent-- where the result on one
is necessarily dependent on another (e.g., systolic & diastolic BP;
items on a questionnaire which are just minor rephrasings or which are based
on the same context).
- Data collection
What population is being sampled?
Define the population to which you want to generalize results.
Take pains to achieve random sampling.
As in all statistics, the validity of inferences is threatened when samples are not representative.
Avoid restriction of range, i.e., sample is homogeneous on some of the
measures. This reduces the possible size of correlations.
Sample size (N)
The more the better! Reliability and
replicability increase directly with √N.
Monte Carlo studies show that more reliable factors can be extracted with
larger sample sizes.
Absolute minimum-- N = 5 × p, but you should have N >
100 for any serious factor analysis.
The minimum applies only when communalities are high and p / k is high.
Most major factor analytic studies use N > 200, some as high as 500-600.
Safer to use at least N > 10 × p.
The lower the reliabilities, the larger N should be.
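As a quick arithmetic check, the rules of thumb above can be wrapped in a small helper; the function name and interface are my own illustration, not from any statistical package.

```python
def minimum_sample_size(p, conservative=True):
    """Rules of thumb above: N >= 5p is the absolute minimum (only
    defensible when communalities and p/k are high), N >= 10p is safer,
    and N > 100 in any case."""
    per_variable = 10 if conservative else 5
    return max(100, per_variable * p)

print(minimum_sample_size(20))                      # safer rule: 200
print(minimum_sample_size(20, conservative=False))  # bare minimum: 100
```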
Plan for determining the reliability of each measure (e.g., test-retest on
a subsample, or coefficient α for scales/tests composed of items).
Plan for cross-validation (split-sample) or validation (replication).
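For scales composed of items, coefficient α can be computed directly from a subjects-by-items score matrix. A minimal numpy sketch (the item scores below are invented for illustration):

```python
import numpy as np

def cronbach_alpha(items):
    """Coefficient alpha for an (N subjects x k items) score matrix:
    alpha = k/(k-1) * (1 - sum of item variances / variance of total)."""
    items = np.asarray(items, dtype=float)
    k = items.shape[1]
    item_vars = items.var(axis=0, ddof=1).sum()
    total_var = items.sum(axis=1).var(ddof=1)
    return k / (k - 1) * (1 - item_vars / total_var)

# Five hypothetical subjects answering four 5-point items
x = np.array([[4, 5, 4, 4],
              [2, 2, 3, 2],
              [3, 3, 3, 4],
              [5, 4, 5, 5],
              [1, 2, 1, 2]])
print(round(cronbach_alpha(x), 3))  # high alpha: items covary strongly
```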
Reliabilities of tests - gives an upper bound on communalities, and good
initial estimates (the PRIORS statement in PROC FACTOR).
Data screening - check for outliers and errors: a probability plot of Mahalanobis
squared distances from the mean is useful. (Alternatively, the diagonal elements
of H, the "hat" matrix, can be used to check for multivariate outliers.)
Distributions - transformations required? All variables
should be multivariate normal. Lack of normality can
distort the validity of the χ² tests for ML
factor methods. At the least, make sure that all are
reasonably symmetric, and transform any which are noticeably skewed.
Sample stratification: Are there natural subgroups within the sample which
might differ in either their means or in the pattern of correlation?
If significant differences in means exist, analyze the within-cell
correlation matrix (i.e., partial out group membership
to set the means in each group to zero before computing correlations).
Alternatively, code the group variable with dummy variables and examine
the correlations with the factor variables.
If a different pattern of correlations is expected, consider doing a separate
analysis for each group or testing the hypothesis of equal covariance /
correlation matrices across groups.
An alternative, which does not require splitting the sample, is to include
the dummy variable(s) in the analysis.
If these dummy variables load (strongly) on any factors, the groups differ
on this factor.
Matrix of full rank? (Linear dependencies?) Check the value of the
determinant of the correlation/covariance matrix.
If near zero, delete one or more variables.
R = Identity? (Sphericity test).
If you cannot reject the hypothesis that the variables are all uncorrelated,
you have no business doing factor analysis.
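Both screening checks can be sketched in a few lines of numpy: the determinant of R flags near-singularity (linear dependencies), and Bartlett's sphericity statistic, −(n − 1 − (2p + 5)/6) ln|R| on p(p − 1)/2 df, tests R = I. The correlation matrix below is invented for illustration:

```python
import math
import numpy as np

def bartlett_sphericity(R, n):
    """Bartlett's test of the hypothesis R = I (all variables
    uncorrelated). Returns the chi-square statistic and its degrees of
    freedom; compare the statistic to a chi-square critical value."""
    p = R.shape[0]
    chi2 = -(n - 1 - (2 * p + 5) / 6.0) * math.log(np.linalg.det(R))
    df = p * (p - 1) // 2
    return chi2, df

R = np.array([[1.0, 0.6, 0.5],
              [0.6, 1.0, 0.4],
              [0.5, 0.4, 1.0]])
print(np.linalg.det(R))   # a determinant near zero signals a dependency
chi2, df = bartlett_sphericity(R, n=150)
print(chi2, df)           # large chi2 on 3 df: reject R = I
```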
With discrete, ordinal (or binary) measures the alternatives are to treat
them as continuous anyway (use ordinary Pearson correlations) or use
special procedures developed for polychoric (or tetrachoric) correlations.
Robustness studies are mixed; they suggest that discreteness may introduce
little bias in the estimation of parameter values, but may affect standard
errors and ML χ² tests more seriously, particularly when the
number of response categories is small (2-4).
Hence, for exploratory studies the consequences may not be serious.
For confirmatory studies, the PRELIS program, a companion program to
LISREL 7, provides a special method to estimate a covariance matrix from
ordinal data and a weight matrix which is used in LISREL for such data.
See Bollen (1989, 433ff)
and the LISREL 7 Guide for further discussion.
Adequacy of common factor model?
PROC FACTOR with METHOD=ML
gives a test of the hypothesis that no common factors are needed (k = 0).
You should be able to clearly reject this hypothesis.
The common factor model assumes that unique factors are uncorrelated,
but provides no test of this assumption.
LISREL/CALIS models allow this assumption to be tested.
Determining the number of factors
Number of factors = number of eigenvalues > 1.
Unfortunately, this is the default for most factoring programs
(SAS, SPSS, etc.).
There are several heuristic rationales for this rule-of-thumb, but most
evidence indicates it often gives the wrong number of factors.
Scree test - Generally good results if there is a clear "break"
in the plot of eigenvalues. Works best when ratio of p/k is large.
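A minimal numpy sketch of both rules, applied to an invented correlation matrix with two blocks of correlated variables (so two factors are plausible):

```python
import numpy as np

# Hypothetical 6-variable correlation matrix: variables 1-3 and 4-6
# form two correlated blocks, weakly related to each other.
R = np.array([
    [1.0, 0.7, 0.6, 0.1, 0.1, 0.1],
    [0.7, 1.0, 0.6, 0.1, 0.1, 0.1],
    [0.6, 0.6, 1.0, 0.1, 0.1, 0.1],
    [0.1, 0.1, 0.1, 1.0, 0.5, 0.4],
    [0.1, 0.1, 0.1, 0.5, 1.0, 0.4],
    [0.1, 0.1, 0.1, 0.4, 0.4, 1.0],
])
eigvals = np.sort(np.linalg.eigvalsh(R))[::-1]
print(np.round(eigvals, 2))          # inspect for a "break" (scree test)
print(int(np.sum(eigvals > 1.0)))    # eigenvalue-greater-than-1 rule
```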
Tests derived from the χ² test of the ML solution are generally
preferred, but not without their own problems.
Examine the matrix of residuals. We want to account for correlations or
covariances; hence the residual covariances should all be small if the
model is adequate.
Chi-square test from the maximum likelihood solution
Provides a statistical test of fit of the model with k factors,
against the alternative hypothesis that Σ is unconstrained
(any positive definite symmetric matrix).
The test is based on the following assumptions, which are rarely
fulfilled in practice:
- All observed variables have a multivariate normal distribution.
- The analysis is based on the sample variance-covariance matrix,
not the correlation matrix.
- The sample size is "fairly large" (the test relies on the asymptotic
properties of maximum likelihood estimation, i.e., as N → ∞).
Rather than regarding χ² as a formal test statistic, one should
regard it as a badness-of-fit measure, in the sense that large
χ² values correspond to bad fit and small values to good fit.
From this perspective, the statistical problem is not one of testing
a given hypothesis (which may be considered false a priori),
but rather one of fitting the model to the data to decide whether the fit
is adequate or not. With greater N you can extract more statistically
significant factors.
The χ² measure is sensitive to sample size and very sensitive to
departures from multivariate normality. Large sample sizes and departures
from normality tend to increase χ² over and above what can be
expected due to misspecification of the model.
(See CFA below for alternative measures).
A more reasonable way to use the χ² value is to compare the
differences in χ² to the differences in degrees of freedom
as more factors are added to the model. A large drop in χ² compared
to the difference in d.f. indicates that the addition of one more
factor represents a real improvement.
A drop in χ² close to the difference in d.f. indicates that the
improvement in fit is obtained by 'capitalizing on chance', and the
added factor may not have real significance or meaning.
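A sketch of this χ² difference comparison, with made-up fit statistics and .05 critical values taken from standard χ² tables:

```python
# Hypothetical ML fit statistics for k = 1, 2, 3 factor models
fits = [
    {"k": 1, "chi2": 153.2, "df": 35},
    {"k": 2, "chi2":  58.7, "df": 26},
    {"k": 3, "chi2":  46.9, "df": 18},
]

# Upper .05 chi-square critical values for the df differences we need
critical = {9: 16.92, 8: 15.51}

for a, b in zip(fits, fits[1:]):
    d_chi2 = a["chi2"] - b["chi2"]
    d_df = a["df"] - b["df"]
    verdict = ("real improvement" if d_chi2 > critical[d_df]
               else "capitalizing on chance")
    print(f"k={a['k']} -> k={b['k']}: "
          f"d_chi2={d_chi2:.1f} on {d_df} df: {verdict}")
```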
Other goodness of fit measures
There are a large number of alternative goodness-of-fit measures designed
to overcome the limitations of the raw χ² test.
Bollen (1989, 256-289) divides
these into overall fit measures and incremental fit measures (how much
better with one more factor?). Some of these are:
χ² / df: For comparing models with different numbers of factors,
some researchers use the ratio of χ² per degree of freedom, and
interpret a value less than 2.0 as indicating adequate fit.
Like the χ² itself, this index generally increases with sample size.
Tucker-Lewis index (TLI): This index scales the observed χ² to a
range of approximately 0 - 1, where 0 represents the χ² obtained
from a null model, and 1 represents an ideal fit.
The idea is
similar to the use of η² or ω² in ANOVA
as a measure of proportion of variance explained. The TLI is defined as

    TLI = (χ²₀ / df₀ − χ²ₘ / dfₘ) / (χ²₀ / df₀ − 1)

where the subscript 0 refers to the null model (no common factors: all
variables uncorrelated) and m refers to the model being tested.
Marsh, Balla &
McDonald (1988) found the TLI to be the only widely-used goodness of fit
index which is not affected by sample size.
Note, however, that rather large values are typically found, since
fit is compared to a baseline null model: a value of TLI <
.90 usually means that the model can be improved substantially
(Bentler & Bonnet, 1980).
AIC: Akaike's Information Criterion is becoming more
widely used as a criterion for comparing models which vary in the
number of free parameters. The essential idea is to penalize models
with more free parameters, since they are more likely to fit.
A related index is Schwarz's BIC measure, which includes the sample size
in the penalty.
    AIC = χ²ₘ + 2t        BIC = χ²ₘ + t log(N)

where t is the number of free parameters. Choose the model which gives the smallest value of AIC or BIC.
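Both criteria are easy to compute once the model χ² and the number of free parameters t are in hand; the model values below are invented for illustration:

```python
import math

def aic(chi2, t):
    """AIC = chi2 + 2t, where t is the number of free parameters."""
    return chi2 + 2 * t

def bic(chi2, t, n):
    """BIC = chi2 + t * log(N), penalizing by sample size as well."""
    return chi2 + t * math.log(n)

# Hypothetical 2- vs. 3-factor models fit to N = 200 cases
models = {"2-factor": (58.7, 25), "3-factor": (46.9, 33)}
for name, (chi2, t) in models.items():
    print(name, round(aic(chi2, t), 1), round(bic(chi2, t, 200), 1))
# Choose the model with the smallest AIC (or BIC).
```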
- Need rotation only because we can't visualize in k dimensions,
but have to look at a table of loadings.
All rotations are equally good from a mathematical standpoint, but may differ
substantively or in terms of interpretability.
Does prior evidence warrant assumption of orthogonal factors?
If so, use orthogonal rotation (e.g., varimax); otherwise use oblique
rotation (e.g., promax).
- Oblique rotations typically give simpler structure (loadings),
at expense of having to also interpret factor correlations.
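For reference, varimax (the orthogonal rotation mentioned above) can be sketched with the standard SVD-based algorithm in a few lines of numpy; the loading matrix below is invented for illustration:

```python
import numpy as np

def varimax(L, tol=1e-8, max_iter=500):
    """Orthogonal varimax rotation of a p x k loading matrix L,
    using the standard SVD-based iteration; returns rotated loadings."""
    p, k = L.shape
    T = np.eye(k)
    var_old = 0.0
    for _ in range(max_iter):
        LT = L @ T
        # Gradient of the varimax criterion at the current rotation
        G = L.T @ (LT ** 3 - LT @ np.diag((LT ** 2).sum(axis=0)) / p)
        U, s, Vt = np.linalg.svd(G)
        T = U @ Vt
        var_new = s.sum()
        if var_new - var_old < tol:
            break
        var_old = var_new
    return L @ T

# Unrotated loadings from a hypothetical 2-factor solution
L = np.array([[0.8, 0.4], [0.7, 0.5], [0.6, 0.4],
              [0.5, -0.6], [0.4, -0.5], [0.4, -0.6]])
print(np.round(varimax(L), 2))  # loadings move toward simple structure
```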
If oblique rotation is used, do the factor correlations differ from zero?
Does the pattern of rotated loadings fit with hypotheses?
In oblique solutions, it is often better to interpret the factor
structure matrix (correlations of the variables with the factors)
than the factor pattern matrix (loadings: weights used to
calculate variable standard scores from factor standard scores).
[These are the same in orthogonal solutions.]
Moreover, the factor structure may be expected to remain stable over shifts
in other factors which appear in an analysis, while the factor pattern
usually does not.
Many people use |λ| ≥ .3 or .4 as a criterion for salient
loadings without any justification. Monte Carlo studies by Pennell (1968)
and Cliff &
Hamburger (1967) suggest that the correlations in the factor structure may
be judged roughly using the formula for the standard error of an ordinary
raw correlation, doubled (i.e., 2 × (1 − r²) / √n) to
accommodate capitalization on chance. Horn (1967) and
Humphreys et al. (1969) show that loadings arising by chance
can be of impressive size. The less exploratory the study, the less
capitalization on chance can occur.
The vague basis for the .3-.4 rule of thumb appears to be this:
With N=100, the minimum significant correlation at p< .05 is about 0.2.
Doubling this gives 0.4.
By this rule of thumb, interpreting a structure correlation of 0.3
as significant would require N>175.
Note that with very large sample sizes, loadings so small as to be
uninterpretable may still be significant. This may be another reason
for the popularity of 0.3 as an absolute minimum.
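A sketch of the arithmetic behind this rule of thumb, using the normal approximation z = 1.96 for the p < .05 criterion (my simplification; the N > 175 in the text comes from a slightly stricter significance criterion):

```python
import math

def salient_threshold(n, r=0.0, z=1.96):
    """Doubled significance criterion for a raw correlation:
    2 * z * (1 - r**2) / sqrt(n)."""
    return 2 * z * (1 - r ** 2) / math.sqrt(n)

def n_required(threshold, z=1.96):
    """Sample size at which a structure correlation of the given size
    clears the doubled criterion (r = 0 case)."""
    return math.ceil((2 * z / threshold) ** 2)

print(round(salient_threshold(100), 2))  # about 0.4 with N = 100
print(n_required(0.3))                   # close to the N > 175 noted above
```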
All interpretations of factors are post hoc unless subjected to confirmation
by (cross-)validation or confirmatory hypothesis testing.
Should be regarded as tentative, subject to further research rather than as final.
Procrustes rotation to a specified factor pattern-- Specify the pattern of
zero and non-zero loadings, and attempt to rotate the loading matrix to
this pattern. How close can you come?
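When the target pattern is fully specified, the classic orthogonal Procrustes solution finds the best-fitting rotation via an SVD. A numpy sketch, with an invented target pattern and loadings:

```python
import numpy as np

def procrustes_rotation(L, target):
    """Orthogonal Procrustes: find the rotation T minimizing
    ||L @ T - target|| via the SVD of L.T @ target, and return the
    rotated loadings."""
    U, _, Vt = np.linalg.svd(L.T @ target)
    T = U @ Vt
    return L @ T

# Hypothesis: variables 1-3 load only on factor 1, 4-6 only on factor 2
target = np.array([[0.7, 0.0], [0.7, 0.0], [0.7, 0.0],
                   [0.0, 0.7], [0.0, 0.7], [0.0, 0.7]])
# Unrotated loadings: the same structure mixed by an arbitrary rotation
theta = np.deg2rad(35)
mix = np.array([[np.cos(theta), -np.sin(theta)],
                [np.sin(theta),  np.cos(theta)]])
L = target @ mix
rotated = procrustes_rotation(L, target)
print(np.round(rotated, 2))  # recovers the hypothesized pattern
```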
- Many of these arbitrary rules-of-thumb disappear when CFA is used!
Reformulation of hypotheses and/or tests
Summarize the discrepancies between the hypotheses and the rotated solution.
Might there be a need to redesign or replace any of the tests?
If measures of reliability are available, the unique variance can be
partitioned into unreliability (error variance) and specific variance.
Measures with very small communalities (large unique variance) do not
measure what the other tests measure. Perhaps you need to add other
indicators of what these tests measure.
Measures with very large communalities (> .95) cause numerical
problems in maximum likelihood solutions ("Heywood cases").
Cross validation or replication
Factors should be regarded as tentative until replicated. Just as in
regression, the factor analysis model is fit by maximizing
goodness-of-fit in the sample, i.e., by minimizing some function,
F(S, Σ̂), of the difference between the actual and fitted
covariance matrices. A future sample will not fit as well.
Some people suggest splitting the sample into halves in an exploratory study,
using one-half to develop hypotheses, and the other half for confirmatory
testing. This reduces the effective sample size for either part of the study,
but it does provide for validation within a single study. The halves should
be randomly determined.
In the cross-validation design, the sample is split into random
half-samples, 1 and 2. Let S₁ and S₂ be the variance-covariance
matrices for the two sub-samples, and let Σ̂(k|1) and
Σ̂(k|2) be the reproduced covariance matrices
from fitting structural model Mₖ to samples 1 and 2.
Then an index of cross-validation is the goodness-of-fit measure
between the data for sample 1, S₁, and the fitted
matrix for sample 2, Σ̂(k|2) (and vice versa).
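A sketch of such an index using the ML discrepancy function F(S, Σ) = ln|Σ| − ln|S| + tr(SΣ⁻¹) − p. For simplicity, each half-sample's covariance matrix stands in for a model-implied Σ̂ here; a real application would use the fitted matrix from the structural model:

```python
import numpy as np

def ml_discrepancy(S, Sigma):
    """ML discrepancy F(S, Sigma) = log|Sigma| - log|S|
    + tr(S Sigma^-1) - p; zero when Sigma = S."""
    p = S.shape[0]
    _, logdet_Sigma = np.linalg.slogdet(Sigma)
    _, logdet_S = np.linalg.slogdet(S)
    return logdet_Sigma - logdet_S + np.trace(S @ np.linalg.inv(Sigma)) - p

# Simulated data split into random halves
rng = np.random.default_rng(1)
X = rng.multivariate_normal(np.zeros(4), np.eye(4) + 0.5, size=400)
half1, half2 = X[:200], X[200:]
S1 = np.cov(half1, rowvar=False)
S2 = np.cov(half2, rowvar=False)
print(round(ml_discrepancy(S1, S2), 3))  # cross-validation index
```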
Number of factors
Specify the number of factors based on exploratory analyses.
Pattern of loadings: fixed vs. free parameters.
Specify a hypothesis by constraining certain parameters in the factor
matrices to be zero. The hypothesis is confirmed to the extent that
the model still fits.
Note: For a LISREL factor analysis model to be identified,
it is necessary to fix at least one loading on each factor to a non-zero
value (e.g., 1.0) in order to fix the measurement scale of that factor.
The LISREL program calculates "modification indices" for
each fixed and constrained parameter;
PROC CALIS calls these
"Lagrange multiplier" tests. This index is the expected decrease
in χ² if a single constraint in the hypothesis is relaxed, and all
estimated parameters are held fixed at their estimated values.
Each modification index is a χ² with 1 df, and the parameter with
the largest index will improve fit maximally. Relaxing parameters based on
the modification index is only recommended when the parameter(s) freed
make sense from a substantive point of view.
Similarly, the t-values for each free parameter provide a test of the
hypothesis that the parameter equals 0.
PROC CALIS provides
a Wald test statistic as well, which is a 1 df χ² value.
Both statistics evaluate whether a restriction (setting the parameter = 0)
can be imposed on the estimated model.
Nested hypotheses and difference in Chi Square.
As described above under "Determining the number of factors",
the χ² statistic is best regarded as a measure of badness of fit
of the hypothesis.
It makes sense to compare a series of hypotheses, H1, H2,
H3, ..., such that H1 is the most stringent or restricted
hypothesis, and H2, H3, ... successively relax some of the
restrictions. If Hi is wholly included in Hi+1, then the
difference in χ² between them can be regarded as a test of
the parameters that are fixed in Hi but free in Hi+1.
For example, if H1 and H2 both specify the same factor pattern,
but H1 fixes the factor correlations, Φ = I, while H2
allows the factor correlations to be free, the χ² difference is
attributable to the correlations among the factors.

    H1 vs. H2: Φ = I is tested by Δχ² = χ²₁ − χ²₂ on Δdf = df₁ − df₂ degrees of freedom.
Other measures of goodness of fit.
LISREL and PROC CALIS also give other indices which are
useful in assessing the fit of a hypothesis:
Goodness of fit index (GFI): A measure of the relative amount of
variances and covariances accounted for by the model, and an adjusted
goodness of fit value (AGFI), adjusted for degrees of freedom. Both
measures are between 0 and 1, where 1 = perfect fit. Unlike the χ²,
Jöreskog & Sörbom (1984) claim that both the GFI and AGFI indices are
relatively independent of sample size and relatively robust against departures
from normality. Their distributional properties are unknown, however, so there
is no significance test associated with them.
It should be emphasized that the measures χ², GFI, and AGFI are
measures of the overall fit of the model to the data and do not
express the quality of the model by any other criteria. For example,
it can happen that the overall fit of the model is very good, but one or
more relationships in the model is poorly determined (as indicated by the
squared multiple correlations), or vice versa. Furthermore, if any of
the overall measures indicates that the model does not fit well, that fact
does not tell what is wrong with the model. Diagnosing what part of the
model is wrong can be done by inspecting the normalized residuals (which
correlations are not well fit?) and/or the modification indices
(which fixed parameters might be relaxed?).
- Absolute vs. relative measures. There are now many new measures of goodness of fit, but they may be classified as:
- absolute measures:
how well does this model fit?
- relative measures:
how well does this model fit compared to the null model, the saturated model, or a simpler model?
Of all available software, only AMOS provides these model comparison statistics
when you fit a series of models.
A SAS macro provides similar model comparison statistics
for a set of models fit using PROC CALIS.
- Bentler & Chou (1987). Practical issues in structural modeling.
Sociological Methods and Research, 16(1), 78-117.
- LISREL 8 and PRELIS2: Getting Started
A detailed illustrated introductory guide
to LISREL and PRELIS from the University of Texas
- Structural equation models:
© 1995 Michael Friendly