Note: This document was prepared some years ago, and covers material
for which there are many more recent contributions.
See the Further Reading section
for some more recent pointers.
So, you want to do a factor analysis? Apart from understanding a modest
amount of theory, there are a number of practical questions that arise
in any factor analytic study:
- What sample size do I need?
- How many factors should I extract?
- What's a "significant" loading?
- What kind of rotation should I do?
There are many ways of answering these questions in the factor
analysis literature and in research "lore", but it is important to
understand that there is much art (as well as science) to carrying
out a factor analytic study, and judgment is often required.
The material below began as notes I copied from the blackboard in Karl
Jöreskog's factor analysis course at Princeton in 1970/71. Over the
years I've added topics that appeared as Frequently Asked Questions
in consulting and teaching.
This document outlines the phases of a factor analytic study and a
number of the practical questions and issues that need to be
addressed. As an outline, it does not go into much detail. Instead,
you should consult one or more of these sources:
An excellent introductory source for practical information on exploratory
and confirmatory factor analysis is
A step-by-step approach to using the SAS System for factor analysis
and structural equation modeling.
The step-by-step approach uses a set of concrete, substantive social science
research examples to lead you through the steps of questionnaire design,
data input using SAS, interpreting printed output, and model revision.
An older, but in some ways the best, source for concrete ideas on the
practical implementation of exploratory factor analytic studies is
A first course in factor analysis, which includes a description of the
design and analysis of a battery of personality scales using factor analytic
methods.
Analyzing Multivariate Data gives a brief description of factor
analytic techniques (3 chapters) which covers most of the ground and
discusses some practical issues and the applicable features of SAS, SPSS,
BMDP, and LISREL.
Descriptions of some of the ideas behind confirmatory model fitting below
borrow from the LISREL 7 User's Guide
(Jöreskog & Sörbom, 1988)
and from Bollen's (1989) book,
Structural equations with latent variables.
Byrne (1990) is a good source for
LISREL analysis of complex CFA models.
Do you really want to do a factor analysis? (Theory construction and
testing vs. data summarization; account for common variance or all variance).
Definition of domain: what kinds of tests to study?
e.g., Guilford's "Structure of Intellect" model presented a
theory which cross-classified any test of intellective performance along
a number of dimensions.
For factor-analytic study of a single construct (e.g., "anxiety")
it is important to have a sufficiently detailed theoretical description
to determine the relevant dimensions of the construct
(e.g., trait-anxiety, state-anxiety, etc.)
Examination of earlier literature: What variables used in previous studies,
factors found, etc.
Formulation of hypotheses:
- How many factors expected?
- What kind of factors? (Orthogonal, oblique, general, group)
- Alternative hypotheses?
Construction & selection of tests
How many variables to use? (p)
Overdetermine the hypothesized factors: Need at least p = 2 variables to
extract a common factor (by definition).
It is better to have at least 3-5 variables believed to measure each factor.
Use p = 5 × k for safety (where k is the number of factors).
Factor analytic principles and empirical studies suggest it is better to
have more than the minimum number of variables/factor.
As the number of salient variables / factor increases, the communalities,
rotational positions, and factor scores all become better determined.
It appears to be generally more difficult to replicate factors with fewer
than 5 or 6 salient variables for each factor.
- Include pure-factor variables ("markers") wherever possible--
variables expected to load only on that factor.
- Avoid variables which are experimentally dependent-- where the result on one
is necessarily dependent on another (e.g., systolic & diastolic BP;
items on a questionnaire which are just minor rephrasings or which are based
on the same context).
- Data collection
What population is being sampled?
Define the population to which you want to generalize results.
Take pains to achieve random sampling.
As in all statistics, the validity of inferences is threatened when samples are not representative.
Avoid restriction of range, i.e., sample is homogeneous on some of the
measures. This reduces the possible size of correlations.
Sample size (N)
The more the better! Reliability and
replicability increase directly with √N.
Monte Carlo studies show that more reliable factors can be extracted with
larger sample sizes.
Absolute minimum-- N = 5 × p, but you should have N >
100 for any serious factor analysis.
The minimum applies only when communalities are high and p / k is high.
Most major factor analytic studies use N > 200, some as high as 500-600.
Safer to use at least N > 10 × p.
The lower the reliabilities, the larger N should be.
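As a quick arithmetic check, the rules of thumb above can be wrapped in a small helper; the function name and interface are my own illustration, not from any statistical package.

```python
def minimum_sample_size(p, conservative=True):
    """Rules of thumb above: N >= 5p is the absolute minimum (only
    defensible when communalities and p/k are high), N >= 10p is safer,
    and N > 100 in any case."""
    per_variable = 10 if conservative else 5
    return max(100, per_variable * p)

print(minimum_sample_size(20))                      # safer rule: 200
print(minimum_sample_size(20, conservative=False))  # bare minimum: 100
```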
Plan for determining the reliability of each measure (e.g., test-retest on
a subsample, or coefficient α for scales/tests composed of items).
Plan for cross-validation (split-sample) or validation (replication).
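For scales composed of items, coefficient α can be computed directly from a subjects-by-items score matrix. A minimal numpy sketch (the item scores below are invented for illustration):

```python
import numpy as np

def cronbach_alpha(items):
    """Coefficient alpha for an (N subjects x k items) score matrix:
    alpha = k/(k-1) * (1 - sum of item variances / variance of total)."""
    items = np.asarray(items, dtype=float)
    k = items.shape[1]
    item_vars = items.var(axis=0, ddof=1).sum()
    total_var = items.sum(axis=1).var(ddof=1)
    return k / (k - 1) * (1 - item_vars / total_var)

# Five hypothetical subjects answering four 5-point items
x = np.array([[4, 5, 4, 4],
              [2, 2, 3, 2],
              [3, 3, 3, 4],
              [5, 4, 5, 5],
              [1, 2, 1, 2]])
print(round(cronbach_alpha(x), 3))  # high alpha: items covary strongly
```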
Reliabilities of tests - gives an upper bound on communalities, and good
initial estimates (the PRIORS statement in PROC FACTOR).
Data screening - check for outliers and errors: a probability plot of Mahalanobis
squared distances from the mean is useful. (Alternatively, the diagonal elements
of H, the "hat" matrix, can be used to check for multivariate outliers.)
Distributions - transformations required? All variables
should be multivariate normal. Lack of normality can
distort the validity of the χ² tests for ML
factor methods. At the least, make sure that all are
reasonably symmetric, and transform any which are noticeably skewed.
Sample stratification: Are there natural subgroups within the sample which
might differ in either their means or in the pattern of correlation?
If significant differences in means exist, analyze the within-cell
correlation matrix (i.e., partial out group membership
to set the means in each group to zero before computing correlations).
Alternatively, code the group variable with dummy variables and examine
the correlations with the factor variables.
If a different pattern of correlations is expected, consider doing a separate
analysis for each group or testing the hypothesis of equal covariance /
correlation matrices across groups.
An alternative, which does not require splitting the sample, is to include
the dummy variable(s) in the analysis.
If these dummy variables load (strongly) on any factors, the groups differ
on this factor.
Matrix of full rank? (Linear dependencies?) Check the value of the
determinant of the correlation/covariance matrix.
If near zero, delete one or more variables.
R = Identity? (Sphericity test).
If you cannot reject the hypothesis that the variables are all uncorrelated,
you have no business doing factor analysis.
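Both screening checks can be sketched in a few lines of numpy: the determinant of R flags near-singularity (linear dependencies), and Bartlett's sphericity statistic, −(n − 1 − (2p + 5)/6) ln|R| on p(p − 1)/2 df, tests R = I. The correlation matrix below is invented for illustration:

```python
import math
import numpy as np

def bartlett_sphericity(R, n):
    """Bartlett's test of the hypothesis R = I (all variables
    uncorrelated). Returns the chi-square statistic and its degrees of
    freedom; compare the statistic to a chi-square critical value."""
    p = R.shape[0]
    chi2 = -(n - 1 - (2 * p + 5) / 6.0) * math.log(np.linalg.det(R))
    df = p * (p - 1) // 2
    return chi2, df

R = np.array([[1.0, 0.6, 0.5],
              [0.6, 1.0, 0.4],
              [0.5, 0.4, 1.0]])
print(np.linalg.det(R))   # a determinant near zero signals a dependency
chi2, df = bartlett_sphericity(R, n=150)
print(chi2, df)           # large chi2 on 3 df: reject R = I
```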
With discrete, ordinal (or binary) measures the alternatives are to treat
them as continuous anyway (use ordinary Pearson correlations) or use
special procedures developed for polychoric (or tetrachoric) correlations.
Robustness studies are mixed; they suggest that discreteness may introduce
little bias in the estimation of parameter values, but may affect standard
errors and ML χ² tests more seriously, particularly when the
number of response categories is small (2-4).
Hence, for exploratory studies the consequences may not be serious.
For confirmatory studies, the PRELIS program, a companion program to
LISREL 7, provides a special method to estimate a covariance matrix from
ordinal data and a weight matrix which is used in LISREL for such data.
See Bollen (1989, 433ff)
and the LISREL 7 Guide for further discussion.
Adequacy of common factor model?
PROC FACTOR with METHOD=ML
gives a test of the hypothesis that no common factors are needed (k = 0).
You should be able to clearly reject this hypothesis.
The common factor model assumes that unique factors are uncorrelated,
but provides no test of this assumption.
LISREL/CALIS models allow this assumption to be tested.
Determining the number of factors
Number of factors = number of eigenvalues > 1.
Unfortunately, this is the default for most factoring programs
(SAS, SPSS, etc.).
There are several heuristic rationales for this rule-of-thumb, but most
evidence indicates it often gives the wrong number of factors.
Scree test - Generally good results if there is a clear "break"
in the plot of eigenvalues. Works best when ratio of p/k is large.
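A minimal numpy sketch of both rules, applied to an invented correlation matrix with two blocks of correlated variables (so two factors are plausible):

```python
import numpy as np

# Hypothetical 6-variable correlation matrix: variables 1-3 and 4-6
# form two correlated blocks, weakly related to each other.
R = np.array([
    [1.0, 0.7, 0.6, 0.1, 0.1, 0.1],
    [0.7, 1.0, 0.6, 0.1, 0.1, 0.1],
    [0.6, 0.6, 1.0, 0.1, 0.1, 0.1],
    [0.1, 0.1, 0.1, 1.0, 0.5, 0.4],
    [0.1, 0.1, 0.1, 0.5, 1.0, 0.4],
    [0.1, 0.1, 0.1, 0.4, 0.4, 1.0],
])
eigvals = np.sort(np.linalg.eigvalsh(R))[::-1]
print(np.round(eigvals, 2))          # inspect for a "break" (scree test)
print(int(np.sum(eigvals > 1.0)))    # eigenvalue-greater-than-1 rule
```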
Tests derived from the χ² test of the ML solution are generally
preferred, but not without their own problems.
Examine the matrix of residuals. We want to account for correlations or
covariances; hence the residual covariances should all be small if the
model is adequate.
Chi-square test from the maximum likelihood solution
Provides a statistical test of fit of the model with k factors,
against the alternative hypothesis that Σ is unconstrained
(any positive definite symmetric matrix).
The test is based on the following assumptions, which are rarely
fulfilled in practice:
- All observed variables have a multivariate normal distribution.
- The analysis is based on the sample variance-covariance matrix,
not the correlation matrix.
- The sample size is "fairly large" (the test relies on the asymptotic
properties of maximum likelihood estimation, i.e., as N → ∞).
Rather than regarding χ² as a formal test statistic, one should
regard it as a badness-of-fit measure, in the sense that large
χ² values correspond to bad fit and small values to good fit.
From this perspective, the statistical problem is not one of testing
a given hypothesis (which may be considered false a priori),
but rather one of fitting the model to the data to decide whether the fit
is adequate or not. With greater N you can extract more statistically
significant factors.
The χ² measure is sensitive to sample size and very sensitive to
departures from multivariate normality. Large sample sizes and departures
from normality tend to increase χ² over and above what can be
expected due to misspecification of the model.
(See CFA below for alternative measures).
A more reasonable way to use the χ² value is to compare the
differences in χ² to the differences in degrees of freedom
as more factors are added to the model. A large drop in χ² compared
to the difference in d.f. indicates that the addition of one more
factor represents a real improvement.
A drop in χ² close to the difference in d.f. indicates that the
improvement in fit is obtained by 'capitalizing on chance', and the
added factor may not have real significance or meaning.
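A sketch of this χ² difference comparison, with made-up fit statistics and .05 critical values taken from standard χ² tables:

```python
# Hypothetical ML fit statistics for k = 1, 2, 3 factor models
fits = [
    {"k": 1, "chi2": 153.2, "df": 35},
    {"k": 2, "chi2":  58.7, "df": 26},
    {"k": 3, "chi2":  46.9, "df": 18},
]

# Upper .05 chi-square critical values for the df differences we need
critical = {9: 16.92, 8: 15.51}

for a, b in zip(fits, fits[1:]):
    d_chi2 = a["chi2"] - b["chi2"]
    d_df = a["df"] - b["df"]
    verdict = ("real improvement" if d_chi2 > critical[d_df]
               else "capitalizing on chance")
    print(f"k={a['k']} -> k={b['k']}: "
          f"d_chi2={d_chi2:.1f} on {d_df} df: {verdict}")
```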
Other goodness of fit measures
There are a large number of alternative goodness-of-fit measures designed
to overcome the limitations of the raw χ² test.
Bollen (1989, 256-289) divides
these into overall fit measures and incremental fit measures (how much
better with one more factor?). Some of these are:
χ² / df: For comparing models with different numbers of factors,
some researchers use the ratio of χ² per degree of freedom, and
interpret a value less than 2.0 as indicating adequate fit.
Like the χ² itself, this index generally increases with sample size.
Tucker-Lewis index (TLI): This index scales the observed χ² to a
range of approximately 0 - 1, where 0 represents the χ² obtained
from a null model, and 1 represents an ideal fit.
The idea is
similar to the use of η² or ω² in ANOVA
as a measure of proportion of variance explained. The TLI is defined as

    TLI = (χ²₀ / df₀ − χ²ₘ / dfₘ) / (χ²₀ / df₀ − 1)

where the subscript 0 refers to the null model (no common factors: all
variables uncorrelated) and m refers to the model being tested.
Marsh, Balla &
McDonald (1988) found the TLI to be the only widely-used goodness of fit
index which is not affected by sample size.
Note, however, that rather large values are typically found, since
fit is compared to a baseline null model: a value of TLI <
.90 usually means that the model can be improved substantially
(Bentler & Bonnet, 1980).
AIC: Akaike's Information Criterion is becoming more
widely used as a criterion for comparing models which vary in the
number of free parameters. The essential idea is to penalize models
with more free parameters, since they are more likely to fit.
A related index is Schwarz's BIC measure, which includes the sample size
in the penalty.
    AIC = χ²ₘ + 2t        BIC = χ²ₘ + t log(N)

where t is the number of free parameters. Choose the model which gives the smallest value of AIC or BIC.
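Both criteria are easy to compute once the model χ² and the number of free parameters t are in hand; the model values below are invented for illustration:

```python
import math

def aic(chi2, t):
    """AIC = chi2 + 2t, where t is the number of free parameters."""
    return chi2 + 2 * t

def bic(chi2, t, n):
    """BIC = chi2 + t * log(N), penalizing by sample size as well."""
    return chi2 + t * math.log(n)

# Hypothetical 2- vs. 3-factor models fit to N = 200 cases
models = {"2-factor": (58.7, 25), "3-factor": (46.9, 33)}
for name, (chi2, t) in models.items():
    print(name, round(aic(chi2, t), 1), round(bic(chi2, t, 200), 1))
# Choose the model with the smallest AIC (or BIC).
```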
- Need rotation only because we can't visualize in k dimensions,
but have to look at a table of loadings.
All rotations are equally good from a mathematical standpoint, but may differ
substantively or in terms of interpretability.
Does prior evidence warrant assumption of orthogonal factors?
If so, use orthogonal rotation (e.g., varimax); otherwise use oblique
rotation (e.g., promax).
- Oblique rotations typically give simpler structure (loadings),
at expense of having to also interpret factor correlations.
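For reference, varimax (the orthogonal rotation mentioned above) can be sketched with the standard SVD-based algorithm in a few lines of numpy; the loading matrix below is invented for illustration:

```python
import numpy as np

def varimax(L, tol=1e-8, max_iter=500):
    """Orthogonal varimax rotation of a p x k loading matrix L,
    using the standard SVD-based iteration; returns rotated loadings."""
    p, k = L.shape
    T = np.eye(k)
    var_old = 0.0
    for _ in range(max_iter):
        LT = L @ T
        # Gradient of the varimax criterion at the current rotation
        G = L.T @ (LT ** 3 - LT @ np.diag((LT ** 2).sum(axis=0)) / p)
        U, s, Vt = np.linalg.svd(G)
        T = U @ Vt
        var_new = s.sum()
        if var_new - var_old < tol:
            break
        var_old = var_new
    return L @ T

# Unrotated loadings from a hypothetical 2-factor solution
L = np.array([[0.8, 0.4], [0.7, 0.5], [0.6, 0.4],
              [0.5, -0.6], [0.4, -0.5], [0.4, -0.6]])
print(np.round(varimax(L), 2))  # loadings move toward simple structure
```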
If oblique rotation is used, do the factor correlations differ from zero?
Does the pattern of rotated loadings fit with hypotheses?
In oblique solutions, it is often better to interpret the factor
structure matrix (correlations of the variables with the factors)
than the factor pattern matrix (loadings: weights used to
calculate variable standard scores from factor standard scores).
[These are the same in orthogonal solutions.]
Moreover, the factor structure may be expected to remain stable over shifts
in other factors which appear in an analysis, while the factor pattern
usually does not.
Many people use |λ| ≥ .3 or .4 as a criterion for salient
loadings without any justification. Monte Carlo studies by Pennell (1968)
and Cliff &
Hamburger (1967) suggest that the correlations in the factor structure may
be judged roughly using the formula for the standard error of an ordinary
raw correlation, doubled (i.e., 2 × (1 − r²) / √n) to
accommodate capitalization on chance. Horn (1967) and
Humphreys et al. (1969) show that loadings arising by chance
can be of impressive size. The less exploratory the study, the less
capitalization on chance can occur.
The vague basis for the .3-.4 rule of thumb appears to be this:
With N=100, the minimum significant correlation at p< .05 is about 0.2.
Doubling this gives 0.4.
By this rule of thumb, interpreting a structure correlation of 0.3
as significant would require N>175.
Note that with very large sample sizes, loadings so small as to be
uninterpretable may still be significant. This may be another reason
for the popularity of 0.3 as an absolute minimum.
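A sketch of the arithmetic behind this rule of thumb, using the normal approximation z = 1.96 for the p < .05 criterion (my simplification; the N > 175 in the text comes from a slightly stricter significance criterion):

```python
import math

def salient_threshold(n, r=0.0, z=1.96):
    """Doubled significance criterion for a raw correlation:
    2 * z * (1 - r**2) / sqrt(n)."""
    return 2 * z * (1 - r ** 2) / math.sqrt(n)

def n_required(threshold, z=1.96):
    """Sample size at which a structure correlation of the given size
    clears the doubled criterion (r = 0 case)."""
    return math.ceil((2 * z / threshold) ** 2)

print(round(salient_threshold(100), 2))  # about 0.4 with N = 100
print(n_required(0.3))                   # close to the N > 175 noted above
```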
All interpretations of factors are post hoc unless subjected to confirmation
by (cross-)validation or confirmatory hypothesis testing.
Should be regarded as tentative, subject to further research rather than as final.
Procrustes rotation to a specified factor pattern-- Specify the pattern of
zero and non-zero loadings, and attempt to rotate the loading matrix to
this pattern. How close can you come?
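When the target pattern is fully specified, the classic orthogonal Procrustes solution finds the best-fitting rotation via an SVD. A numpy sketch, with an invented target pattern and loadings:

```python
import numpy as np

def procrustes_rotation(L, target):
    """Orthogonal Procrustes: find the rotation T minimizing
    ||L @ T - target|| via the SVD of L.T @ target, and return the
    rotated loadings."""
    U, _, Vt = np.linalg.svd(L.T @ target)
    T = U @ Vt
    return L @ T

# Hypothesis: variables 1-3 load only on factor 1, 4-6 only on factor 2
target = np.array([[0.7, 0.0], [0.7, 0.0], [0.7, 0.0],
                   [0.0, 0.7], [0.0, 0.7], [0.0, 0.7]])
# Unrotated loadings: the same structure mixed by an arbitrary rotation
theta = np.deg2rad(35)
mix = np.array([[np.cos(theta), -np.sin(theta)],
                [np.sin(theta),  np.cos(theta)]])
L = target @ mix
rotated = procrustes_rotation(L, target)
print(np.round(rotated, 2))  # recovers the hypothesized pattern
```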
- Many of these arbitrary rules-of-thumb disappear when CFA is used!
Reformulation of hypotheses and/or tests
Summarize the discrepancies between the hypotheses and the rotated solution.
Might there be a need to redesign or replace any of the tests?
If measures of reliability are available, the unique variance can be
partitioned into unreliability (error variance) and specific variance.
Measures with very small communalities (large unique variance) do not
measure what the other tests measure. Perhaps you need to add other
indicators of what these tests measure.
Measures with very large communalities (> .95) cause numerical
problems in maximum likelihood solutions ("Heywood cases").
Cross validation or replication
Factors should be regarded as tentative until replicated. Just as in
regression, the factor analysis model is fit by maximizing
goodness-of-fit in the sample, i.e., by minimizing some function,
F(S, Σ̂), of the difference between the actual and fitted
covariance matrices. A future sample will not fit as well.
Some people suggest splitting the sample into halves in an exploratory study,
using one-half to develop hypotheses, and the other half for confirmatory
testing. This reduces the effective sample size for either part of the study,
but it does provide for validation within a single study. The halves should
be randomly determined.
In the cross-validation design, the sample is split into random
half-samples, 1 and 2. Let S₁ and S₂ be the variance-covariance
matrices for the two sub-samples, and let Σ̂(k|1) and
Σ̂(k|2) be the reproduced covariance matrices
from fitting structural model Mₖ to samples 1 and 2.
Then an index of cross-validation is the goodness-of-fit measure
between the data for sample 1, S₁, and the fitted
matrix for sample 2, Σ̂(k|2) (and vice versa).
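A sketch of such an index using the ML discrepancy function F(S, Σ) = ln|Σ| − ln|S| + tr(SΣ⁻¹) − p. For simplicity, each half-sample's covariance matrix stands in for a model-implied Σ̂ here; a real application would use the fitted matrix from the structural model:

```python
import numpy as np

def ml_discrepancy(S, Sigma):
    """ML discrepancy F(S, Sigma) = log|Sigma| - log|S|
    + tr(S Sigma^-1) - p; zero when Sigma = S."""
    p = S.shape[0]
    _, logdet_Sigma = np.linalg.slogdet(Sigma)
    _, logdet_S = np.linalg.slogdet(S)
    return logdet_Sigma - logdet_S + np.trace(S @ np.linalg.inv(Sigma)) - p

# Simulated data split into random halves
rng = np.random.default_rng(1)
X = rng.multivariate_normal(np.zeros(4), np.eye(4) + 0.5, size=400)
half1, half2 = X[:200], X[200:]
S1 = np.cov(half1, rowvar=False)
S2 = np.cov(half2, rowvar=False)
print(round(ml_discrepancy(S1, S2), 3))  # cross-validation index
```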
Number of factors
Specify the number of factors based on exploratory analyses.
Pattern of loadings: fixed vs. free parameters.
Specify a hypothesis by constraining certain parameters in the factor
matrices to be zero. The hypothesis is confirmed to the extent that
the model still fits.
Note: For a LISREL factor analysis model to be identified,
it is necessary to fix at least one loading on each factor to a non-zero
value (e.g., 1.0) in order to fix the measurement scale of that factor.
The LISREL program calculates "modification indices" for
each fixed and constrained parameter;
PROC CALIS calls these
"Lagrange multiplier" tests. This index is the expected decrease
in χ² if a single constraint in the hypothesis is relaxed, and all
estimated parameters are held fixed at their estimated values.
Each modification index is a χ² with 1 df, and the parameter with
the largest index will improve fit maximally. Relaxing parameters based on
the modification index is only recommended when the parameter(s) freed
make sense from a substantive point of view.
Similarly, the t-values for each free parameter provide a test of the
hypothesis that the parameter equals 0.
PROC CALIS provides
a Wald test statistic as well, which is a 1 df χ² value.
Both statistics evaluate whether a restriction (setting the parameter = 0)
can be imposed on the estimated model.
Nested hypotheses and difference in Chi Square.
As described above under "Determining the number of factors",
the χ² statistic is best regarded as a measure of badness of fit
of the hypothesis.
It makes sense to compare a series of hypotheses, H1, H2,
H3, ..., such that H1 is the most stringent or restricted
hypothesis, and H2, H3, ... successively relax some of the
restrictions. If Hi is wholly included in Hi+1, then the
difference in χ² between them can be regarded as a test of
the parameters that are fixed in Hi but free in Hi+1.
For example, if H1 and H2 both specify the same factor pattern,
but H1 fixes the factor correlations, Φ = I, while H2
allows the factor correlations to be free, the χ² difference is
attributable to the correlations among the factors.

    H1 vs. H2: Φ = I is tested by Δχ² = χ²₁ − χ²₂ on Δdf = df₁ − df₂ degrees of freedom.
Other measures of goodness of fit.
LISREL and PROC CALIS also give other indices which are
useful in assessing the fit of a hypothesis:
Goodness of fit index (GFI): A measure of the relative amount of
variances and covariances accounted for by the model, and an adjusted
goodness of fit value (AGFI), adjusted for degrees of freedom. Both
measures are between 0 and 1, where 1 = perfect fit. Unlike the χ²,
Jöreskog & Sörbom (1984) claim that both the GFI and AGFI indices are
relatively independent of sample size and relatively robust against departures
from normality. Their distributional properties are unknown, however, so there
is no significance test associated with them.
It should be emphasized that the measures χ², GFI, and AGFI are
measures of the overall fit of the model to the data and do not
express the quality of the model by any other criteria. For example,
it can happen that the overall fit of the model is very good, but one or
more relationships in the model is poorly determined (as indicated by the
squared multiple correlations), or vice versa. Furthermore, if any of
the overall measures indicates that the model does not fit well, that fact
does not tell what is wrong with the model. Diagnosing what part of the
model is wrong can be done by inspecting the normalized residuals (which
correlations are not well fit?) and/or the modification indices
(which fixed parameters might be relaxed?).
- Absolute vs. relative measures. There are now many new measures of goodness of fit, but they may be classified as:
- absolute measures:
how well does this model fit?
- relative measures:
how well does this model fit compared to the null model, the saturated model, or a simpler model?
Of all available software, only AMOS provides these model comparison statistics
when you fit a series of models.
A SAS macro provides similar model comparison statistics
for a set of models fit using PROC CALIS.
- Bentler & Chou (1987). Practical issues in structural modeling.
Sociological Methods and Research, 16(1), 78-117.
- LISREL 8 and PRELIS2: Getting Started
A detailed illustrated introductory guide
to LISREL and PRELIS from the University of Texas
- Structural equation models:
© 1995 Michael Friendly