Multivariate Data Analysis
Several research problems, involving ANOVA, MANOVA, repeated measures,
logistic regression, or factor analysis,
are described below. There is also a "free choice" topic.
Your goal is to demonstrate what you've learned about
the methods of analysis in the second half of the course.
The SAS input files are linked on this page; some R and SPSS versions are
available on the Hebb server or the web.
For TWO of these problems,
- perform the analyses guided by the Questions
section (some are more detailed than others),
- answer the questions, and
- prepare a Results/Discussion
section, with a description of your methods of analysis,
results and your interpretation, and any
accompanying figures and/or tables deemed necessary,
suitable for a research report on the data.
Place the Results/Discussion section first, followed by
answers to the questions, as seems appropriate for the problem.
To avoid unnecessary
duplication, in answering the questions you may refer to results
already described in your Results section. Except for the
"free choice" question, you need not provide
any general introduction, methods, or conclusions sections.
For ease of reading, please format your paper
with figures and tables
presented inline where possible, rather than as a manuscript submission,
where figures and tables generally appear at the end.
If you use R and RStudio, you may find it convenient to write your
reports using R Markdown,
which allows you to mix normal writing with R code and output.
A behavioural manual was developed to supplement the treatments
offered by weight-loss clinics. The manual described techniques for
self-monitoring, developing effective coping strategies, changing
eating habits, and avoiding regaining the lost weight over time. Two
clinics were selected for study and each ran two groups at different
times during the same evening of each week. Within each clinic one
group was given the behavioural manual in addition to the regular
package while another group was not given the manual. In addition,
it was thought that the length of time an individual had been trying
to lose weight might affect the outcome, so the volunteers were
classified as "experienced slimmers" or "novice
slimmers". The between-S design is thus a 3-way factorial,
Condition x Slimming Status x Clinic.
Weight and body girth measures were taken on three occasions: 9
weeks, 3 months, and 1 year. The weight measures were first
expressed as a percentage overweight value, taking the person's height
and age into account. These overweight percentages were then
expressed as a percentage change on each occasion, relative to the
initial baseline value taken prior to the start of the course.
For example, a value of -5.5 for OW9 means a 5.5% decrease in
the overweight percentage at 9 weeks, relative to the overweight
percentage at baseline.
Similarly, the girth measures (the 3 months values to be analyzed
here) were expressed as percentage change from the baseline values.
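As a minimal sketch of the arithmetic behind these change scores (written in Python for illustration; the function name and the numbers are made up and are not taken from the slim data):

```python
def percent_change(baseline, later):
    """Percentage change of `later` relative to `baseline`."""
    if baseline == 0:
        raise ValueError("baseline must be non-zero")
    return 100.0 * (later - baseline) / baseline

# Hypothetical person: 40% overweight at baseline, 37.8% overweight at 9 weeks.
ow9 = percent_change(40.0, 37.8)
print(round(ow9, 1))  # a decrease shows up as a negative value: -5.5
```

The same calculation applies to each girth measure, with the baseline girth in place of the baseline overweight percentage.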
The data are contained in the SAS file slim.sas in the
N:\psy6140\data directory on the Hebb server.
An SPSS version is available in N:\psy6140\lib\spss\slim.sav, or
on the web. There is also a CSV file, slim.csv
that can be read in R or other software.
The variables are:
COND - Condition: 1=Experimental, 2=Control
STATUS - Slimming status: 1=Experienced, 2=Novice
CLINIC - A or B
OW9 OW3 OW1 - change in overweight percentage, at 9 weeks, 3 months, 1 year,
relative to baseline.
BUST -- ARM - percentage change in various girth measures at 3 months vs. baseline.
Recent work suggests that Alzheimer's may involve pathological
changes in the central cholinergic system which result in
deterioration in memory. If so, it might be possible to halt or slow
down the memory impairment by long-term dietary supplements of
lecithin, a chemical precursor of choline.
- The researchers' first concern was with the overweight variables.
Carry out a multivariate analysis to determine if mean
differences exist in the OW variables according to the
between-S variables. Perform a parallel analysis treating the
OW variables as a repeated measure factor (for this purpose,
assume the measures were equally spaced in time). Summarize
and contrast the results of these analyses.
- The researchers next wished to assess the impact of condition and
status on the girth change measures; in particular they wished
to know if the behavioural manual or slimming status had
differential effects on the measures at different body
locations. Perform an appropriate analysis to answer these
questions. Summarize these results and describe how your
analyses relate to the questions asked.
- A final question is whether weight change measures add anything to
the analysis of treatment and status effects.
Repeat the analysis performed for
the previous question, but enter the overweight measures as
A study was carried out with two randomly assigned groups of
Alzheimer's patients, one group being given lecithin and the other
given a placebo over a 6 month period. To assess memory functioning
in a sensitive way, two types of free recall tests were given to each
subject at each of five times: 0, 1, 2, 4, and 6 months. In the
first type the same words were repeated at each test; in the
second, different but equivalent words were used each time.
Hence, differences in performance on the two types of tests should be
attributable to long-term learning.
The design therefore has one between-S factor and two within-S
factors. The major question is whether the difference in performance
on the two test types is the same or not for the two groups.
The data are contained in the file lecithin.sas on
An SPSS .sav file version is available in
on the web.
The CSV version is lecithin.csv.
Scores on the repeated test are denoted A1 - A5; scores
on the non-repeated test are B1 - B5. Each score is the number of
words recalled out of 30. Group is coded 1 = Placebo, 2 = Lecithin.
- Examine the data for multivariate outliers, and examine the need
for a transformation of these variables to approximate
symmetry. [Since the data are counts out of a maximum, they
are analogous to proportions.]
- Carry out the complete repeated measures analysis for the 3-way
design for these data, with appropriate tests for (a) whether
assumptions of the univariate (mixed model) analysis are met;
and (b) polynomial trends for the TIME factor. Note that the
time points are unequally spaced, so you will have to include
the time values in the REPEATED statement.
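- In a data step, construct new variables:
ABAR = mean( of A1-A5 );    /* mean of the repeated-words scores */
BBAR = mean( of B1-B5 );    /* mean of the non-repeated-words scores */
SBAR = mean( of A1--B5 );   /* mean of every variable stored from A1 through B5 */
ABDIF= ABAR - BBAR;         /* repeated minus non-repeated */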
Carry out a univariate analysis of group differences on each
of the variables ABAR BBAR SBAR ABDIF and show how these
relate to the analyses carried out in step (2).
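Because the occasions are at 0, 1, 2, 4, and 6 months, the familiar equally spaced trend coefficients (such as -2 -1 0 1 2 for linear) do not apply. The sketch below shows one way to obtain orthogonal polynomial contrasts for unequal spacing, by Gram-Schmidt orthogonalisation of powers of the time values. It is an illustration in Python with a hypothetical function name, not the computation that any particular package performs.

```python
import math

def orthogonal_polynomials(times, max_degree):
    """Orthonormal polynomial contrasts (degree 1..max_degree) for the
    given time points, via Gram-Schmidt on 1, t, t^2, ..."""
    n = len(times)
    basis = [[1.0] * n]   # constant term, used only for orthogonalising
    contrasts = []
    for degree in range(1, max_degree + 1):
        v = [float(t) ** degree for t in times]
        # Remove the part of v explained by every lower-degree vector.
        for b in basis:
            coef = sum(x * y for x, y in zip(v, b)) / sum(y * y for y in b)
            v = [x - coef * y for x, y in zip(v, b)]
        norm = math.sqrt(sum(x * x for x in v))
        v = [x / norm for x in v]   # scale to unit length
        basis.append(v)
        contrasts.append(v)
    return contrasts

times = [0, 1, 2, 4, 6]   # months, as in the lecithin study
linear, quadratic = orthogonal_polynomials(times, 2)
print("linear:   ", [round(c, 3) for c in linear])
print("quadratic:", [round(c, 3) for c in quadratic])
```

The linear contrast is proportional to each time value minus the mean time (2.6 months), so the later, more widely spaced occasions get larger weights than they would under equal spacing.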
3. Survival in the ICU
This question uses logistic regression to study the survival of patients
following admission to an adult intensive-care-unit (ICU). The data consist of
a sample of 200 subjects who were part of a much larger study. The data are
taken from Applied Logistic Regression by Hosmer and Lemeshow.
The ICU data have 200 rows and 21 variables. The rows correspond to the 200
patients, while the columns correspond to the input and response variables.
Column 1 is an identification code (ID) unique to each patient, which can
be ignored in the analysis, but which should be used to identify
The response variable Y= died indicates
whether the subject was alive (0) or dead (1) when he/she left ICU.
The remaining variables in columns 3-21 are the predictor
variables. The binary predictor variables are all coded so that
the value 1 corresponds to a possible risk factor.
In addition, the 3-level variable race has been supplemented
by a binary variable white and the variable coma
supplemented by a binary variable uncons = (coma>0)
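A minimal sketch of these two recodes, in Python for illustration. The records are hypothetical, and the coding white = 1 for race code 1 is an assumption here; check the data file for the coding actually used.

```python
def add_binary_indicators(patient):
    """Return a copy of a patient record with `white` and `uncons` added."""
    out = dict(patient)
    # ASSUMPTION: white = 1 when race is coded 1 (White), else 0.
    out["white"] = 1 if patient["race"] == 1 else 0
    # uncons = (coma > 0): deep stupor (1) or coma (2) both count.
    out["uncons"] = 1 if patient["coma"] > 0 else 0
    return out

# Hypothetical records, not rows from the ICU data.
patients = [
    {"id": 1, "race": 1, "coma": 0},
    {"id": 2, "race": 3, "coma": 2},
]
recoded = [add_binary_indicators(p) for p in patients]
for row in recoded:
    print(row)
```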
A code sheet for the variables is provided in Table 1. You can find the
data in N:\data\icu.sas as a SAS input file,
in N:\data\icu.sps as an SPSS input file,
in icu.sav as an SPSS system file,
and in N:\data\icu.dat as a plain ASCII data file (for use with
any other statistics package).
In R, the data set ICU is contained in the vcdExtra package.
In SAS/INSIGHT, you can do a logistic regression from the Fit Y X
menu, by selecting Options -> Response Dist binomial, Link function: logit.
Table 1: Code Sheet for ICU Data
|Variable ||Description ||Codes/Values ||Name |
|1 ||Identification Code ||ID Number ||id |
|2 ||Vital Status ||0 = Lived; 1 = Died ||died |
|3 ||Age ||Years ||age |
|4 ||Sex ||0 = Male; 1 = Female ||sex |
|5 ||Race ||1 = White; 2 = Black; 3 = Other ||race |
|6 ||Service at ICU admission ||0 = Medical; 1 = Surgical ||service |
|7 ||Cancer part of present problem ||0 = No; 1 = Yes ||cancer |
|8 ||History of Chronic Renal Failure ||0 = No; 1 = Yes ||renal |
|9 ||Infection Probable at ICU admission ||0 = No; 1 = Yes ||infect |
|10 ||CPR prior to ICU admission ||0 = No; 1 = Yes ||cpr |
|11 ||Systolic blood pressure at ICU admission ||mm Hg ||systolic |
|12 ||Heart Rate at ICU admission ||beats/min ||hrtrate |
|13 ||Previous ICU admission within 6 mths. ||0 = No; 1 = Yes ||previcu |
|14 ||Type of admission ||0 = Elective; 1 = Emergency ||admit |
|15 ||Fracture: Long bone, multiple, neck, single area or neck ||0 = No; 1 = Yes ||fracture |
|16 ||PO2 from initial Blood Gases ||0 = >60; 1 = ≤60 ||po2 |
|17 ||PH from initial Blood Gases ||0 = ≥7.25; 1 = <7.25 ||ph |
|18 ||PCO2 from initial Blood Gases ||0 = ≤45; 1 = >45 ||pco |
|19 ||Bicarbonate from initial Blood Gases ||0 = ≥18; 1 = <18 ||bic |
|20 ||Creatinine from initial Blood Gases ||0 = ≤2.0; 1 = >2.0 ||creatin |
|21 ||Level of Consciousness at ICU admission ||0 = No Coma/Stupor; 1 = Deep Stupor; 2 = Coma ||coma |
- Run a logistic regression using all 19 predictor variables --
age sex white service cancer renal infect
cpr systolic hrtrate previcu admit fracture po2 ph pco bic
Which variables appear to be strong predictors of survival?
Which variables appear to be unnecessary?
Are there any variables which, on logical grounds, should be
included in any model?
- Use forward, backward, or stepwise selection to determine
a minimal model where all predictors are individually significant
at the 0.05 level, but perhaps forcing any variables you consider
necessary to remain.
Compare this model to your model with all variables in terms
of goodness of fit and lack of fit.
- Investigate whether any quadratic terms (in a quantitative predictor)
or interaction terms are necessary among those predictors in your model
from the previous step.
- Without further analysis (or biological background) does the model seem
reasonable? Which terms did you expect to see?
- In a sample of this size and heterogeneity, it may be considered likely
that there are one or more cases of high leverage or influence.
Find the 3-6 cases with the largest value of Cook's D (the C statistic
in SAS) and interpret the nature of their influence on your model.
[Hint: Use the %inflogis macro.]
- Investigate the predicted probability of death while in the ICU
for the patients in this sample.
Use your final model to obtain predicted probabilities.
Plot Pr(died) as a function of age, with other variables held fixed
at high or low values of some of your important risk factors.
- Given the level of success of your model in predicting a patient's
outcome, how might this study be re-done to increase the accuracy of
Is there an alternative design, or other variables which should be
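For the predicted-probability question above, a logistic model turns a linear predictor into a probability through the inverse logit. The following is a minimal sketch in Python, under loud assumptions: the coefficients are invented for illustration (they are not estimates from the ICU data), and uncons stands in for whichever risk factors you retain in your final model.

```python
import math

def inv_logit(eta):
    """Map a linear predictor (log odds) to a probability."""
    return 1.0 / (1.0 + math.exp(-eta))

# HYPOTHETICAL coefficients, invented for illustration only;
# substitute the estimates from your own final model.
coef = {"intercept": -5.0, "age": 0.04, "uncons": 3.0}

def pr_died(age, uncons):
    """Predicted Pr(died) at a given age and unconsciousness status."""
    eta = coef["intercept"] + coef["age"] * age + coef["uncons"] * uncons
    return inv_logit(eta)

# One curve over age for each fixed value of the risk factor.
ages = range(20, 91, 10)
for uncons in (0, 1):
    curve = [(age, round(pr_died(age, uncons), 3)) for age in ages]
    print("uncons =", uncons, curve)
```

Plotting each list of (age, probability) pairs as a separate line gives the kind of display described: Pr(died) against age, with the other predictors held at chosen low or high values.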
Holzinger & Swineford (1939) gave 24 tests of a variety of
psychological abilities to junior high school students at two
schools. These data are typical of the kinds of ability tests which
have been used throughout the history of factor analysis and are one
of the most widely studied sets of correlations in the factor
analysis literature. The factor analytic problem is to determine the
number and kind of dimensions or latent abilities which may be used
to describe the correlations among these tests. Sample test items
from the tests are given in HolzingerSwineford.pdf.
The orienting questions here are more detailed
than in other problems; you are free to choose a reasonable subset.
The data for this problem are available in several forms on the Hebb N: disk:
The raw data for both samples are contained in the file
in SPSS format as
and in CSV format as
psych24r.csv. An R script,
psych24r.R is also provided for reading in the raw data from the CSV file.
The raw data also give sex and chronological age
for all subjects. The two samples are distinguished by the variable
The correlations for the Grant-White sample (N=145) are
contained in the file
psych24c.sas. The means, standard
deviations and measures of reliability(2) are
also contained in the correlation file for the Grant-White data.
The same correlations are also provided in R format, in
(2) Gorsuch (1974) reports that "the raw data
does not always agree with the published statistics". Here
we will assume that the correlations are correct.
The common practice in analysis of these data is to include
variables 25 and 26 but not variables 3 and 4 (25 and 26 were attempts
to develop better tests for variables 3 and 4) when the Grant-White
sample alone is analyzed. However, in order to be able to compare
the results of the two schools, we will ignore variables 25 and 26
here. Moreover, to reduce the size of the problem somewhat, we will
only use the first 18 variables, named V1-V18 in the SAS files. (For
those wishing to try the AMOS program, two raw data files, GRANT
AMD and PASTEUR AMD, containing only V1-V18 are available
on the Hebb server.)
From the descriptions of the tests, attempt to develop some
theory, however vague, of the manifest content of these 18 tests.
Which ones should tend to tap the same underlying abilities? How many
different abilities? What results are reported in previous analyses?
Do they make sense?
Examine the raw data in the Grant-White sample (GRP=1) for:
- Univariate outliers or unusual observations on individual
variables (Proc UNIVARIATE with PLOT option).
- Univariate normality (skewness, kurtosis)
- Multivariate outliers: observations whose distance from the
centroid is large (see OUTLIER SAS on the class disk)
- Do the means and standard deviations agree with the published
Also, is there any evidence that the means of the two samples differ
substantially for any of the variables?
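For the multivariate-outlier check listed above, "distance from the centroid" is usually taken to be the squared Mahalanobis distance, D² = (x - m)' S⁻¹ (x - m), where m is the mean vector and S the sample covariance matrix. The sketch below computes it in plain Python for a small hypothetical data set with two variables. It is an illustration only, not the class OUTLIER program, and all function names are made up.

```python
def mean_vector(rows):
    n = len(rows)
    return [sum(col) / n for col in zip(*rows)]

def covariance_matrix(rows):
    """Sample covariance matrix (divisor n - 1)."""
    n = len(rows)
    m = mean_vector(rows)
    p = len(m)
    return [[sum((r[i] - m[i]) * (r[j] - m[j]) for r in rows) / (n - 1)
             for j in range(p)] for i in range(p)]

def solve(a, b):
    """Solve a x = b by Gaussian elimination with partial pivoting."""
    n = len(b)
    aug = [list(a[i]) + [b[i]] for i in range(n)]
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(aug[r][col]))
        aug[col], aug[pivot] = aug[pivot], aug[col]
        for r in range(col + 1, n):
            factor = aug[r][col] / aug[col][col]
            for c in range(col, n + 1):
                aug[r][c] -= factor * aug[col][c]
    x = [0.0] * n
    for i in reversed(range(n)):
        s = sum(aug[i][j] * x[j] for j in range(i + 1, n))
        x[i] = (aug[i][n] - s) / aug[i][i]
    return x

def mahalanobis_sq(rows):
    """Squared Mahalanobis distance of each row from the centroid."""
    m = mean_vector(rows)
    s = covariance_matrix(rows)
    out = []
    for r in rows:
        d = [x - mu for x, mu in zip(r, m)]
        z = solve(s, d)               # z = S^{-1} d without forming the inverse
        out.append(sum(di * zi for di, zi in zip(d, z)))
    return out

# Hypothetical scores on two tests; the last case breaks the positive trend.
scores = [(10, 12), (12, 14), (14, 15), (16, 19), (18, 20), (11, 21)]
d2 = mahalanobis_sq(scores)
for case, value in enumerate(d2, start=1):
    print(case, round(value, 2))
```

The last case has the largest D² even though neither of its scores is extreme on its own, which is the sense in which an outlier can be multivariate without being univariate. With p variables, large D² values are commonly judged against a chi-square distribution with p degrees of freedom.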
Exploratory Factor Analysis
Use the correlation matrix from the Grant-White data (in
for this part.(3)
(3) Since the data set contains the standard
deviations, the analysis can be done using either the correlation
or covariance matrix.
- Determine the number of factors necessary to adequately explain
the correlations among these tests. Do various criteria tend
to converge or do they indicate different numbers of factors?
- Find a rotated factor solution which provides an interpretable
description of the correlations among the tests. Are
orthogonal or oblique factors more reasonable? If oblique, are
all factors correlated, or can some pairs be considered
Confirmatory Factor Analysis
This part of the problem is optional, for extra credit.
To test your
hypothesized factor structure there are two possibilities:
Find an interesting set of data for which the methods of the second half of the course (ANOVA/MANOVA, logistic regression, PCA/FA) would be
appropriate and which is neither too simple nor too complex (e.g., for
2-3 between-S factors, 2-5 response measures or repeated measures).
If you choose this, you should
make up your own questions, comparable in scope to those in the
Answer them, and prepare Methods and Results sections suitable
for a research report. You will also need to provide a brief introduction
to provide the context.
- The simplest is just to perform a Procrustes target rotation
specifying a target factor pattern determined from your
exploratory analyses. [The target factor pattern is a
matrix of 0s and 1s where the 1s specify the variables
hypothesized to load on each factor. For PROC FACTOR, this
matrix is read in with columns specifying the variables.]
- A stronger test is available by fitting a restricted factor model
using PROC CALIS, LISREL, AMOS, or the R packages sem