Exploring the Baseball data
with SAS and SAS/INSIGHT
Here are some things to try, using the dataset BASEBALL in the PSY6140 library:
Locating the Baseball files
The datasets are stored as SAS system files, in the
folder (directory) known to SAS as PSY6140.
On the Hebb/Acadlabs servers, this library name is automatically allocated when
SAS starts.
In SAS programming statements, refer to the data in a PROC step or DATA step by the following names:
SAS name
------------
psy6140.baseball
psy6140.pitcher
psy6140.team
e.g.,
proc univariate data=psy6140.baseball;
var ....
The SAS files which create these datasets are stored in the
directory N:\psy6140\data
on the Hebb server.
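Before starting an analysis, it can help to see which variables each dataset contains. The following is a minimal sketch, assuming the PSY6140 library is already allocated as described above; the choice of five observations is arbitrary.

```sas
/* List variable names, types, and labels for the BASEBALL dataset. */
/* Assumes the PSY6140 libref has been allocated automatically.     */
proc contents data=psy6140.baseball;
run;

/* Print the first five observations as a quick check (5 is arbitrary). */
proc print data=psy6140.baseball(obs=5);
run;
```

The same two steps can be repeated for psy6140.pitcher and psy6140.team.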
Starting SAS/INSIGHT
Start SAS/INSIGHT by typing insight tools
in the command area
in the upper left corner of the SAS Program Manager window. You should refer
to the document, Using SAS/INSIGHT
for other ways to start SAS/INSIGHT.
Viewing the data
In the Open panel, select the PSY6140 library, and
the name of the dataset (BASEBALL) you want to work with.
The dataset appears in a spreadsheet format.
Some things to try:
- Double-click on the NAME variable to see its properties. Assign NAME the LABEL role. Ctrl-Click to deselect the
NAME variable.
- Double-click on the "4" next to Andre Dawson. The Examine Observations
panel pops up, showing his statistics.
- From the Edit menu, select Observations, then Find...
Highlight all the players on the Atlanta team by
selecting TEAM, and ATL from the panels. When you click Apply or OK,
all players are highlighted in the spreadsheet.
What happens if you then click a color or symbol in the tool bar?
Overview of the data
Select the variables SALARY (first), RUNS (next), then YEARS, by Ctrl-clicking each in the data table.
Select Scatter Plot (Y X) from the Analyze menu. Note that the order in which variables are selected determines
their order in the scatterplot matrix.
The scatterplot matrix is linked to the data table (and to all active views of the dataset):
- Click on any point in the scatterplot matrix. The NAME label appears, and the corresponding point is highlighted in all other
pairwise plots. Ctrl-click on another point to extend the selection.
- Click on any observation number in the data table. The corresponding point is highlighted in the scatterplot matrix.
- Double-click on any observation number in the data table. A popup window shows all values for this observation.
Correlations, means, standard deviations, etc. for any number of variables
can be obtained by selecting Multivariate (Y X) from the Analyze
menu.
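If you prefer programming statements, a roughly comparable summary can be obtained with PROC CORR. This is only a sketch of one alternative, not what SAS/INSIGHT runs internally; the three variables are the ones selected above.

```sas
/* Simple statistics (N, mean, std. dev., min, max) and Pearson      */
/* correlations for the three variables used in the scatterplot      */
/* matrix. Assumes the PSY6140 libref is allocated.                  */
proc corr data=psy6140.baseball;
   var salary runs years;
run;
```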
Preliminary analyses
Examine the univariate distribution of any variable(s) by first selecting the variable
(click on its name) in the data window. Then choose Distribution (Y) from the
Analyze menu. Once the graphs appear, you can add various smoothed curves
using options on the Curves menu.
- Choose Kernel Density... from the Curves menu, press OK to select the
default. A smoothed, non-parametric
curve is fit to the distribution and displayed on the histogram.
- Choose Parametric Density... from the Curves menu, press OK to select the
default. A smoothed, parametric (Normal)
curve is fit to the distribution and displayed on the histogram.
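A similar display can be sketched with programming statements using the HISTOGRAM statement of PROC UNIVARIATE. The choice of SALARY is just an illustration, and the default kernel bandwidth here need not match the SAS/INSIGHT default.

```sas
/* Histogram of SALARY with a kernel density estimate and a fitted  */
/* normal curve overlaid. NOPRINT suppresses the printed tables so  */
/* that only the graph is produced.                                 */
proc univariate data=psy6140.baseball noprint;
   var salary;
   histogram salary / kernel normal;
run;
```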
Subgrouping the data
Click on the PSY6140.BASEBALL spreadsheet window to make it the active window.
Choose Windows:Animate... from the Edit menu.
Select the League variable.
Click Apply, then Pause.
Select "A" in the Value: panel. Note that all observations from the
American League are selected.
In the Tools panel, select a color and/or
marker shape to change the display of points for American League
players.
For some purposes, a simpler way to do the same sort of thing is to
click the "rainbow" button on the Tools panel.
In the Color Observations panel that appears, select, e.g.,
POSITION.
Then, try scatterplot(s) of some variables
(e.g., PUTOUTS vs. ASSISTS) to see them color coded by the variable
you selected.
(You can also use a continuous variable for color coding by rainbow
colors.)
Similarly, the button containing all point markers in the Tools panel
brings up a Mark Observations panel. You can assign the
same or a different variable to set the point markers for observations.
Try some simple regression models
Deselect the variables in the data table by clicking them again. Then,
select Fit (Y X) from the Analyze menu to fit a model predicting SALARY from YEARS and RUNS.
- Select the SALARY variable, then click on Y to assign it as the dependent variable.
- Select YEARS, then click on X to assign it as a predictor. Do the same with RUNS.
- Click on the Output button to see what options are available. Select Nonparametric Curves and click on loess to fit a smooth
curve to the data.
- When you have completed your OUTPUT selections, click on the Apply button in the Fit (Y X) panel.
Some other things
- How can you make a boxplot of the SALARY variable? What players have extremely high/low salaries?
- How can you make a set of parallel boxplots of SALARY, stratified by
number of YEARS in the major leagues? (Hint: assign YEARS as the X variable
in the Boxplot dialog box.)
- How can you transform a variable? (Hint: select a variable, then look in the Edit menu)
- How can you assign a format to a variable? (Hint: same hint as above)
- Examine the data table for any newly-created variables from the Fit analysis. See the VARS menu for some other variables
that can be added to the dataset.
- Some analyses add additional variables to the data table
(e.g., residuals, Cook's D, etc.).
If you want to save the modified data table, choose Save:Data...
from the File menu.
You can only save it to the HOME Library, which is located on your
F: drive. To access the modified data later, Open it from the HOME library.
- To exit SAS/INSIGHT, click in the Menu square (upper left corner) of the
data table window, and choose Close.
SAS programming statements and macro programs
There are quite a few things that can be done only with SAS programming
statements or with my SAS macros.
From the Windows menu (or the tabs at the bottom of the SAS screen), select Program Editor, and type your
statements there. Or, if you are reading this online, you can simply copy
and paste the statements into the Program Editor window.
Transforming variables
To transform variables, use a DATA step with SAS assignment
statements to create new variables. The statements below create
a copy of the baseball data in the WORK library.
data baseball;
   set psy6140.baseball;
   logsal = log10(salary);
   label logsal = 'log Salary';
run;
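After a transformation it is worth checking the result. The step below is a minimal sketch that compares SALARY and LOGSAL in the WORK copy created above; LOGSAL will be missing wherever SALARY is missing or not positive, and the NMISS column shows how many such cases there are.

```sas
/* Compare the original and transformed variables. NMISS counts     */
/* observations where the value is missing (for LOGSAL this includes */
/* any cases where SALARY was missing or not positive).              */
proc means data=baseball n nmiss min mean max;
   var salary logsal;
run;
```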
Creating dummy variables
You can create dummy (0/1) variables with SAS programming statements,
but it is simpler to use the
DUMMY macro.
For example, this call creates dummy variables for the POSITION
variable, using the short list of recodes described by the
$pos. format.
%dummy(data=baseball, var=position, format=$pos, base=first);
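For comparison, here is a sketch of the programming-statement approach for a two-level variable. It assumes that LEAGUE is a character variable whose stored value is 'A' for the American League, and the variable name AL is hypothetical, chosen only for this illustration.

```sas
data baseball;
   set baseball;
   /* A logical expression evaluates to 1 (true) or 0 (false).      */
   /* Assumes LEAGUE is stored as 'A' for the American League.      */
   /* Note: a missing LEAGUE value would also be coded 0 here.      */
   al = (league = 'A');
   label al = 'American League (1=yes, 0=no)';
run;
```

A variable with k categories needs k-1 such statements, which is why the macro is more convenient for POSITION.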
Checking for outliers
The statement below calls the
OUTLIER macro
to give an outlier plot for the current year variables.
%outlier(data=baseball, id=name,
var=logsal atbat hits homer runs rbi walks years
putouts assists errors);
Sample regression
symbol value=dot;
proc reg data=baseball;
model logsal = years runs hits rbi walks;
plot r. * p.;
run; quit;
Influence plots
The statement below calls the
INFLPLOT macro
to detect influential observations in this model.
%inflplot(data=baseball, id=name,
y=logsal,
x=years runs hits rbi walks);
Box-Cox transformations
The Box-Cox procedure finds a power transformation of the Y variable that
minimizes the MSE. This often makes the distribution of Y more symmetrical,
promotes constant variance, and makes the relations more linear.
- The statement below calls the
BOXCOX macro to transform SALARY in a
model using YEARS, RUNS, and HITS as predictors:
%boxcox(data=psy6140.baseball,
resp=salary,
model=years runs hits,
id=name,
gplot=RMSE EFFECT INFL);
- Use
%webhelp(boxcox);
to get information on the use of the boxcox macro.
(%webhelp(?);
gives some basic information on its use.)
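If the macro is not available, a Box-Cox analysis of the same model can be sketched with PROC TRANSREG in SAS/STAT. This is an alternative route, not what the BOXCOX macro does internally, and the grid of lambda values is an arbitrary choice for illustration.

```sas
/* Box-Cox transformation of SALARY in a model with YEARS, RUNS,    */
/* and HITS as untransformed (identity) predictors. The lambda grid */
/* from -2 to 2 in steps of 0.25 is an assumption for this sketch.  */
proc transreg data=psy6140.baseball;
   model boxcox(salary / lambda=-2 to 2 by 0.25) =
         identity(years runs hits);
run;
```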
Merging files
Sometimes you may want to combine variables from separate files.
For example, say you wanted to use one or more of the variables (e.g., ATTHOME)
in the TEAM file as predictors. To do this, sort both the BASEBALL
and TEAM files by the TEAM variable, then use a MERGE statement
in a DATA step:
title2 'Merge team at-home attendance with individual data';
proc sort data=psy6140.team out=team;
   by team;
run;
proc sort data=baseball;
   by team;
run;
data basenew;
   merge baseball team(keep=team atthome);
   by team;
run;
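A merge silently produces missing values when a TEAM code in one file has no match in the other. The variant below is a sketch that flags such cases with the IN= dataset option; the names INPLAYER, INTEAM, and NOMATCH are hypothetical, and both files must already be sorted by TEAM as shown above.

```sas
data basenew;
   merge baseball(in=inplayer)
         team(keep=team atthome in=inteam);
   by team;
   /* Keep only rows that correspond to a player record. */
   if inplayer;
   /* NOMATCH = 1 when the player's TEAM code was not found in the TEAM file. */
   nomatch = not inteam;
run;

/* Count matched and unmatched player records. */
proc freq data=basenew;
   tables nomatch;
run;
```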
To merge the other way, incorporating variables from the individual
file into the TEAM file, first use PROC SUMMARY to get the
mean (or median) of the variables you want, then merge as above:
title2 'Average stats by team, merge with the team data';
proc summary data=baseball nway;
   class team;
   var runs homer;
   output out=means mean=;
run;
proc sort data=team;
   by team;
run;
data teamnew;
   merge team means(keep=team runs homer);
   by team;
run;
Created: Friday, October 27, 1995, 07:24 AM
Updated 10/05/2012 15:06:58