Exploring the Baseball data with SAS and SAS/INSIGHT


Here are some things to try, using the dataset BASEBALL in the PSY6140 library:

Locating the Baseball files

The datasets are stored as SAS system files, in the folder (directory) known to SAS as PSY6140. On the Hebb/Acadlabs servers, this library name is automatically allocated when SAS starts.

In SAS programming statements, refer to the data in a PROC step or DATA step by the following names:

SAS name
------------
psy6140.baseball
psy6140.pitcher
psy6140.team
For example:
proc univariate data=psy6140.baseball;
   var ....
The SAS files which create these datasets are stored in the directory N:\psy6140\data on the Hebb server.
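
Before starting an analysis, it can help to confirm that the library is allocated and to see which variables a dataset contains. The statements below are a minimal sketch that assumes only that the PSY6140 library name is already allocated, as described above; the choice of five observations is arbitrary.

```sas
/* List the variables, their types, and labels in the player-level dataset */
proc contents data=psy6140.baseball;
run;

/* Print the first five observations as a quick check */
proc print data=psy6140.baseball(obs=5);
run;
```

The same two steps work for psy6140.pitcher and psy6140.team by changing the dataset name.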

Starting SAS/INSIGHT

Start SAS/INSIGHT by typing insight tools in the command area in the upper left corner of the SAS Program Manager window. See the document Using SAS/INSIGHT for other ways to start SAS/INSIGHT.

Viewing the data

In the Open panel, select the PSY6140 library, and the name of the dataset (BASEBALL) you want to work with. The dataset appears in a spreadsheet format.

Some things to try:

  1. Double-click on the NAME variable to see its properties. Assign NAME the LABEL role. Ctrl-Click to deselect the NAME variable.
  2. Double-click on the "4" next to Andre Dawson. The Examine Observations panel pops up, showing his statistics.
  3. From the Edit menu, select Observations, then Find... Highlight all the players on the Atlanta team by selecting TEAM, and ATL from the panels. When you click Apply or OK, all of the Atlanta players are highlighted in the spreadsheet. What happens if you then click a color or symbol in the tool bar?

Overview of the data

Select the variables SALARY (first), RUNS (next), then YEARS, by Ctrl-clicking each in the data table. Select Scatter Plot (Y X) from the Analyze menu. Note that the order in which variables are selected determines their order in the scatterplot matrix.

The scatterplot matrix is linked to the data table (and to all active views of the dataset), so observations selected in one view are highlighted in the others.

Correlations, means, standard deviations, etc. for any number of variables can be obtained by selecting Multivariate (Y X) from the Analyze menu.
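
Similar output can be produced with SAS programming statements. The sketch below is one possible non-interactive counterpart, not what SAS/INSIGHT runs internally; it assumes a SAS release that includes PROC SGSCATTER (9.2 or later), and the plots it draws are static rather than linked to the data table.

```sas
/* Scatterplot matrix of the same three variables, in the same order */
proc sgscatter data=psy6140.baseball;
   matrix salary runs years;
run;

/* Means, standard deviations, and Pearson correlations */
proc corr data=psy6140.baseball;
   var salary runs years;
run;
```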

Preliminary analyses

Examine the univariate distribution of any variable(s) by first selecting that variable (click on its name) in the data window. Then choose Distribution (Y) from the Analyze menu. Once the graphs appear, you can add various smoothed curves using options on the Graphs menu.
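
A comparable display can be sketched in SAS programming statements with PROC UNIVARIATE. SALARY is used here only as an example variable; the NORMAL and KERNEL options overlay a fitted normal curve and a kernel density estimate on the histogram.

```sas
proc univariate data=psy6140.baseball;
   var salary;
   /* Histogram with a fitted normal curve and a kernel density estimate */
   histogram salary / normal kernel;
run;
```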

Subgrouping the data

Click on the PSY6140.BASEBALL spreadsheet window to make it the active window. Choose Windows:Animate... from the Edit menu. Select the League variable. Click Apply, then Pause. Select "A" in the Value: panel. Note that all observations from the American League are selected. In the Tools panel, select a color and/or marker shape to change the display of points for American League players.

For some purposes, a simpler way to do the same sort of thing is to click the "rainbow" button on the Tools panel. In the Color Observations panel that appears, select, e.g., POSITION. Then, try scatterplot(s) of some variables (e.g., PUTOUTS vs. ASSISTS) to see them color coded by the variable you selected. (You can also use a continuous variable for color coding by rainbow colors.)

Similarly, the button containing all point markers in the Tools panel brings up a Mark Observations panel. You can assign the same or a different variable to set the point markers for observations.
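
Outside SAS/INSIGHT, a grouped scatterplot gives a similar color-coded view. This is a minimal sketch assuming a SAS release that includes PROC SGPLOT (9.2 or later); the GROUP= option assigns a distinct marker color (and, depending on the ODS style, marker symbol) to each value of POSITION.

```sas
/* PUTOUTS vs. ASSISTS, with points distinguished by fielding position */
proc sgplot data=psy6140.baseball;
   scatter x=assists y=putouts / group=position;
run;
```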

Try some simple regression models

Deselect the variables in the data table by clicking them again. Then, select Fit (Y X) from the Analyze menu to fit a model predicting SALARY from YEARS and RUNS.
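
The same model can be sketched with PROC REG. The output dataset name (salfit) and the new variable names (pred, resid, cooksd) are made up for this illustration; P=, R=, and COOKD= are the PROC REG keywords for predicted values, residuals, and Cook's D.

```sas
proc reg data=psy6140.baseball;
   model salary = years runs;
   /* Save fitted values and case diagnostics to a WORK dataset */
   output out=salfit p=pred r=resid cookd=cooksd;
run; quit;
```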

Some other things

  1. How can you make a boxplot of the SALARY variable? What players have extremely high/low salaries?
  2. How can you make a set of parallel boxplots of SALARY, stratified by number of YEARS in the major leagues? (Hint: assign YEARS as the X variable in the Boxplot dialog box.)
  3. How can you transform a variable? (Hint: select a variable, then look in the Edit menu)
  4. How can you assign a format to a variable? (Hint: same hint as above)
  5. Examine the data table for any newly-created variables from the Fit analysis. See the VARS menu for some other variables that can be added to the dataset.
  6. Some analyses add additional variables to the data table (e.g., residuals, Cook's D, etc.). If you want to save the modified data table, choose Save:Data... from the File menu. You can only save it to the HOME Library, which is located on your F: drive. To access the modified data later, Open it from the HOME library.
  7. To exit SAS/INSIGHT, click in the Menu square (upper left corner) of the data table window, and choose Close.
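
Outside SAS/INSIGHT, the boxplots in items 1 and 2 can also be sketched with PROC SGPLOT (assuming SAS 9.2 or later). This is an alternative route, not the SAS/INSIGHT answer to those questions. The DATALABEL= option labels outlying points with the player's NAME.

```sas
/* Single boxplot of SALARY, with outliers labeled by player name */
proc sgplot data=psy6140.baseball;
   vbox salary / datalabel=name;
run;

/* Parallel boxplots of SALARY, one for each value of YEARS */
proc sgplot data=psy6140.baseball;
   vbox salary / category=years;
run;
```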

SAS programming statements and macro programs

There are quite a few things that can be done only with SAS programming statements or with my SAS macros. From the Windows menu (or the tabs at the bottom of the SAS screen), select Program Editor, and type your statements there. Or, if you are reading this online, you can simply copy and paste the statements into the Program Editor window.

Transforming variables

To transform variables, use a DATA step with SAS assignment statements to create new variables. The statements below create a copy of the baseball data in the WORK library.
data baseball;
   set psy6140.baseball;
   logsal = log10(salary);        /* base-10 log of salary */
   label logsal = 'log Salary';
run;
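
After a transformation, a quick summary helps confirm that the new variable looks sensible. The sketch below assumes the WORK.BASEBALL copy created above; observations with a missing SALARY get a missing LOGSAL, and the NMISS column shows how many there are.

```sas
/* Compare the original and transformed variables, including missing counts */
proc means data=baseball n nmiss min max;
   var salary logsal;
run;
```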

Creating dummy variables

You can create dummy (0/1) variables with SAS programming statements, but it is simpler to use the DUMMY macro. For example, this call creates dummy variables for the POSITION variable, using the short list of recodes described by the $pos. format.
%dummy(data=baseball, var=position, format=$pos, base=first);
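
For comparison, a single dummy variable can be created by hand in a DATA step. This sketch does not reproduce what the DUMMY macro does internally; it assumes the WORK.BASEBALL copy created earlier and that LEAGUE is coded "A" for the American League, as in the SAS/INSIGHT example above. The names baseball2 and al are made up for the illustration.

```sas
data baseball2;
   set baseball;
   /* A comparison evaluates to 1 (true) or 0 (false) */
   al = (league = 'A');
   label al = 'American League (0/1)';
run;

/* Cross-tabulate to confirm the coding */
proc freq data=baseball2;
   tables league*al / norow nocol nopercent;
run;
```

Note that an observation with a missing LEAGUE would be coded 0 by this expression, so check the cross-tabulation before using the variable in a model.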

Checking for outliers

The statement below calls the OUTLIER macro to give an outlier plot for the current year variables.
%outlier(data=baseball, id=name,
   var=logsal atbat hits homer runs rbi walks years 
       putouts assists errors);

Sample regression

symbol value=dot;                 /* plot points as filled dots */
proc reg data=baseball;
   model logsal = years runs hits rbi walks;
   plot r. * p.;                  /* residuals vs. predicted values */
run; quit;

Influence plots

The statement below calls the INFLPLOT macro to detect influential observations in this model.
%inflplot(data=baseball, id=name,
   y=logsal, 
   x=years runs hits rbi walks);
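
If the macro is not available, PROC REG can print case-level influence diagnostics for the same model. This is a sketch of the built-in tabular output, not of what INFLPLOT draws; it assumes the WORK.BASEBALL dataset with LOGSAL created earlier.

```sas
proc reg data=baseball;
   id name;                       /* label each case with the player's name */
   /* R prints residual analysis with Cook's D; INFLUENCE adds hat values, RSTUDENT, DFFITS, DFBETAS */
   model logsal = years runs hits rbi walks / r influence;
run; quit;
```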

Box-Cox transformations

The Box-Cox procedure finds a power transformation of the Y variable that minimizes the MSE of the regression model. This often makes the distribution of Y more symmetrical, promotes constant variance, and makes the relations more linear.
  1. The statement below calls the BOXCOX macro to transform SALARY in a model using YEARS, RUNS, and HITS as predictors:
    %boxcox(data=psy6140.baseball,
       resp=salary,
       model=years runs hits,
       id=name,
       gplot=RMSE EFFECT INFL);
    
  2. Use %webhelp(boxcox); to get information on the use of the boxcox macro. (%webhelp(?); gives some basic information on its use.)
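
A built-in alternative is the BOXCOX transformation in PROC TRANSREG. The sketch below is not the BOXCOX macro's method: TRANSREG chooses the power by maximum likelihood over a grid, and the grid used here (-2 to 2 in steps of 0.25) is an arbitrary choice for illustration.

```sas
proc transreg data=psy6140.baseball;
   /* Search the listed powers of SALARY; the predictors enter untransformed */
   model boxcox(salary / lambda=-2 to 2 by 0.25) = identity(years runs hits);
run;
```

The output lists the log likelihood for each power and flags the selected value, which can be compared with the power the macro suggests.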

Merging files

Sometimes you may want to combine variables from separate files. For example, say you wanted to use one or more of the variables (e.g., ATTHOME) in the TEAM file as predictors. To do this, sort both the BASEBALL and TEAM files by the TEAM variable, then use a MERGE statement in a DATA step:
title2 'Merge team at-home attendance with individual data';
proc sort data=psy6140.team out=team;   /* sorted copy in WORK */
   by team;
run;
proc sort data=baseball;
   by team;
run;
data basenew;
   merge baseball team(keep=team atthome);
   by team;
run;
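
A variant of the merge above uses IN= flags to control which observations are kept and to report players whose team has no match. This is a sketch under the assumption that WORK.BASEBALL and WORK.TEAM are already sorted by TEAM as shown; the flag names inplayer and inteam are made up for the illustration.

```sas
data basenew;
   merge baseball(in=inplayer) team(keep=team atthome in=inteam);
   by team;
   if inplayer;                   /* keep only rows that have a player record */
   /* Write a message to the log for players with no matching team record */
   if not inteam then put 'No team record for ' name= team=;
run;
```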

To merge the other way, incorporating variables from the individual file into the TEAM file, first use PROC SUMMARY to get the mean (or median) of the variables you want, then merge as above:

title2 'Average stats by team, merge with the team data';
proc summary data=baseball nway;
   class team;
   var runs homer;
   output out=means mean=;
run;

proc sort data=team;
   by team;
run;
data teamnew;
   merge team means(keep=team runs homer);
   by team;
run;
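
To use medians instead of means, only the OUTPUT statement changes. The sketch below assumes the WORK.BASEBALL dataset used above; the output dataset name medians is made up for the illustration.

```sas
proc summary data=baseball nway;
   class team;
   var runs homer;
   /* MEDIAN= with no names stores each median under the original variable name */
   output out=medians median=;
run;

/* Inspect the first few team-level rows before merging */
proc print data=medians(obs=5);
run;
```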

Created: Friday, October 27, 1995, 07:24 AM Updated 10/05/2012 15:06:58