Friday, 23 November 2012

SAS Programming



Many businesses and individuals need to analyze data in order to make better decisions.  As businesses become more complex, they generate more information that needs to be examined.  Students doing research projects also need to collect information in order to analyze it.

The amount of data is increasing daily, driven in part by the Internet and e-commerce.  Companies that do business online want to collect information to see what types of people use their Web site.  They can also look at how people use the site in order to make its information easier to find.  They also have to provide round-the-clock service.  They need software to help them analyze their data.

In 1976, SAS Institute Inc., a privately held corporation, was formed.  The product at that time was known as the "Statistical Analysis System."  It grew in popularity and capability and was widely used by academic groups.  These users needed a software package that would do statistical calculations easily, and they were not necessarily programmers.  SAS can be used without knowing much about programming, but it is also a very sophisticated language, and much more can be done with it.

SAS Institute has grown into the world's largest privately held software company.  Continual product line expansion and diversification of clientele have resulted in SAS products being used by over 40,000 customer sites in 50 countries.  There are 3.5 million users of SAS products.  Part of the reason for the continual growth is that SAS Institute works with the end user to improve its product.  It offers solutions for data warehousing, data mining, data visualization, and applications development.  The name SAS now refers to the SAS System rather than to the "Statistical Analysis System." (1)

SAS is used in many different types of businesses including banking, manufacturing, government, insurance, telecommunications/utilities, sales and services and healthcare. (1)

SAS Institute is located in Cary, North Carolina.  It is a worldwide company with business in Asia Pacific, Latin America, Europe, the Middle East, and Africa.  SAS also has a good employee retention rate of 96%.  It is also a family-oriented company and is friendly to working women (1).
 
1.  This information was obtained from the SAS web site at http://www.sas.com.

The SAS System is an applications system that can be used as
1). a statistical package,
2). a data base management system and
3). a high level programming language.



When people want some kind of information, they usually start with an application for data.  An applications system is software that gives you the tools you need to make the data useful and meaningful.                                          

In order to be useful, an applications system should
1). give you total control of your data,
2). facilitate applications that run in more than one computing environment, and
3). accommodate varying skill levels of potential users.
SAS can do all of these.

Some types of data that may be collected are:
* Payroll and employment data
* Student data and class data
* Research data
* Medical data
* Inventory data and sales data
* Web data on customers
* Data from areas such as physical science, social science, business, and agriculture

With any body of data, you must perform four basic tasks to make it useful and meaningful:

ACCESS -- First, you access the data through the SAS System.

MANAGE -- Update, rearrange, combine, edit, or subset the data before analyzing it.

ANALYZE -- Analysis ranges from simple descriptive statistics to more advanced or specialized analyses for econometrics and forecasting, statistical design, computer performance evaluation, and operations research.

PRESENT -- Presentation capabilities range from simple lists and tables to multidimensional plots to elaborate full-color graphics, both on paper and on your display.
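
The four tasks can be sketched as one short SAS program. This is only an illustration, using three rows of the health-club data that appears later in these notes. The table name YELLOW, the computed column LOSS, and the choice of PROC MEANS for the analysis step are assumptions made for this sketch and are not part of the original notes.

```sas
/* ACCESS: read raw data lines into a table named HTWT */
data htwt;
   input name $ 1-18 sex $ 20 age 22-23 stwgt 25-27 endwgt 29-31
         height 33-34 team $ 36-41;
   datalines;
Charlene Armstrong F 35 152 139 66 Yellow
David Shaw         M 27 189 165 68 Red
Amelia Serrano     F 50 145 124 65 Yellow
;
run;

/* MANAGE: keep only the Yellow team and add a computed column */
data yellow;                 /* hypothetical table name */
   set htwt;
   where team = 'Yellow';
   loss = stwgt - endwgt;    /* hypothetical column: weight lost */
run;

/* ANALYZE: simple descriptive statistics */
proc means data=yellow n mean min max;
   var stwgt endwgt loss;
run;

/* PRESENT: list the resulting rows */
proc print data=yellow;
   var name stwgt endwgt loss;
run;
```

Each step reads the table produced by the step before it, so the same data never has to be re-entered.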

SAS is also portable across computing environments.  A computing environment is determined by the HARDWARE and the host OPERATING SYSTEM running on it.  SAS can be used on IBM mainframes, on UNIX-based machines, and on personal computers using Windows.

PORTABILITY means that SAS applications:
* Function the same
* Look the same
* Produce the same results
on mainframes, minicomputers, or microcomputers.

You can develop SAS applications in one environment and run them in other environments without rewriting the programs.

SAS is a powerful programming language that includes a collection of ready-to-use programs called procedures.  It supports a wide variety of applications, from general-purpose data processing to highly specialized analysis in diverse application areas.

                                          INTRODUCTION TO PROGRAMMING
                                                 USING BASE SAS SOFTWARE

The SAS System is a software system composed of computer programs that work together to perform specific tasks.  The system reads data, such as letters or numbers, in various forms and organizes them in a SAS data set or Table.  Today the term Table is often used instead of data set because it is more consistent with the relational databases in use today.  The two names are used interchangeably.  In these notes, we will try to use Table instead of data set.

A Table stores data in a form the system can identify and manage as a unit. 

Once data is organized into a Table, you can access, analyze, revise, and display the data using one computer program.  You do not need to prepare separate programs for different tasks.

These are the parts of a Table:

DATA VALUE -- A single unit of information, such as a person's height.  Each of the items recorded is a data value.

COLUMN or Variable -- A set of data values that describes a specific characteristic, for example the heights of all individuals in a group.  The age values make up the AGE column.  Columns used to be called Variables; Column will be used in these notes.

SAS data types are classified as CHARACTER or NUMERIC.

CHARACTER columns contain data values consisting of:
* Combination of letters of the alphabet
* Numbers (such as an id number or a zip code) (These are not used in calculations)
* And special characters or symbols

NUMERIC columns contain:
* Numbers and related symbols, such as decimal points, plus signs, and minus signs
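
The difference between the two types can be seen in a small sketch. The table MEMBERS and its columns ID and ZIP are hypothetical names chosen for illustration. A dollar sign ($) after a name in the INPUT statement marks the column as CHARACTER; names without it are NUMERIC.

```sas
/* ID and ZIP hold digits but are never used in calculations, */
/* so they are read as CHARACTER; AGE is read as NUMERIC.     */
data members;                       /* hypothetical table */
   input id $ 1-4 zip $ 6-10 age 12-13;
   datalines;
0042 02134 35
0107 90210 27
;
run;

/* PROC CONTENTS lists each column with its type (Char or Num) */
proc contents data=members;
run;

/* Leading zeros in ID and ZIP are kept because they are text */
proc print data=members;
run;
```

Reading an id number or zip code as character keeps it exactly as it was typed, including any leading zeros.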




                              A TABLE

NAME                SEX  AGE  STWGT  ENDWGT  HEIGHT  TEAM
Charlene Armstrong  F    35   152    139     66      Yellow
David Shaw          M    27   189    165     68      Red
Amelia Serrano      F    50   145    124     65      Yellow
Ann Nance           F    31   210    192     72      Red
Ravi Sinha          M    48   194    177     67      Yellow
Ashley McKnight     F    26   127    118     62      Red
Jim Brown           M    41   220            73      Yellow
Susan Stewart       F    29   135    126     63      Red
Rose Collins        F    37   155    141     67      Blue
Jason Schock        M    28   187    172     77      Green
Kanoko Nagasaka     F    46   135    126     63      Blue
Richard Rose        M    33   181    166     72      Green
David Sims          M    50   280    300     70      Blue
Elizabeth Sims      F    48   300    200     65      Green
Tim Jones           M    35   280    168     70      Blue
Larry Goss          M    21   188    174     73      Green
Asha Garg           M    56   148    132     61      Yellow
Jennifer Brooks     F    42   208    165     72      Red

In this table, each vertical set of values (such as AGE) is a Column (or Variable), each horizontal line is a Row (or Observation), and each individual entry is a Data Value.



COLUMN NAMES -- In the older versions of SAS, column names could only be 1 to 8 characters long.  With Version 8 of SAS (which is what you will be using in class), the rules have changed.  Column names:

* Can be 32 or fewer characters in length
* MUST begin with a letter or underscore (_)
* Can contain only letters, numbers, or underscores after the first character (do not use %$!*&#@)
* CANNOT contain blanks
* Should be descriptive and reflect the contents of each set of data values
* Can contain upper and lowercase letters
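
As a small sketch of these rules, the step below creates a one-row table whose column names are all valid. The table and column names are hypothetical and were chosen only to illustrate the rules.

```sas
data naming_demo;             /* hypothetical table name */
   start_weight   = 152;      /* letters and an underscore */
   _temp1         = 1;        /* begins with an underscore, contains a digit */
   HeightInInches = 66;       /* mixed case, fewer than 32 characters */
run;

proc print data=naming_demo;
run;
```

A name such as 1stweight (begins with a number) or start weight (contains a blank) breaks the rules and would cause an error in the SAS Log.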

ROW or Observation -- A set of data values for the SAME ITEM, for example all physical measurements for one person.  There are 18 rows in our table above.  Each row contains the name, sex, age, stwgt, endwgt, height, and team for one person.

MISSING VALUES -- Represent missing or unavailable data values to the SAS System.  When data is printed out, a missing numeric value is shown as a period (.) and a missing character value is shown as a blank.
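
A minimal sketch of how missing values appear, using a hypothetical table MISSING_DEMO with one missing numeric value and one missing character value. The column positions are assumptions made for this example.

```sas
data missing_demo;            /* hypothetical table name */
   input name $ 1-10 endwgt 12-14 team $ 16-21;
   datalines;
Jim Brown      Yellow
Ann Nance  192
;
run;

/* In the printed output, ENDWGT for Jim Brown appears as a period */
/* and TEAM for Ann Nance appears as a blank.                      */
proc print data=missing_demo;
run;
```

In the first data line, columns 12-14 are blank, so ENDWGT is missing. In the second, nothing is entered in columns 16-21, so TEAM is missing.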

                                      ENTERING DATA INTO THE COMPUTER

A computer program without data is of no value.  One of the first steps is to know how to enter data in a form that the computer can read. 

You may want to conduct a study analyzing specific physical data on a group of people who belong to a health club.  The first step is to figure out what information you will need.  You would then collect it from the people who are in the study, for example by having each member fill out a form with the information the health club wants to analyze.

Next, someone would enter the data in a form the computer can read.  The SAS system allows you to enter data using different methods. 

One method is to enter the data so that each data value falls in specified columns of the line.  This method is called COLUMN INPUT.  This is the most common method.

In the following example, data is entered in these columns:
NAME 1-18, SEX 20, AGE 22-23, STWGT 25-27, ENDWGT 29-31, HEIGHT 33-34, TEAM 36-41

---------1---------2---------3---------4------
Charlene Armstrong F 35 152 139 66 Yellow
David Shaw         M 27 189 165 68 Red
Amelia Serrano     F 50 145 124 65 Yellow
Ann Nance          F 31 210 192 72 Red
Ravi Sinha         M 48 194 177 67 Yellow
Ashley McKnight    F 26 127 118 62 Red
Jim Brown          M 41 220     73 Yellow
Susan Stewart      F 29 135 126 63 Red
Rose Collins       F 37 155 141 67 Blue
Jason Schock       M 28 187 172 77 Green
Kanoko Nagasaka    F 46 135 122 66 Blue
Richard Rose       M 33 181 166 72 Green
David Sims         M 50 280 300 70 Blue
Elizabeth Sims     F 48 300 200 65 Green
Tim Jones          M 35 280 168 70 Blue
Larry Goss         M 21 188 174 73 Green
Asha Garg          M 56 148 132 61 Yellow
Jennifer Brooks    F 42 208 165 72 Red

You can also enter data by separating each value with a space.  This method is referred to as LIST INPUT.  A missing numeric value is entered as a period (.), as in the row for Jim Brown below.

Charlene Armstrong F 35 152 139 66 Yellow
David Shaw M 27 189 165 68 Red
Amelia Serrano F 50 145 124 65 Yellow
Ann Nance F 31 210 192 72 Red
Ravi Sinha M 48 194 177 67 Yellow
Ashley McKnight F 26 127 118 62 Red
Jim Brown M 41 220 . 73 Yellow
Susan Stewart F 29 135 126 63 Red
Rose Collins F 37 155 141 67 Blue
Jason Schock M 28 187 172 77 Green
Kanoko Nagasaka F 46 135 122 66 Blue
Richard Rose M 33 181 166 72 Green
David Sims M 50 280 300 70 Blue
Elizabeth Sims F 48 300 200 65 Green
Tim Jones M 35 280 168 70 Blue
Larry Goss M 21 188 174 73 Green
Asha Garg M 56 148 132 61 Yellow
Jennifer Brooks F 42 208 165 72 Red

You will learn more about the differences between the two types of input later.
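
As a preview, here is a sketch of an INPUT statement for the list-input lines above. In simple list input, a blank ends a value, so a name such as "Charlene Armstrong" is read as two values. This sketch therefore assumes separate FIRST and LAST columns, which are hypothetical names not used elsewhere in these notes. The LENGTH statement is included because list input otherwise stores only 8 characters for a character column, which would cut off "Armstrong".

```sas
data htwt_list;                              /* hypothetical table name */
   length first last $ 12 sex $ 1 team $ 6;  /* room for longer names */
   input first $ last $ sex $ age stwgt endwgt height team $;
   datalines;
Charlene Armstrong F 35 152 139 66 Yellow
David Shaw M 27 189 165 68 Red
Jim Brown M 41 220 . 73 Yellow
;
run;

proc print data=htwt_list;
run;
```

The period in the third data line holds the place of the missing ENDWGT value. Without it, SAS would read 73 as ENDWGT and the values after it would shift into the wrong columns.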

SELECTING TASKS FOR PROGRAMS

Before you write a program, you need to determine what tasks you want the SAS system to perform.  For example, you may want to print out the data set, you may want to produce a graph, or a plot, or add other information to the table.




                                                              SAS PROGRAMS

A SAS program is a group of step-by-step instructions, known as SAS statements, that instruct the computer to perform specific tasks.


                                                   PARTS OF A SAS PROGRAM

/* SAS statements: name the table and describe the data lines */
data htwt;
   input name $ 1-18 sex $ 20 age 22-23 stwgt 25-27 endwgt 29-31
         height 33-34 team $ 36-41;
   datalines;
Charlene Armstrong F 35 152 139 66 Yellow
David Shaw         M 27 189 165 68 Red
Amelia Serrano     F 50 145 124 65 Yellow
Ann Nance          F 31 210 192 72 Red
Ravi Sinha         M 48 194 177 67 Yellow
Ashley McKnight    F 26 127 118 62 Red
Jim Brown          M 41 220     73 Yellow
Susan Stewart      F 29 135 126 63 Red
Rose Collins       F 37 155 141 67 Blue
Jason Schock       M 28 187 172 77 Green
Kanoko Nagasaka    F 46 135 122 66 Blue
Richard Rose       M 33 181 166 72 Green
David Sims         M 50 280 300 70 Blue
Elizabeth Sims     F 48 300 200 65 Green
Tim Jones          M 35 280 168 70 Blue
Larry Goss         M 21 188 174 73 Green
Asha Garg          M 56 148 132 61 Yellow
Jennifer Brooks    F 42 208 165 72 Red
;
run;

/* SAS statements: print the table, then plot height against starting weight */
proc print data=htwt;
run;
proc plot data=htwt;
   plot height*stwgt;
run;

SAS statements usually begin with a SAS keyword that identifies the type of statement being used.  Common SAS keywords are DATA, INPUT and PROC.  The remainder of the statement contains additional information required for the system to perform the task.
Note:

* All SAS statements end with a semicolon (;).
* SAS statements can begin in any column on a line.
* Individual statements can occupy one line or can extend across several lines.

However, it is easier to read and follow the program when each statement starts on its own line.  The two examples below are treated the same by SAS:

data htwt;
   input name $ 1-18 sex $ 20 age 22-23 stwgt 25-27 endwgt 29-31
         height 33-34 team $ 36-41;
   datalines;

    or

data htwt; input name $ 1-18 sex $ 20 age 22-23 stwgt 25-27 endwgt 29-31
height 33-34 team $ 36-41; datalines;

STATEMENTS IN SAS PROGRAMS

LIBNAME      libname cs2331 'C:\CS2331';

This statement assigns a short name (a libref) to a folder that serves as a SAS library of saved tables for use in future programs.  A libref can be at most 8 characters long, so cs2331 is used here.  The folder name must be enclosed in plain (straight) quotation marks.  Using a library allows tables created from past inputs to be used in new analyses and programs without reprocessing the original data with DATA statements in the new program.  An example would be:

libname cs2331 'C:\CS2331';
data cs2331.htwt;
   input name $ 1-18 sex $ 20 age 22-23 stwgt 25-27 endwgt 29-31
         height 33-34 team $ 36-41;

(The DATALINES statement, the data lines, and the RUN statement follow, as in the full program shown earlier.)

The above example would save a copy of the table htwt in the folder CS2331 on the C: drive of the PC.  The table could be used in a future program merely by including the LIBNAME statement and a reference to the table, such as:

libname cs2331 'C:\CS2331';
proc print data=cs2331.htwt;
run;


DATA      data htwt;

The first statement is a DATA statement.  A DATA statement instructs the SAS System to read data and organize them into a SAS Table or data set.  A DATA statement consists of the keyword DATA and a user-supplied table name.  Usually this name should reflect what you are doing in this DATA step.  In other words, make it meaningful to help you and others who will have to look at the program later.

Table names can be in uppercase, lowercase, or mixed case; they are not case sensitive.  The names HTWT, htwt, and HtWt are all the same to SAS.

INPUT   input name $ 6-23 team $ 25-30 stwgt 32-34 endwgt 36-38 sex $ 40
age 42-43 height 45-46;

The second statement is an INPUT statement.  It provides the information the SAS System requires to organize data into a SAS table.  The INPUT statement begins with the keyword INPUT and contains a user-supplied list of column names, types, and, if necessary, column locations.  In this case, there are seven column names.

Notice that NAME, TEAM, and SEX are followed by a dollar sign ($).  This symbol indicates that NAME, TEAM, and SEX are character columns, with values containing alphabetic characters.  The other columns are numeric.

NOTE:   Column names in the INPUT statement can be either lower case or upper case.  However, when the information is printed out, the names above the columns appear exactly as they were entered in the INPUT statement.  For example, with the INPUT statement above, the output would look like the following:

Obs    name                  team      stwgt    endwgt    sex    age    height

1      CHARLENE ARMSTRONG    YELLOW    152      139       F      35     66
2      DAVID SHAW            RED       189      165       M      27     68
3      AMELIA SERRANO        YELLOW    145      124       F      50     65
4      ANN  NANCE            RED       210      192       F      31     72

However, if you use uppercase for some of the column names in the INPUT statement, those names will be printed in capital letters on the output.  This is the INPUT statement:

input NAME $ 6-23 team $ 25-30 STWGT 32-34 ENDWGT 36-38 sex $ 40
      age 42-43 height 45-46;

This is an example of how the output would look:

Obs    NAME                  team      STWGT    ENDWGT    sex    age    height

1      CHARLENE ARMSTRONG    YELLOW    152      139       F      35     66
2      DAVID SHAW            RED       189      165       M      27     68
3      AMELIA SERRANO        YELLOW    145      124       F      50     65
4      ANN  NANCE            RED       210      192       F      31     72


DATALINES

The DATALINES statement indicates that the data lines follow in the program.  A single semicolon marks the end of the data lines.  There are other ways to insert data in a program.  They will be discussed later.

RUN

The RUN statement instructs the system to execute the previous statements.  Although the SAS system does not always require a RUN statement after the datalines and semicolon, it is recommended that you include a RUN statement in this section of your programs.  When you use the PC versions of SAS, you need to use the RUN statement. 

PROC PRINT proc print; or proc print data=htwt;

The PROC PRINT statement instructs the SAS system to print data.  PRINT is a procedure, a prewritten computer program that analyzes and processes data. 



A PROC statement consists of the keyword PROC and the procedure name, such as PRINT.  You can also supply an option such as DATA=.  The DATA= option specifies the name of the table to process.

NOTE:  If you omit DATA=, SAS automatically uses the most recently created SAS table.  The DATA= option enables you to override this default and specify a table of your choice.
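
The default can be seen in a small sketch. The table names FIRST and SECOND and their columns are hypothetical and exist only to show which table PROC PRINT selects.

```sas
data first;        /* hypothetical table, created first */
   x = 1;
run;

data second;       /* hypothetical table, created last */
   y = 2;
run;

/* No DATA= option: SAS prints SECOND, the most recently created table */
proc print;
run;

/* DATA= overrides the default and prints FIRST instead */
proc print data=first;
run;
```

Writing DATA= on every PROC statement makes a program easier to follow, because the reader does not have to work out which table was created last.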

PROC PLOT proc plot data=htwt;  plot height*stwgt;

The PROC PLOT statement requests a plot of the data.  The PLOT statement provides the details required to produce the plot you want.  The column HEIGHT will be on the vertical axis and the column STWGT will be on the horizontal axis.
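
The sketch below shows the vertical*horizontal form of the PLOT statement on four rows of the data. The table name HTWT_SMALL is hypothetical. The second PLOT request goes a step beyond these notes: it names a column after an equal sign so that the first character of each SEX value (F or M) is used as the plotting symbol.

```sas
data htwt_small;                    /* hypothetical table name */
   input name $ 1-18 sex $ 20 stwgt 25-27 height 33-34;
   datalines;
Charlene Armstrong F 35 152 139 66 Yellow
David Shaw         M 27 189 165 68 Red
Amelia Serrano     F 50 145 124 65 Yellow
Ann Nance          F 31 210 192 72 Red
;
run;

proc plot data=htwt_small;
   plot height*stwgt;        /* HEIGHT vertical, STWGT horizontal */
   plot height*stwgt=sex;    /* same axes, points marked F or M */
run;
```

Only the columns named in the INPUT statement are read, so the other values on each data line are ignored.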

SAS OUTPUT

You will receive output from the program.  The first part of it will be the SAS Log, which displays the SAS statements you submitted and contains SAS System messages about the execution of the program.

The PROC PRINT statement produces the first page of output.  It automatically displays the observation number of each row in the first column of output (labeled Obs).  The column names are also supplied by the program.

The PROC PLOT output will be on a separate page.  It will show the height-weight points for the rows of input.