Friday, 23 November 2012

SAS Class On Merge


                                  MERGE STATEMENTS

The MERGE statement joins rows from two or more SAS data sets into single rows in a new SAS data set.

Up to 50 data set names can appear in a merge statement.  There are two MERGE statements. 

One-to-One Merging 

The merge joins the first row in one data set with the first row in another. 

If a column occurs in more than one data set (such as class), only one column of that name occurs in the new data set.  The value of the column is the value in the data set listed latest (rightmost) in the MERGE statement that contains the column. 

The two data sets that are involved in the merge are listed below:

1.  data driver;                                                 2.  data vehicle;                                  
input name $ 1-6 city $ 8-16;                          input year 1-4 model $ 6-12;
datalines;                                                         datalines;
cathy               portland                                 1985    sedan
nancy              raleigh                                                1952    jeep    
sue                   nashville                                 1978    bus     
proc print;                                                        proc print;
title 'Data Set Driver';                                                                         title 'Data Set Vehicle';
run;                                                                                                                              run;     
                                                                             

OUTPUT ON TWO DATA SETS

    Data Set Driver                                         Data Set Vehicle
                                                                                                                                   
obs       name                city                              obs       year     model

1          cathy               portland                       1          1995    sedan
2                                                                      nancy               raleigh                         2          1982    jeep
3          sue                   nashville                      3          1990    bus
                                                                                                                                    






                                                       MERGING ONE TO ONE

Data match;
merge driver vehicle;
proc print;
title 'Data Set Match';

                                           OUTPUT OF MERGED SET--MATCH
__________________________________________                                                                                                   Data Set Match
            _____________________________________                                                                                
obs       name    city                  year     model

1          cathy   portland           1995    sedan
2                                              nancy   raleigh             1982    jeep
3          sue       nashville          1990    bus
                                                                                            

Match-Merging with a BY Statement

If you want to match rows from two or more data sets based on the values of some columns, use a BY statement.  In order to match merge, at least one column must be common to all data sets and each data set must be sorted by these columns.  An example is below:

data person;                                                     data place;
input name $ 1-4 sex $ 6;                                input name $1-4 city $ 6-10 region 12;
datalines;                                                         datalines;
mary  f                                                            jose erie 5
ann   f                                                                                                                         mary miami 2
tom   m                                                           mary tampa 7
proc print;                                                        ann tampa 6
title 'Data Set Person';                                     proc print;
title 'Data Set Place';

                                                    INPUT FOR MATCH-MERGE
 Data Set Person                                            Data Set Place           
obs       name    sex                                           obs       name    city      region

1          mary    f                                               1          jose      erie         5
2          ann      f                                               2          mary    miami     2
3          tom      m                                             3          mary    tampa     7
4          ann      tampa     6
                                                                                                                   

The program to merge these two data sets follows;

Data person;
  input name $ 1-4 sex $ 6;
proc sort; by name;

Data place;
input name $ 1-4 $ city $ 6-10 region 12;
proc sort; by name;

Data result;
merge person place; 
by name;
proc print;
title 'Data Set Result';

Data Set Result                                              
_______________________________                                                              
obs       name    sex       city      region

1          ann      f           tampa    6
2          jose                 erie         5
3          mary   f           miami     2
4          mary               tampa    7
5          tom      m                         .
                                                                                   

You are merging each row in the data set Person with rows in Place that has a matching value of the common column name.

Notice what happens when one of the data sets does not have a match for the other one.  There is data missing from one of the rows listed in the output.

Variations of the Merge Statement Using the (IN=) Column

Sometimes you will only want to keep rows that appear in all data sets being merged.  You can use the (IN= ) column.  A column is created with the name following the =.  The value of the column is 1 if the data set contributed data to the current row.  It is 0 otherwise.  The column is not added to the data set.







Examples of (IN=) Column
file1                                                                 file2
                                                                        ___________________________                                                
name               gpa      major                          name               sex       home   gpa

andrew           2.0       comsc                          andrew           m         r          2.5
graham           3.9       chem                           graham           m         o          2.0
helen               3.5       busad                          janet                f           r          3.4
janet                2.5       homec                         thomas            m         o          3.3
                                                                     _____________________________                                                  

Using the (IN=) on Both Data sets

Proc sort data=file1; by name;
Proc sort data=file2; by name;
data both;
  merge file1(in=in1) file2(in=in2);
by name;
if in1 and in2;
proc print;
title 'Data Set Both';

Data Set Both           
_________________________________________                                                                      
name               gpa      major              sex       home 

andrew           2.5       comsc              m         r
                                    graham           2.0       chem               m         o
                                    janet                3.4       homec f           r                                 
__________________________________________                                                                       

Note:  Only those names that appear in each data set are shown here.  The others are not included.  HELEN and THOMAS had no match and were not listed.

Using (IN=) with Only One Data Set

Other variations can be used also.  The (In= ) can be used with only one data set.  If used with the first data set, it will list all of those names in the first list even if there is no match. However, those in the second set who do not have a match in the first, will not be listed.
An example is:


Data both; merge file1(in=in1) file2;
by name;
if in1;
proc print;
                                                                  Data Set Both
          ____________________________________________                                                                                
name               gpa      major              sex       home

andrew           2.5       comsc              m         r
graham           2.0       chem               m         o
helen               3.5       busad
janet                3.4       homec f           r
__________________________________________________                                                                                             
                                                                    
Another variation is when names in the second list are listed even if there is no match.  Those in the first data set that do not have a match in the second will not be listed.  An example is:

Data both;
  Merge file1 file2(in=in2);  by name;
if in2;
proc print;
           ___________________________________________                                                                               
name               gpa      major              sex       home

andrew           2.5       comsc              m         r
graham           2.0       chem               m         o
janet                3.4       homec f           r
thomas            3.3                               m         o
_________________________________________________                                                                                          
Using the (IN=) to find Non-Matches in Each Data Set

You may also want to list only those that do not have matches in each data set. 

Data both;
  merge file1(in=in1) file2(in=in2); by name;
if in1 and in2 then delete;
proc print;     

                                                       __________________________________________                                                                                            
name               gpa      major              sex       home

helen               3.5       busad
thomas            3.3                               m         o
      ______________________________________________                                                                                   

Order of Data Sets Makes a Difference

You also want to make sure that you have the data sets in the order that you want them in.  You can reverse the data sets and have a completely different data set.  An example is:

Data both; 
  Merge year2 year1;
by name;
proc print;
                                                                     Data Both
 ___________________________________                                                                 
name                gpa      major               sex

andrew           2.0       comsc              m
graham           3.9       chem               m
helen               3.5       busad              f
janet                2.5       homec             f
                                                                 
                                              _________________________________________________________________
1 Reference for this material is from: McPherson, Iris, "Statistically Speaking," UCC Newsletter, Oklahoma State University, December 1989, pp. 7-10.

One-to-Many Match Merge

            Sometimes you need to combine two data sets by matching one observation from one data set with more than one observation in another.  An example of this is a data set with employees listed in it. There are two columns (LID) – which is Location of company, and TID- which matches up with a Title Identification.  For this example, the Location information will be matched.  See the employeeid data set below:   





      IdNo        Last     First     LID TID Salary Sex Performance
000-01-0000 Milgrom  Pamela    L02 T02  57500 F Average
000-02-2222 Adams    Jennifer  L01 T04  19500 F Average
111-12-1111 Johnson  James     L03 T03  85000 M Good
123-45-6789 Coulter  Tracey    L01 T03 100000 F Good
222-23-2222 Marlin   Billy     L04 T03 125000 M Good
222-52-5555 Smith    Mary      L03 T02  42500 F Average
333-34-3333 Manin    Ann       L02 T01  49500 F Average
333-43-4444 Smith    Frank     L01 T01  65000 M Good
333-66-1234 Brown    Marietta  L01 T04  18500 F Poor
444-45-4444 Frank    Vernon    L04 T01  75000 M Good
555-22-3333 Rubin    Patricia  L02 T01  45000 F Average
555-56-5555 Charles  Kenneth   L02 T02  40000 M Poor
776-67-6666 Adamson  David     L03 T01  52000 M Poor
777-78-7777 Marder   Kelly     L03 T02  38500 F Average
164-01-0999 Flint    Arnold    L03 T02  40000 M Average 
164-24-7865 Erpf     Dan       L04 T01  51800 M Average 
166-54-6534 Drubin   Lolly     L01 T02  37000 F Good    
167-60-1111 Bocholis Sue       L02 T03 115000 F Good    
168-50-2233 Newcomm  Henry     L03 T02  30900 M Average 

The LocationID data set looks like this:

      LID  City       Address             ST Zip   Area Phone
L01 Atlanta     450 Peachtree Road  GA 30316 404 333-5555
L02 Boston      3 Commons Blvd      MA 02190 617 123-4444
L03 Chicago     500 Loop Highway    IL 60620 312 444-6666
L04 Miami       1000 Kirkman Road   FL 32801 407 555-5555

These data sets are merged the same way you would do a one-to-one match merge.  Before merging, the data sets need to be sorted by the common unique columns in them.  You must have a BY statement.   HOWEVER, you do NOT use the IN= in this merge as you did in the Match-merge. 

data onetomany;
infile 'a:\employeeid.dat' truncover;
input IdNo $ 1-11 Last $13-21 First $ 22-31  LID $ 32-34
Salary 40-45 Sex $ 47 Performance $ 49-55; 
proc sort;
   by LID;
proc print;
Title 'EMPLOYEE DATA SET SORTED BY LOCATION ID';
run;


data locations;
infile 'a:\locationid.dat';
input LID $ 1-3 City $5-16 Address $ 17-35 State $ 37-38 Zip 40-44  area 46-48 Phone $ 50-57;
proc sort;
  by LID;
proc print;
Title 'Location Id for Employees ';
run;

data locationmatch;
merge onetomany locations;
by LID;
proc print;
  var last first LID city address state;
title 'EMPLOYEES LISTED WITH THEIR LOCATIONS';
run;


The output from this program follows:


EMPLOYEES LISTED WITH THEIR LOCATIONS

      Obs    Last        First    LID     City           Address          State

       1    Adams       Jennifer  L01    Atlanta    450 Peachtree Road     GA
       2    Coulter     Tracey    L01    Atlanta    450 Peachtree Road     GA
       3    Smith       Frank     L01    Atlanta    450 Peachtree Road     GA
       4    Brown       Marietta  L01    Atlanta    450 Peachtree Road     GA
       5    Drubin      Lolly     L01    Atlanta    450 Peachtree Road     GA
       6    Milgrom     Pamela    L02    Boston     3 Commons Blvd         MA
       7    Manin       Ann       L02    Boston     3 Commons Blvd         MA
       8    Rubin       Patricia  L02    Boston     3 Commons Blvd         MA
       9    Charles     Kenneth   L02    Boston     3 Commons Blvd         MA
      10    Bocholis    Sue       L02    Boston     3 Commons Blvd         MA
      11    Johnson     James     L03    Chicago    500 Loop Highway       IL
      12    Smith       Mary      L03    Chicago    500 Loop Highway       IL
      13    Adamson     David     L03    Chicago    500 Loop Highway       IL
      14    Marder      Kelly     L03    Chicago    500 Loop Highway       IL
      15    Flint       Arnold    L03    Chicago    500 Loop Highway       IL
      16    Newcomm     Henry     L03    Chicago    500 Loop Highway       IL
      17    Marlin      Billy     L04    Miami      1000 Kirkman Road      FL
      18    Frank       Vernon    L04    Miami      1000 Kirkman Road      FL
      19    Erpf        Dan       L04    Miami      1000 Kirkman Road      FL


            The locations data set is separate from the names of the employees because if one of the locations changes, then only the locations data set needs to be updated.  This means that each employee that lives in that location in the data set won’t have to be changed separately.  Look at this next example where L04 will be changed to a new location.   It is going to move to Phoenix.  The change is only made in the Locations data set.



LID  City       Address             ST Zip   Area Phone
L01 Atlanta     450 Peachtree Road  GA 30316 404 333-5555
L02 Boston      3 Commons Blvd      MA 02190 617 123-4444
L03 Chicago     500 Loop Highway    IL 60620 312 444-6666
L04 Phoenix     1801 Sunny Lane     AR 78210 602 881-9978

Notice now when the program is run again, the employees in location L04 are now located in Phoenix instead of Miami.     


             EMPLOYEES LISTED WITH THEIR NEW LOCATIONS (L04)

Obs    Last        First   LID     City           Address          State

       1    Adams       Jennifer  L01    Atlanta    450 Peachtree Road     GA
       2    Coulter     Tracey    L01    Atlanta    450 Peachtree Road     GA
       3    Smith       Frank     L01    Atlanta    450 Peachtree Road     GA
       4    Brown       Marietta  L01    Atlanta    450 Peachtree Road     GA
       5    Drubin      Lolly     L01    Atlanta    450 Peachtree Road     GA
       6    Milgrom     Pamela    L02    Boston     3 Commons Blvd         MA
       7    Manin       Ann       L02    Boston     3 Commons Blvd         MA
       8    Rubin       Patricia  L02    Boston     3 Commons Blvd         MA
       9    Charles     Kenneth   L02    Boston     3 Commons Blvd         MA
      10    Bocholis    Sue       L02    Boston     3 Commons Blvd         MA
      11    Johnson     James     L03    Chicago    500 Loop Highway       IL
      12    Smith       Mary      L03    Chicago    500 Loop Highway       IL
      13    Adamson     David     L03    Chicago    500 Loop Highway       IL
      14    Marder      Kelly     L03    Chicago    500 Loop Highway       IL
      15    Flint       Arnold    L03    Chicago    500 Loop Highway       IL
      16    Newcomm     Henry     L03    Chicago    500 Loop Highway       IL
      17    Marlin      Billy     L04    Phoenix    1801 Sunny Lane         AR
      18    Frank       Vernon    L04    Phoenix    1801 Sunny Lane        AR
      19    Erpf        Dan       L04    Phoenix    1801 Sunny Lane        AR

      The merge statement is one that you should become familiar with as there are many times that merging data sets together may be used.





No comments:

Post a Comment