SAS E-Learning: SAS Class On Merge

MERGE STATEMENTS

The MERGE statement joins rows from two or more SAS data sets into single rows in a new SAS data set.

Up to 50 data set names can appear in a merge statement. There are two MERGE statements.

One-to-One Merging

The merge joins the first row in one data set with the first row in another.

If a column occurs in more than one data set (such as class), only one column of that name occurs in the new data set. The value of the column is the value in the data set listed latest (rightmost) in the MERGE statement that contains the column.

The two data sets that are involved in the merge are listed below:

1. data driver; 2. data vehicle;

input name $ 1-6 city $ 8-16; input year 1-4 model $ 6-12;

datalines; datalines;

cathy portland 1985 sedan

nancy raleigh 1952 jeep

sue nashville 1978 bus

proc print; proc print;

title 'Data Set Driver'; title 'Data Set Vehicle';

run; run;

OUTPUT ON TWO DATA SETS

Data Set Driver Data Set Vehicle

obs name city obs year model

1 cathy portland 1 1995 sedan

2 nancy raleigh 2 1982 jeep

3 sue nashville 3 1990 bus

MERGING ONE TO ONE

Data match;

merge driver vehicle;

proc print;

title 'Data Set Match';

OUTPUT OF MERGED SET--MATCH

__________________________________________ Data Set Match

_____________________________________

obs name city year model

1 cathy portland 1995 sedan

2 nancy raleigh 1982 jeep

3 sue nashville 1990 bus

Match-Merging with a BY Statement

If you want to match rows from two or more data sets based on the values of some columns, use a BY statement. In order to match merge, at least one column must be common to all data sets and each data set must be sorted by these columns. An example is below:

data person; data place;

input name $ 1-4 sex $ 6; input name $1-4 city $ 6-10 region 12;

datalines; datalines;

mary f jose erie 5

ann f mary miami 2

tom m mary tampa 7

proc print; ann tampa 6

title 'Data Set Person'; proc print;

title 'Data Set Place';

INPUT FOR MATCH-MERGE

Data Set Person Data Set Place

obs name sex obs name city region

1 mary f 1 jose erie 5

2 ann f 2 mary miami 2

3 tom m 3 mary tampa 7

4 ann tampa 6

The program to merge these two data sets follows;

Data person;

input name $ 1-4 sex $ 6;

proc sort; by name;

Data place;

input name $ 1-4 $ city $ 6-10 region 12;

proc sort; by name;

Data result;

merge person place;

by name;

proc print;

title 'Data Set Result';

Data Set Result

_______________________________

obs name sex city region

1 ann f tampa 6

2 jose erie 5

3 mary f miami 2

4 mary tampa 7

5 tom m .

You are merging each row in the data set Person with rows in Place that has a matching value of the common column name.

Notice what happens when one of the data sets does not have a match for the other one. There is data missing from one of the rows listed in the output.

Variations of the Merge Statement Using the (IN=) Column

Sometimes you will only want to keep rows that appear in all data sets being merged. You can use the (IN= ) column. A column is created with the name following the =. The value of the column is 1 if the data set contributed data to the current row. It is 0 otherwise. The column is not added to the data set.

Examples of (IN=) Column

file1 file2

___________________________

name gpa major name sex home gpa

andrew 2.0 comsc andrew m r 2.5

graham 3.9 chem graham m o 2.0

helen 3.5 busad janet f r 3.4

janet 2.5 homec thomas m o 3.3

_____________________________

Using the (IN=) on Both Data sets

Proc sort data=file1; by name;

Proc sort data=file2; by name;

data both;

merge file1(in=in1) file2(in=in2);

by name;

if in1 and in2;

proc print;

title 'Data Set Both';

Data Set Both

_________________________________________

name gpa major sex home

andrew 2.5 comsc m r

graham 2.0 chem m o

janet 3.4 homec f r

__________________________________________

Note: Only those names that appear in each data set are shown here. The others are not included. HELEN and THOMAS had no match and were not listed.

Using (IN=) with Only One Data Set

Other variations can be used also. The (In= ) can be used with only one data set. If used with the first data set, it will list all of those names in the first list even if there is no match. However, those in the second set who do not have a match in the first, will not be listed.

An example is:

Data both; merge file1(in=in1) file2;

by name;

if in1;

proc print;

Data Set Both

____________________________________________

name gpa major sex home

andrew 2.5 comsc m r

graham 2.0 chem m o

helen 3.5 busad

janet 3.4 homec f r

__________________________________________________

Another variation is when names in the second list are listed even if there is no match. Those in the first data set that do not have a match in the second will not be listed. An example is:

Data both;

Merge file1 file2(in=in2); by name;

if in2;

proc print;

___________________________________________

name gpa major sex home

andrew 2.5 comsc m r

graham 2.0 chem m o

janet 3.4 homec f r

thomas 3.3 m o

_________________________________________________

Using the (IN=) to find Non-Matches in Each Data Set

You may also want to list only those that do not have matches in each data set.

Data both;

merge file1(in=in1) file2(in=in2); by name;

if in1 and in2 then delete;

proc print;

__________________________________________

name gpa major sex home

helen 3.5 busad

thomas 3.3 m o

______________________________________________

Order of Data Sets Makes a Difference

You also want to make sure that you have the data sets in the order that you want them in. You can reverse the data sets and have a completely different data set. An example is:

Data both;

Merge year2 year1;

by name;

proc print;

Data Both

___________________________________

name gpa major sex

andrew 2.0 comsc m

graham 3.9 chem m

helen 3.5 busad f

janet 2.5 homec f

^{_________________________________________________________________}

¹ Reference for this material is from: McPherson, Iris, "Statistically Speaking," UCC Newsletter, Oklahoma State University, December 1989, pp. 7-10.

One-to-Many Match Merge

Sometimes you need to combine two data sets by matching one observation from one data set with more than one observation in another. An example of this is a data set with employees listed in it. There are two columns (LID) – which is Location of company, and TID- which matches up with a Title Identification. For this example, the Location information will be matched. See the employeeid data set below:

IdNo Last First LID TID Salary Sex Performance

000-01-0000 Milgrom Pamela L02 T02 57500 F Average

000-02-2222 Adams Jennifer L01 T04 19500 F Average

111-12-1111 Johnson James L03 T03 85000 M Good

123-45-6789 Coulter Tracey L01 T03 100000 F Good

222-23-2222 Marlin Billy L04 T03 125000 M Good

222-52-5555 Smith Mary L03 T02 42500 F Average

333-34-3333 Manin Ann L02 T01 49500 F Average

333-43-4444 Smith Frank L01 T01 65000 M Good

333-66-1234 Brown Marietta L01 T04 18500 F Poor

444-45-4444 Frank Vernon L04 T01 75000 M Good

555-22-3333 Rubin Patricia L02 T01 45000 F Average

555-56-5555 Charles Kenneth L02 T02 40000 M Poor

776-67-6666 Adamson David L03 T01 52000 M Poor

777-78-7777 Marder Kelly L03 T02 38500 F Average

164-01-0999 Flint Arnold L03 T02 40000 M Average

164-24-7865 Erpf Dan L04 T01 51800 M Average

166-54-6534 Drubin Lolly L01 T02 37000 F Good

167-60-1111 Bocholis Sue L02 T03 115000 F Good

168-50-2233 Newcomm Henry L03 T02 30900 M Average

The LocationID data set looks like this:

LID City Address ST Zip Area Phone

L01 Atlanta 450 Peachtree Road GA 30316 404 333-5555

L02 Boston 3 Commons Blvd MA 02190 617 123-4444

L03 Chicago 500 Loop Highway IL 60620 312 444-6666

L04 Miami 1000 Kirkman Road FL 32801 407 555-5555

These data sets are merged the same way you would do a one-to-one match merge. Before merging, the data sets need to be sorted by the common unique columns in them. You must have a BY statement. HOWEVER, you do NOT use the IN= in this merge as you did in the Match-merge.

data onetomany;

infile 'a:\employeeid.dat' truncover;

input IdNo $ 1-11 Last $13-21 First $ 22-31 LID $ 32-34

Salary 40-45 Sex $ 47 Performance $ 49-55;

proc sort;

by LID;

proc print;

Title 'EMPLOYEE DATA SET SORTED BY LOCATION ID';

run;

data locations;

infile 'a:\locationid.dat';

input LID $ 1-3 City $5-16 Address $ 17-35 State $ 37-38 Zip 40-44 area 46-48 Phone $ 50-57;

proc sort;

by LID;

proc print;

Title 'Location Id for Employees ';

run;

data locationmatch;

merge onetomany locations;

by LID;

proc print;

var last first LID city address state;

title 'EMPLOYEES LISTED WITH THEIR LOCATIONS';

run;

The output from this program follows:

EMPLOYEES LISTED WITH THEIR LOCATIONS

Obs Last First LID City Address State

1 Adams Jennifer L01 Atlanta 450 Peachtree Road GA

2 Coulter Tracey L01 Atlanta 450 Peachtree Road GA

3 Smith Frank L01 Atlanta 450 Peachtree Road GA

4 Brown Marietta L01 Atlanta 450 Peachtree Road GA

5 Drubin Lolly L01 Atlanta 450 Peachtree Road GA

6 Milgrom Pamela L02 Boston 3 Commons Blvd MA

7 Manin Ann L02 Boston 3 Commons Blvd MA

8 Rubin Patricia L02 Boston 3 Commons Blvd MA

9 Charles Kenneth L02 Boston 3 Commons Blvd MA

10 Bocholis Sue L02 Boston 3 Commons Blvd MA

11 Johnson James L03 Chicago 500 Loop Highway IL

12 Smith Mary L03 Chicago 500 Loop Highway IL

13 Adamson David L03 Chicago 500 Loop Highway IL

14 Marder Kelly L03 Chicago 500 Loop Highway IL

15 Flint Arnold L03 Chicago 500 Loop Highway IL

16 Newcomm Henry L03 Chicago 500 Loop Highway IL

17 Marlin Billy L04 Miami 1000 Kirkman Road FL

18 Frank Vernon L04 Miami 1000 Kirkman Road FL

19 Erpf Dan L04 Miami 1000 Kirkman Road FL

The locations data set is separate from the names of the employees because if one of the locations changes, then only the locations data set needs to be updated. This means that each employee that lives in that location in the data set won’t have to be changed separately. Look at this next example where L04 will be changed to a new location. It is going to move to Phoenix. The change is only made in the Locations data set.

LID City Address ST Zip Area Phone

L01 Atlanta 450 Peachtree Road GA 30316 404 333-5555

L02 Boston 3 Commons Blvd MA 02190 617 123-4444

L03 Chicago 500 Loop Highway IL 60620 312 444-6666

L04 Phoenix 1801 Sunny Lane AR 78210 602 881-9978

Notice now when the program is run again, the employees in location L04 are now located in Phoenix instead of Miami.

EMPLOYEES LISTED WITH THEIR NEW LOCATIONS (L04)

Obs Last First LID City Address State

1 Adams Jennifer L01 Atlanta 450 Peachtree Road GA

2 Coulter Tracey L01 Atlanta 450 Peachtree Road GA

3 Smith Frank L01 Atlanta 450 Peachtree Road GA

4 Brown Marietta L01 Atlanta 450 Peachtree Road GA

5 Drubin Lolly L01 Atlanta 450 Peachtree Road GA

6 Milgrom Pamela L02 Boston 3 Commons Blvd MA

7 Manin Ann L02 Boston 3 Commons Blvd MA

8 Rubin Patricia L02 Boston 3 Commons Blvd MA

9 Charles Kenneth L02 Boston 3 Commons Blvd MA

10 Bocholis Sue L02 Boston 3 Commons Blvd MA

11 Johnson James L03 Chicago 500 Loop Highway IL

12 Smith Mary L03 Chicago 500 Loop Highway IL

13 Adamson David L03 Chicago 500 Loop Highway IL

14 Marder Kelly L03 Chicago 500 Loop Highway IL

15 Flint Arnold L03 Chicago 500 Loop Highway IL

16 Newcomm Henry L03 Chicago 500 Loop Highway IL

17 Marlin Billy L04 Phoenix 1801 Sunny Lane AR

18 Frank Vernon L04 Phoenix 1801 Sunny Lane AR

19 Erpf Dan L04 Phoenix 1801 Sunny Lane AR

The merge statement is one that you should become familiar with as there are many times that merging data sets together may be used.

SAS E-Learning

Friday, 23 November 2012

SAS Class On Merge

No comments:

Post a Comment

About Me