MERGE
STATEMENTS
The MERGE statement joins rows from two
or more SAS data sets into single rows in a new SAS data set.
Up to 50 data set names can appear in a
merge statement. There are two MERGE
statements.
One-to-One Merging
The merge joins the first row in
one data set with the first row in another.
If a column occurs in more than one data
set (such as class), only one column of that name occurs in the new data
set. The value of the column is
the value in the data set listed latest (rightmost) in the MERGE
statement that contains the column.
The
two data sets that are involved in the merge are listed below:
1. data driver; 2. data vehicle;
input name $ 1-6 city $ 8-16; input year 1-4 model $
6-12;
datalines; datalines;
cathy portland 1985 sedan
nancy raleigh 1952 jeep
sue nashville 1978 bus
proc print; proc print;
title 'Data Set
Driver'; title
'Data Set Vehicle';
run; run;
OUTPUT ON TWO DATA SETS
Data Set Driver Data
Set Vehicle
obs name city obs year model
1 cathy portland 1 1995 sedan
2 nancy raleigh 2 1982 jeep
3 sue nashville 3 1990 bus
MERGING
ONE TO ONE
Data match;
merge driver vehicle;
proc print;
title 'Data Set Match';
OUTPUT
OF MERGED SET--MATCH
__________________________________________
Data
Set Match
_____________________________________
obs name city year model
1
cathy portland 1995 sedan
2 nancy raleigh 1982 jeep
3
sue nashville 1990 bus
Match-Merging with a BY Statement
If you want to match rows from two or
more data sets based on the values of some columns, use a BY statement. In order to match merge, at least one
column must be common to all data sets and each data set must be sorted
by these columns. An example is
below:
data person; data place;
input name $ 1-4 sex $ 6; input name $1-4
city $ 6-10 region 12;
datalines; datalines;
mary
f jose
erie 5
ann f mary
miami 2
tom
m mary
tampa 7
proc print; ann tampa
6
title 'Data Set Person'; proc print;
title 'Data Set Place';
INPUT
FOR MATCH-MERGE
Data
Set Person Data
Set Place
obs name sex obs name city region
1 mary f 1 jose erie 5
2 ann f 2 mary miami 2
3 tom m 3 mary tampa 7
4 ann tampa 6
The program to merge these two data sets
follows;
Data person;
input name $ 1-4 sex $ 6;
proc sort; by name;
Data place;
input name $ 1-4 $ city $ 6-10 region
12;
proc sort; by name;
Data result;
merge person place;
by name;
proc print;
title 'Data Set Result';
Data Set Result
_______________________________
obs name sex city region
1 ann f tampa 6
2 jose erie
5
3 mary f miami 2
4 mary tampa 7
5
tom m .
You are merging each row in the data set
Person with rows in Place that has a matching value of the common column name.
Notice what happens when one of the data
sets does not have a match for the other one.
There is data missing from one of the rows listed in the output.
Variations of the Merge Statement Using
the (IN=) Column
Sometimes you will only want to keep
rows that appear in all data sets being merged.
You can use the (IN= ) column. A
column is created with the name following the =. The value of the column is 1 if the data set
contributed data to the current row. It
is 0 otherwise. The column is not added
to the data set.
Examples of (IN=) Column
file1 file2
___________________________
name gpa major name sex home gpa
andrew 2.0 comsc andrew m r 2.5
graham 3.9 chem
graham m o
2.0
helen 3.5 busad janet
f r 3.4
janet 2.5 homec thomas m o
3.3
_____________________________
Using the (IN=) on Both Data sets
Proc sort data=file1; by name;
Proc sort data=file2; by name;
data both;
merge
file1(in=in1) file2(in=in2);
by name;
if in1 and in2;
proc print;
title 'Data Set Both';
Data Set Both
_________________________________________
name gpa major sex home
andrew 2.5 comsc m r
graham 2.0 chem m o
janet 3.4 homec f r
__________________________________________
Note:
Only those names that appear in each data set are shown here. The others are not included. HELEN and THOMAS had no match and were not
listed.
Using (IN=) with Only One Data Set
Other variations can be used also. The (In= ) can be used with only one data
set. If used with the first data set, it
will list all of those names in the first list even if there is no match.
However, those in the second set who do not have a match in the first, will
not be listed.
An example is:
Data both; merge file1(in=in1) file2;
by name;
if in1;
proc print;
Data
Set Both
____________________________________________
name gpa major sex home
andrew 2.5 comsc m r
graham 2.0 chem m o
helen 3.5 busad
janet 3.4 homec f r
__________________________________________________
Another variation is when names in the
second list are listed even if there is no match. Those in the first data set that do not have
a match in the second will not be listed.
An example is:
Data both;
Merge file1 file2(in=in2); by
name;
if in2;
proc print;
___________________________________________
name gpa major sex home
andrew 2.5 comsc m r
graham 2.0 chem m o
janet 3.4 homec f r
thomas 3.3 m o
_________________________________________________
Using the (IN=) to find Non-Matches in
Each Data Set
You may also want to list only those
that do not have matches in each data set.
Data both;
merge file1(in=in1) file2(in=in2); by name;
if in1 and in2 then delete;
proc print;
__________________________________________
name gpa major sex home
helen 3.5 busad
thomas 3.3 m o
______________________________________________
Order of Data Sets Makes a Difference
You also want to make sure that you have
the data sets in the order that you want them in. You can reverse the data sets and have a
completely different data set. An
example is:
Data both;
Merge year2 year1;
by name;
proc print;
Data
Both
___________________________________
name gpa major sex
andrew 2.0 comsc m
graham 3.9 chem m
helen 3.5 busad f
janet 2.5 homec f
_________________________________________________________________
1 Reference for
this material is from: McPherson, Iris, "Statistically Speaking," UCC
Newsletter, Oklahoma State University, December 1989, pp. 7-10.
One-to-Many
Match Merge
Sometimes you need to combine two data
sets by matching one observation from one data set with more than one
observation in another. An example of
this is a data set with employees listed in it. There are two columns (LID) –
which is Location of company, and TID- which matches up with a Title
Identification. For this example, the
Location information will be matched.
See the employeeid data set below:
IdNo Last First
LID TID Salary Sex Performance
000-01-0000
Milgrom Pamela L02 T02
57500 F Average
000-02-2222 Adams Jennifer
L01 T04 19500 F Average
111-12-1111
Johnson James L03 T03
85000 M Good
123-45-6789
Coulter Tracey L01 T03 100000 F Good
222-23-2222
Marlin Billy L04 T03 125000 M Good
222-52-5555
Smith Mary L03 T02
42500 F Average
333-34-3333
Manin Ann L02 T01
49500 F Average
333-43-4444
Smith Frank L01 T01
65000 M Good
333-66-1234
Brown Marietta L01 T04
18500 F Poor
444-45-4444
Frank Vernon L04 T01
75000 M Good
555-22-3333
Rubin Patricia L02 T01
45000 F Average
555-56-5555
Charles Kenneth L02 T02
40000 M Poor
776-67-6666
Adamson David L03 T01
52000 M Poor
777-78-7777
Marder Kelly L03 T02
38500 F Average
164-01-0999
Flint Arnold L03 T02
40000 M Average
164-24-7865
Erpf Dan L04 T01
51800 M Average
166-54-6534
Drubin Lolly L01 T02
37000 F Good
167-60-1111
Bocholis Sue L02 T03 115000 F
Good
168-50-2233
Newcomm Henry L03 T02
30900 M Average
The
LocationID data set looks like this:
LID
City Address ST Zip Area Phone
L01 Atlanta 450 Peachtree Road GA 30316 404 333-5555
L02 Boston 3 Commons Blvd MA 02190 617 123-4444
L03 Chicago 500 Loop Highway IL 60620 312 444-6666
L04 Miami 1000 Kirkman Road FL 32801 407 555-5555
These data sets are merged the same way
you would do a one-to-one match merge. Before merging, the data sets need to be sorted
by the common unique columns in them.
You must have a BY statement.
HOWEVER, you do NOT use the IN= in this merge as you did in the
Match-merge.
data
onetomany;
infile
'a:\employeeid.dat' truncover;
input
IdNo $ 1-11 Last $13-21 First $ 22-31
LID $ 32-34
Salary
40-45 Sex $ 47 Performance $ 49-55;
proc
sort;
by LID;
proc
print;
Title
'EMPLOYEE DATA SET SORTED BY LOCATION ID';
run;
data
locations;
infile
'a:\locationid.dat';
input
LID $ 1-3 City $5-16 Address $ 17-35 State $ 37-38 Zip 40-44 area 46-48 Phone $ 50-57;
proc
sort;
by LID;
proc
print;
Title
'Location Id for Employees ';
run;
data
locationmatch;
merge
onetomany locations;
by
LID;
proc
print;
var last first LID city address state;
title
'EMPLOYEES LISTED WITH THEIR LOCATIONS';
run;
The output from this
program follows:
EMPLOYEES
LISTED WITH THEIR LOCATIONS
Obs Last
First LID City Address State
1 Adams Jennifer
L01 Atlanta 450 Peachtree Road GA
2 Coulter Tracey
L01 Atlanta 450 Peachtree Road GA
3 Smith Frank
L01 Atlanta 450 Peachtree Road GA
4 Brown Marietta
L01 Atlanta 450 Peachtree Road GA
5 Drubin Lolly
L01 Atlanta 450 Peachtree Road GA
6 Milgrom Pamela
L02 Boston 3
Commons Blvd MA
7 Manin Ann
L02 Boston 3 Commons Blvd MA
8 Rubin Patricia
L02 Boston 3 Commons Blvd MA
9 Charles Kenneth
L02 Boston 3 Commons Blvd MA
10 Bocholis Sue
L02 Boston 3 Commons Blvd MA
11 Johnson James
L03 Chicago 500 Loop Highway IL
12 Smith Mary
L03 Chicago 500 Loop Highway IL
13 Adamson David
L03 Chicago 500 Loop Highway IL
14 Marder Kelly
L03 Chicago 500 Loop Highway IL
15 Flint Arnold
L03 Chicago 500 Loop Highway IL
16 Newcomm Henry
L03 Chicago 500 Loop Highway IL
17 Marlin Billy
L04 Miami 1000 Kirkman Road FL
18 Frank Vernon
L04 Miami 1000 Kirkman Road FL
19 Erpf Dan L04
Miami 1000 Kirkman Road FL
The
locations data set is separate from the names of the employees because if one
of the locations changes, then only the locations data set needs to be
updated. This means that each employee
that lives in that location in the data set won’t have to be changed
separately. Look at this next example
where L04 will be changed to a new location. It is going to move to Phoenix. The change is only made in the Locations data
set.
LID City Address ST Zip Area Phone
L01 Atlanta 450 Peachtree Road GA 30316 404 333-5555
L02 Boston 3 Commons Blvd MA 02190 617 123-4444
L03 Chicago 500 Loop Highway IL 60620 312 444-6666
L04 Phoenix
1801 Sunny Lane AR 78210 602
881-9978
Notice now when
the program is run again, the employees in location L04 are now located in
Phoenix instead of Miami.
EMPLOYEES LISTED WITH THEIR NEW
LOCATIONS (L04)
Obs Last
First LID City Address State
1 Adams Jennifer
L01 Atlanta 450 Peachtree Road GA
2 Coulter Tracey
L01 Atlanta 450 Peachtree Road GA
3 Smith Frank
L01 Atlanta 450 Peachtree Road GA
4 Brown Marietta
L01 Atlanta
450 Peachtree Road GA
5 Drubin Lolly
L01 Atlanta 450 Peachtree Road GA
6 Milgrom Pamela
L02 Boston 3 Commons Blvd MA
7 Manin Ann
L02 Boston 3 Commons Blvd MA
8 Rubin Patricia
L02 Boston 3 Commons Blvd MA
9 Charles Kenneth
L02 Boston 3 Commons Blvd MA
10 Bocholis Sue
L02 Boston 3 Commons Blvd MA
11
Johnson James L03
Chicago 500 Loop Highway IL
12 Smith Mary
L03 Chicago 500 Loop Highway IL
13 Adamson David
L03 Chicago 500 Loop Highway IL
14 Marder Kelly
L03 Chicago 500 Loop Highway IL
15 Flint Arnold
L03 Chicago 500 Loop Highway IL
16 Newcomm Henry
L03 Chicago 500 Loop Highway IL
17 Marlin Billy
L04 Phoenix 1801 Sunny Lane AR
18 Frank Vernon
L04 Phoenix 1801 Sunny Lane AR
19 Erpf Dan L04
Phoenix 1801 Sunny Lane AR
The merge
statement is one that you should become familiar with as there are many times
that merging data sets together may be used.
No comments:
Post a Comment