Problem with join stage

Bilwakunj · Post by **Bilwakunj** » Tue Apr 12, 2005 8:23 pm

Hi,
I've 2 datasets. The left one has got 12 columns and the righht one has got 5 columns. out of 5 , 4 are the key columns which are common in left as well as right, their names are also same in both.
I'm trying to do the left outer join with data sets by joining on the key columns and in the derivation col, i'm using value of the 5th column from right dataset instead of the original derivation of column from left dataset. Also I'm using "Hash" partion and partitioning on the key columns.
My problem is for the "left outer join", if 2 rows are coming from each dataset with matching key the o/p shows 4 rows, same for the "inner join". If I try to do the same using Auto, there is no o/p even if job is finished. what I'm trying to do is, updation of the col. of the left data set with the right one when there is match on the key columns.So if the left dataset has got 2 rows , i need only 2 in o/p with the updated value from right dataset.
Thanks in advance..

richdhan · Post by **richdhan** » Wed Apr 13, 2005 1:11 am

Hi Bilwakunj,

My first thought is that there might be a problem with the partitioning. Yes you should use hash partitioning on keys for both the input links to the join stage. Check whether the order of keys in the hash partition is same in both the input links.

My suggestion is that you run the job using the default configuration file(which uses single node) using auto partitioning for both the input links of the join stage. If you get the desired results then it is definitely a problem with partitioning you have used.

If you dont get the desired results it might be a problem with the data. It might have duplicates.

HTH
Rich

bcarlson · Post by **bcarlson** » Wed Apr 13, 2005 11:56 am

If what you say is true that you have 2 rows from left and from right, with the same keys, then the output of the join is correct. Take the following rows, for example (ignore code reference, just used code formatting to keep data formatted)...

Code: Select all

Row    Left    Right
1         a        a
2         a        a

In the join (inner or left-outer) the results will be as follows:

row1 left-a joins to row1 right-a and row2 right-a = 2 output records
row2 left-a joins to row1 right-a and row2 right-a = 2 output records

Each row from the left will have 2 joins on the right for a total of 4 output records.

If you are looking for a distinct set of keys in the output, either dedup before the join or after. tsort will do a unique sort, but you don't have a lot of control over which records get kept. We use the remdups stage instead. It's input is a sorted data stream (sort on all key columns and any supplimental columns). The remdups stage then specifies the same key columns and specifies whether to keep the first or last occurrence of a distinct set of keys.

join -> tsort (keys and data fields) -> remdups (keys)

The sort is important. Say you have a result set of customers, phone numbers, and effective date. 1 customer may have multiple phone numbers, some effective at the same time, but you want only 1. Sort by customer (key, sort ascending) and effective date (data, ascending). Then remdups uses customer as the key. If you want the most recent, specify the last occurence in the remdup stage (or first occurence to get the oldest).

Hope this helps.