Join stage requirements

Post questions here relative to DataStage Enterprise/PX Edition for such areas as Parallel job design, Parallel datasets, BuildOps, Wrappers, etc.

Moderators: chulett, rschirm, roy

Post Reply
pandeesh
Premium Member
Premium Member
Posts: 1399
Joined: Sun Oct 24, 2010 5:15 am
Location: CHENNAI, TAMIL NADU

Join stage requirements

Post by pandeesh »

Hi,

My requirement is to join two datasets based on 2 join keys.
Our client asked me to partition based on one key but sort on both join keys.
But they told me i should use inlink sort not explicit sort stage before join.
I wonder how we can do without using sort stage.
I am using hash partition on that column.
That implies, if i check perform sort,the sort will happen based on that column.
As a prerequisite for join stage, the data should be partitioned and sorted on join keys.
But i don't know why they are asking me to partition on only one key.
And one more thing is,if i use hash partitioning on the input data, is there any need of sorting, since i believe in the partition hash itself sorting will also happen.
Correct me if i am wrong.
Thanks
pandeeswaran
jwiles
Premium Member
Premium Member
Posts: 1274
Joined: Sun Nov 14, 2004 8:50 pm
Contact:

Post by jwiles »

The input link sort and an explicit sort stage will both use the tsort operator, so operationally there is no difference. I personally prefer using an explicit sort stage as it's more visible and provides more options than an inlink sort, but that's just my preference.

Partitioning and Sorting are two different actions. Performing one does not perform the other. Hash partitioning does not also sort the data.

To setup your inlink sorts, select all of the column you need to sort on and as you have seen, check the Perform Sort option. Then, on each column that you selected, right-click and select Usage... You can choose from three options: "Sorting", "Partitioning" and "Sorting, Partitioning". Select "Sorting, Partitioning" for only your primary key and "Sorting" for the other.

Partitioning on only the primary key is valid...there has never been a requirement that you must partition on all sort keys.

Regards,
- james wiles


All generalizations are false, including this one - Mark Twain.
pandeesh
Premium Member
Premium Member
Posts: 1399
Joined: Sun Oct 24, 2010 5:15 am
Location: CHENNAI, TAMIL NADU

Post by pandeesh »

Thanks James!
I believe there wont be any difference even if I use both the join keys for partitioning .
pandeeswaran
ray.wurlod
Participant
Posts: 54607
Joined: Wed Oct 23, 2002 10:52 pm
Location: Sydney, Australia
Contact:

Post by ray.wurlod »

Of course there will be a difference. Figure out the partitioning results on paper. The records will definitely be partitioned differently. However, the join results will still be accurate provided that both inputs are partitioned and sorted (on the join keys) identically.
IBM Software Services Group
Any contribution to this forum is my own opinion and does not necessarily reflect any position that IBM may hold.
PhilHibbs
Premium Member
Premium Member
Posts: 1044
Joined: Wed Sep 29, 2004 3:30 am
Location: Nottingham, UK
Contact:

Post by PhilHibbs »

Partitioning must be a superset of the key that you are using for joining, or aggregating, or whatever. For example, if you are counting people by address, then you can partition by postcode (or zip code, or regional equivalent).

Having more keys than is necessary in the partition does add a slight overhead to the calculation of the hash value that is used to partition the data. iIf you have 6 lines of address and a postcode, and the address averages about 100 characters for the whole thing, then you're calculating a hash on 100 characters when you could just calculate it on the 10 characters of the postcode. This becomes even worse if your 6 lines of address are all padded with spaces, you could be calculating the hash of several thousand characters.

Your hash key should therefore be the shortest value that provides the most variety for your data. You may also find that partitioning the data once by a single key will do for several operations, whereas putting too fine a granularity into the key may mean that you need to repartition later on.
Last edited by PhilHibbs on Tue Nov 08, 2011 11:55 am, edited 1 time in total.
Phil Hibbs | Capgemini
Technical Consultant
fmou
Participant
Posts: 124
Joined: Sat May 28, 2011 9:48 pm

Post by fmou »

Partitioning must be a superset of the key that you are using for joining, or aggregating, or whatever.
Hi Phil,

Did you mean subset of the keys?
Sorry, not trying to be nitpicky here but to make it crystal clear.

Thanks
PhilHibbs
Premium Member
Premium Member
Posts: 1044
Joined: Wed Sep 29, 2004 3:30 am
Location: Nottingham, UK
Contact:

Post by PhilHibbs »

fmou wrote:
Partitioning must be a superset of the key that you are using for joining, or aggregating, or whatever.
Hi Phil,

Did you mean subset of the keys?
Sorry, not trying to be nitpicky here but to make it crystal clear.

Thanks
If your join or aggregate key is POSTCODE and HOUSENUM, and you partition by POSTCODE, then your partitions will be a superset of the data sets that you are processing. You don't want to have a subset of your key values in a partition, as the rest of the key values will be in a different partition and won't get processed properly.

I see the source of confusion, though, if you treat the key list as a set of fields then POSTCODE is a subset of (POSTCODE, HOUSENUM) but I was referring to data sets, I was not treating the set criteria itself as a set.
Phil Hibbs | Capgemini
Technical Consultant
fmou
Participant
Posts: 124
Joined: Sat May 28, 2011 9:48 pm

Post by fmou »

I see the source of confusion, though, if you treat the key list as a set of fields then POSTCODE is a subset of (POSTCODE, HOUSENUM)
Thanks for the clarification, Phil.
fmou
Participant
Posts: 124
Joined: Sat May 28, 2011 9:48 pm

Post by fmou »

double post deleted.
Post Reply