Join stage requirements
Moderators: chulett, rschirm, roy
Join stage requirements
Hi,
My requirement is to join two datasets based on 2 join keys.
Our client asked me to partition based on one key but sort on both join keys.
But they told me i should use inlink sort not explicit sort stage before join.
I wonder how we can do without using sort stage.
I am using hash partition on that column.
That implies, if i check perform sort,the sort will happen based on that column.
As a prerequisite for join stage, the data should be partitioned and sorted on join keys.
But i don't know why they are asking me to partition on only one key.
And one more thing is,if i use hash partitioning on the input data, is there any need of sorting, since i believe in the partition hash itself sorting will also happen.
Correct me if i am wrong.
Thanks
My requirement is to join two datasets based on 2 join keys.
Our client asked me to partition based on one key but sort on both join keys.
But they told me i should use inlink sort not explicit sort stage before join.
I wonder how we can do without using sort stage.
I am using hash partition on that column.
That implies, if i check perform sort,the sort will happen based on that column.
As a prerequisite for join stage, the data should be partitioned and sorted on join keys.
But i don't know why they are asking me to partition on only one key.
And one more thing is,if i use hash partitioning on the input data, is there any need of sorting, since i believe in the partition hash itself sorting will also happen.
Correct me if i am wrong.
Thanks
pandeeswaran
The input link sort and an explicit sort stage will both use the tsort operator, so operationally there is no difference. I personally prefer using an explicit sort stage as it's more visible and provides more options than an inlink sort, but that's just my preference.
Partitioning and Sorting are two different actions. Performing one does not perform the other. Hash partitioning does not also sort the data.
To setup your inlink sorts, select all of the column you need to sort on and as you have seen, check the Perform Sort option. Then, on each column that you selected, right-click and select Usage... You can choose from three options: "Sorting", "Partitioning" and "Sorting, Partitioning". Select "Sorting, Partitioning" for only your primary key and "Sorting" for the other.
Partitioning on only the primary key is valid...there has never been a requirement that you must partition on all sort keys.
Regards,
Partitioning and Sorting are two different actions. Performing one does not perform the other. Hash partitioning does not also sort the data.
To setup your inlink sorts, select all of the column you need to sort on and as you have seen, check the Perform Sort option. Then, on each column that you selected, right-click and select Usage... You can choose from three options: "Sorting", "Partitioning" and "Sorting, Partitioning". Select "Sorting, Partitioning" for only your primary key and "Sorting" for the other.
Partitioning on only the primary key is valid...there has never been a requirement that you must partition on all sort keys.
Regards,
- james wiles
All generalizations are false, including this one - Mark Twain.
All generalizations are false, including this one - Mark Twain.
-
- Participant
- Posts: 54607
- Joined: Wed Oct 23, 2002 10:52 pm
- Location: Sydney, Australia
- Contact:
Of course there will be a difference. Figure out the partitioning results on paper. The records will definitely be partitioned differently. However, the join results will still be accurate provided that both inputs are partitioned and sorted (on the join keys) identically.
IBM Software Services Group
Any contribution to this forum is my own opinion and does not necessarily reflect any position that IBM may hold.
Any contribution to this forum is my own opinion and does not necessarily reflect any position that IBM may hold.
-
- Premium Member
- Posts: 1044
- Joined: Wed Sep 29, 2004 3:30 am
- Location: Nottingham, UK
- Contact:
Partitioning must be a superset of the key that you are using for joining, or aggregating, or whatever. For example, if you are counting people by address, then you can partition by postcode (or zip code, or regional equivalent).
Having more keys than is necessary in the partition does add a slight overhead to the calculation of the hash value that is used to partition the data. iIf you have 6 lines of address and a postcode, and the address averages about 100 characters for the whole thing, then you're calculating a hash on 100 characters when you could just calculate it on the 10 characters of the postcode. This becomes even worse if your 6 lines of address are all padded with spaces, you could be calculating the hash of several thousand characters.
Your hash key should therefore be the shortest value that provides the most variety for your data. You may also find that partitioning the data once by a single key will do for several operations, whereas putting too fine a granularity into the key may mean that you need to repartition later on.
Having more keys than is necessary in the partition does add a slight overhead to the calculation of the hash value that is used to partition the data. iIf you have 6 lines of address and a postcode, and the address averages about 100 characters for the whole thing, then you're calculating a hash on 100 characters when you could just calculate it on the 10 characters of the postcode. This becomes even worse if your 6 lines of address are all padded with spaces, you could be calculating the hash of several thousand characters.
Your hash key should therefore be the shortest value that provides the most variety for your data. You may also find that partitioning the data once by a single key will do for several operations, whereas putting too fine a granularity into the key may mean that you need to repartition later on.
Last edited by PhilHibbs on Tue Nov 08, 2011 11:55 am, edited 1 time in total.
Phil Hibbs | Capgemini
Technical Consultant
Technical Consultant
-
- Premium Member
- Posts: 1044
- Joined: Wed Sep 29, 2004 3:30 am
- Location: Nottingham, UK
- Contact:
If your join or aggregate key is POSTCODE and HOUSENUM, and you partition by POSTCODE, then your partitions will be a superset of the data sets that you are processing. You don't want to have a subset of your key values in a partition, as the rest of the key values will be in a different partition and won't get processed properly.fmou wrote:Hi Phil,Partitioning must be a superset of the key that you are using for joining, or aggregating, or whatever.
Did you mean subset of the keys?
Sorry, not trying to be nitpicky here but to make it crystal clear.
Thanks
I see the source of confusion, though, if you treat the key list as a set of fields then POSTCODE is a subset of (POSTCODE, HOUSENUM) but I was referring to data sets, I was not treating the set criteria itself as a set.
Phil Hibbs | Capgemini
Technical Consultant
Technical Consultant