Join stage requirements

pandeesh · Post by **pandeesh** » Fri Nov 04, 2011 11:03 am

Hi,

My requirement is to join two datasets based on 2 join keys.
Our client asked me to partition based on one key but sort on both join keys.
But they told me i should use inlink sort not explicit sort stage before join.
I wonder how we can do without using sort stage.
I am using hash partition on that column.
That implies, if i check perform sort,the sort will happen based on that column.
As a prerequisite for join stage, the data should be partitioned and sorted on join keys.
But i don't know why they are asking me to partition on only one key.
And one more thing is,if i use hash partitioning on the input data, is there any need of sorting, since i believe in the partition hash itself sorting will also happen.
Correct me if i am wrong.
Thanks

jwiles · Post by **jwiles** » Fri Nov 04, 2011 1:39 pm

The input link sort and an explicit sort stage will both use the tsort operator, so operationally there is no difference. I personally prefer using an explicit sort stage as it's more visible and provides more options than an inlink sort, but that's just my preference.

Partitioning and Sorting are two different actions. Performing one does not perform the other. Hash partitioning does not also sort the data.

To setup your inlink sorts, select all of the column you need to sort on and as you have seen, check the Perform Sort option. Then, on each column that you selected, right-click and select Usage... You can choose from three options: "Sorting", "Partitioning" and "Sorting, Partitioning". Select "Sorting, Partitioning" for only your primary key and "Sorting" for the other.

Partitioning on only the primary key is valid...there has never been a requirement that you must partition on all sort keys.

Regards,

pandeesh · Post by **pandeesh** » Sat Nov 05, 2011 3:49 am

Thanks James!
I believe there wont be any difference even if I use both the join keys for partitioning .

ray.wurlod · Post by **ray.wurlod** » Sat Nov 05, 2011 5:58 am

Of course there will be a difference. Figure out the partitioning results on paper. The records will definitely be partitioned differently. However, the join results will still be accurate provided that both inputs are partitioned and sorted (on the join keys) identically.

PhilHibbs · Post by **PhilHibbs** » Tue Nov 08, 2011 11:53 am

Partitioning must be a superset of the key that you are using for joining, or aggregating, or whatever. For example, if you are counting people by address, then you can partition by postcode (or zip code, or regional equivalent).

Having more keys than is necessary in the partition does add a slight overhead to the calculation of the hash value that is used to partition the data. iIf you have 6 lines of address and a postcode, and the address averages about 100 characters for the whole thing, then you're calculating a hash on 100 characters when you could just calculate it on the 10 characters of the postcode. This becomes even worse if your 6 lines of address are all padded with spaces, you could be calculating the hash of several thousand characters.

Your hash key should therefore be the shortest value that provides the most variety for your data. You may also find that partitioning the data once by a single key will do for several operations, whereas putting too fine a granularity into the key may mean that you need to repartition later on.

fmou · Post by **fmou** » Wed Nov 09, 2011 10:19 am

Partitioning must be a superset of the key that you are using for joining, or aggregating, or whatever.

Hi Phil,

Did you mean subset of the keys?
Sorry, not trying to be nitpicky here but to make it crystal clear.

Thanks

PhilHibbs · Post by **PhilHibbs** » Wed Nov 09, 2011 10:34 am

fmou wrote:
Partitioning must be a superset of the key that you are using for joining, or aggregating, or whatever.
Hi Phil,

Did you mean subset of the keys?
Sorry, not trying to be nitpicky here but to make it crystal clear.

Thanks

If your join or aggregate key is POSTCODE and HOUSENUM, and you partition by POSTCODE, then your partitions will be a superset of the data sets that you are processing. You don't want to have a subset of your key values in a partition, as the rest of the key values will be in a different partition and won't get processed properly.

I see the source of confusion, though, if you treat the key list as a set of fields then POSTCODE is a subset of (POSTCODE, HOUSENUM) but I was referring to data sets, I was not treating the set criteria itself as a set.

fmou · Post by **fmou** » Wed Nov 09, 2011 11:06 am

I see the source of confusion, though, if you treat the key list as a set of fields then POSTCODE is a subset of (POSTCODE, HOUSENUM)

Thanks for the clarification, Phil.

fmou · Post by **fmou** » Wed Nov 09, 2011 11:08 am

double post deleted.