Which partition type I need to use in the JOIN stage

igorbmartins · Post by **igorbmartins** » Tue Dec 18, 2012 4:46 am

Hi guys, someone could help me? I need to do a JOIN, so I usually do this in two ways.

1. I select the patition type as Hash and I select my keys. I do this in both links (in the main link and in the reference link)
2. I select the partition type as Round robin in my main link and I select the partition type as Entire in my reference link.

I always did this in my JOBs and they always works, but today I did this and the result was wrong. The job only worked after I set the option Perform sort in both links.

Someone could explain me why this is happening?

Thanks,
Igor Bastos Martins

chandra.shekhar@tcs.com · Tue Dec 18, 2012 4:55 am

You don't need to explicitly partition and sort the data, choosing the auto partitioning method will ensure that partitioning and sorting is done.

igorbmartins · Post by **igorbmartins** » Tue Dec 18, 2012 6:26 am

Chandra, thanks for your reply, but why the above settings didn't work?

As I told I always did it and always worked.

Thanks,
Igor Bastos Martins

BI-RMA · Post by **BI-RMA** » Tue Dec 18, 2012 8:49 am

As Chandra said, DataStage will usually do necessary sorting and partitioning for you when using the "Auto"-partitioning feature and if you do not prevent adding sort-operators by parameter (APT_NO_SORT_INSERTION). As soon as You switch to manual mode, DataStage assumes that you take control of partitioning and sorting and won't override - unless it has knowledge of wrong sort-decisions (from an explicit sort in the same job that does not meet the requirements of a stage).

Join needs the data correctly partitioned and sorted from all input-links.

Join does not distinguish between stream and reference-data. All input-links are treated as stream. If your set of reference-data is small enough for Entire-Partioning, use Lookup instead of Join. Preloading reference-data to memory is more efficient in these cases and the lookup-stage gives you more options then join.

SURA · Post by **SURA** » Wed Dec 19, 2012 12:51 pm

chandra.shekhar@tcs.com wrote:You don't need to explicitly partition and sort the data, choosing the auto partitioning method will ensure that partitioning and sorting is done.

Does auto partition will work without flaw for any volume of data?

Especially in JOIN / LOOKUP?

bobyon · Post by **bobyon** » Wed Dec 19, 2012 2:17 pm

Personally, I prefer the method you are taking instead of using the auto partitioning that has been previously recommended. Yes, Auto normally works, but I like to be as explicit as possible. It ensures that the developer has considered sorting and partitioning thoroughly.

As to why your job suddenly is not producing accurate results; I could only guess, but from what you have said, it sounds like the data was not being sorted previously.

Here again, I like to be very explicit and will most often insert a sort stage prior to the join (or lookup) rather than using an "inline" sort. Partitioning alone is not sufficient. The data must be both sorted AND partitioned properly to produce accurate results.

chandra.shekhar@tcs.com · Thu Dec 20, 2012 5:09 am

Does auto partition will work without flaw for any volume of data?

Especially in JOIN / LOOKUP?

Yes, it works all the time.
In my project we process lot of data(my client is the biggest Bank of India

).
We always use auto partitioning and till now there's no discrepancy in the data.
There's NO harm in explicitly partitioning and sorting the data.
As babyon said you have to sort the data too, try it once.

ray.wurlod · Post by **ray.wurlod** » Thu Dec 20, 2012 4:08 pm

Once more with feeling...

(Auto) partitioning yields the following results.

On a sequential-to-parallel link --> Round Robin

On a parallel-to-parallel link --> Same

(On a sequential-to-sequential link partitioning is not applicable.)

On the reference input link to a Lookup stage --> Entire

On an input link to a DB2 stage --> DB2

On an input link to a stage that requires key-partitioned input --> Hash

Even though it guarantees that any lookup will work, Entire partitioning may be sub-optimal in a multi-machine configuration, as all rows have to be distributed to all machines. LookupKey-based partitioning algorithm on source and reference inputs may be preferred. In an SMP environment this is not an issue, as one copy of the reference data is loaded into shared memory that all nodes can see.