Which partition type I need to use in the JOIN stage

Post questions here relative to DataStage Enterprise/PX Edition for such areas as Parallel job design, Parallel datasets, BuildOps, Wrappers, etc.

Moderators: chulett, rschirm, roy

Post Reply
igorbmartins
Participant
Posts: 161
Joined: Mon Mar 17, 2008 10:33 am

Which partition type I need to use in the JOIN stage

Post by igorbmartins »

Hi guys, someone could help me? I need to do a JOIN, so I usually do this in two ways.

1. I select the patition type as Hash and I select my keys. I do this in both links (in the main link and in the reference link)
2. I select the partition type as Round robin in my main link and I select the partition type as Entire in my reference link.

I always did this in my JOBs and they always works, but today I did this and the result was wrong. The job only worked after I set the option Perform sort in both links.

Someone could explain me why this is happening?

Thanks,
Igor Bastos Martins
chandra.shekhar@tcs.com
Premium Member
Premium Member
Posts: 353
Joined: Mon Jan 17, 2011 5:03 am
Location: Mumbai, India

Post by chandra.shekhar@tcs.com »

You don't need to explicitly partition and sort the data, choosing the auto partitioning method will ensure that partitioning and sorting is done.
Thanx and Regards,
ETL User
igorbmartins
Participant
Posts: 161
Joined: Mon Mar 17, 2008 10:33 am

Post by igorbmartins »

Chandra, thanks for your reply, but why the above settings didn't work?

As I told I always did it and always worked.

Thanks,
Igor Bastos Martins
BI-RMA
Premium Member
Premium Member
Posts: 463
Joined: Sun Nov 01, 2009 3:55 pm
Location: Hamburg

Post by BI-RMA »

As Chandra said, DataStage will usually do necessary sorting and partitioning for you when using the "Auto"-partitioning feature and if you do not prevent adding sort-operators by parameter (APT_NO_SORT_INSERTION). As soon as You switch to manual mode, DataStage assumes that you take control of partitioning and sorting and won't override - unless it has knowledge of wrong sort-decisions (from an explicit sort in the same job that does not meet the requirements of a stage).

Join needs the data correctly partitioned and sorted from all input-links.

Join does not distinguish between stream and reference-data. All input-links are treated as stream. If your set of reference-data is small enough for Entire-Partioning, use Lookup instead of Join. Preloading reference-data to memory is more efficient in these cases and the lookup-stage gives you more options then join.
"It is not the lucky ones are grateful.
There are the grateful those are happy." Francis Bacon
SURA
Premium Member
Premium Member
Posts: 1229
Joined: Sat Jul 14, 2007 5:16 am
Location: Sydney

Post by SURA »

chandra.shekhar@tcs.com wrote:You don't need to explicitly partition and sort the data, choosing the auto partitioning method will ensure that partitioning and sorting is done.
Does auto partition will work without flaw for any volume of data?

Especially in JOIN / LOOKUP?
Thanks
Ram
----------------------------------
Revealing your ignorance is fine, because you get a chance to learn.
bobyon
Premium Member
Premium Member
Posts: 200
Joined: Tue Mar 02, 2004 10:25 am
Location: Salisbury, NC

Post by bobyon »

Personally, I prefer the method you are taking instead of using the auto partitioning that has been previously recommended. Yes, Auto normally works, but I like to be as explicit as possible. It ensures that the developer has considered sorting and partitioning thoroughly.

As to why your job suddenly is not producing accurate results; I could only guess, but from what you have said, it sounds like the data was not being sorted previously.

Here again, I like to be very explicit and will most often insert a sort stage prior to the join (or lookup) rather than using an "inline" sort. Partitioning alone is not sufficient. The data must be both sorted AND partitioned properly to produce accurate results.
Bob
chandra.shekhar@tcs.com
Premium Member
Premium Member
Posts: 353
Joined: Mon Jan 17, 2011 5:03 am
Location: Mumbai, India

Post by chandra.shekhar@tcs.com »

Does auto partition will work without flaw for any volume of data?

Especially in JOIN / LOOKUP?
Yes, it works all the time.
In my project we process lot of data(my client is the biggest Bank of India 8)).
We always use auto partitioning and till now there's no discrepancy in the data.
There's NO harm in explicitly partitioning and sorting the data.
As babyon said you have to sort the data too, try it once.
Thanx and Regards,
ETL User
ray.wurlod
Participant
Posts: 54607
Joined: Wed Oct 23, 2002 10:52 pm
Location: Sydney, Australia
Contact:

Post by ray.wurlod »

Once more with feeling...

(Auto) partitioning yields the following results.
  • On a sequential-to-parallel link --> Round Robin

    On a parallel-to-parallel link --> Same

    (On a sequential-to-sequential link partitioning is not applicable.)

    On the reference input link to a Lookup stage --> Entire

    On an input link to a DB2 stage --> DB2

    On an input link to a stage that requires key-partitioned input --> Hash
Even though it guarantees that any lookup will work, Entire partitioning may be sub-optimal in a multi-machine configuration, as all rows have to be distributed to all machines. LookupKey-based partitioning algorithm on source and reference inputs may be preferred. In an SMP environment this is not an issue, as one copy of the reference data is loaded into shared memory that all nodes can see.
IBM Software Services Group
Any contribution to this forum is my own opinion and does not necessarily reflect any position that IBM may hold.
Post Reply