Partitioning on the Join Stage

Post questions here relative to DataStage Enterprise/PX Edition for such areas as Parallel job design, Parallel datasets, BuildOps, Wrappers, etc.

Moderators: chulett, rschirm, roy

Post Reply
jzparad
Charter Member
Charter Member
Posts: 151
Joined: Thu Apr 01, 2004 9:37 pm

Partitioning on the Join Stage

Post by jzparad »

Hi all,

Could someone please clarify whether or not the join stage requires that data be hash partitioned and sorted on the join key?

The following is an extract from the manuals and, to me, it reads like you can choose auto partitioning.

<i>The data sets input to the Join stage must be key partitioned and
sorted. This ensures that rows with the same key column values are
located in the same partition and will be processed by the same node.
It also minimizes memory requirements because fewer rows need to
be in memory at any one time. Choosing the auto partitioning method
will ensure that partitioning and sorting is done. If sorting and
partitioning are carried out on separate stages before the Join stage,
DataStage in auto mode will detect this and not repartition
(alternatively you could explicitly specify the Same partitioning
method).</i>
Jim Paradies
ray.wurlod
Participant
Posts: 54607
Joined: Wed Oct 23, 2002 10:52 pm
Location: Sydney, Australia
Contact:

Post by ray.wurlod »

It does.
If you don't, DataStage will insert tsort operators to ensure that they are.
You can see this happening if you dump the score.

Identical key-basd partitioning guarantees that every key can find its buddy. (You can also use Entire partitioning on the reference Data Set to achieve the same goal.)

Rows sorted on each partition means that a very efficient memory management technique can be used.
IBM Software Services Group
Any contribution to this forum is my own opinion and does not necessarily reflect any position that IBM may hold.
jzparad
Charter Member
Charter Member
Posts: 151
Joined: Thu Apr 01, 2004 9:37 pm

Post by jzparad »

Thanks Ray!

Just to clarify, are you saying that DataStage will partition AND sort the data based on the join key if you use the "Auto" method?

The reason I'm asking is because in our project all joins are normally hash-partitioned and sorted on the key. However one job slipped through with "Auto" partitioning and no sorting. The job worked through our testing phase and on our initial load but, through sheer chance, we were testing it for another problem and found that the join was not working. We converted to hash partitioning (and sorting) and everything worked fine.

Any thoughts?
Jim Paradies
ray.wurlod
Participant
Posts: 54607
Joined: Wed Oct 23, 2002 10:52 pm
Location: Sydney, Australia
Contact:

Post by ray.wurlod »

The inserted tsort operator does not do any partitioning.

(Auto) does not guarantee hash partitioning. In fact, it's possible (probable) that it will choose a keyless algorithm, such as Round Robin, or Same. It may be possible to tell from the score which algorithm was chosen.
IBM Software Services Group
Any contribution to this forum is my own opinion and does not necessarily reflect any position that IBM may hold.
kumar_s
Charter Member
Charter Member
Posts: 5245
Joined: Thu Jun 16, 2005 11:00 pm

Post by kumar_s »

This is the reason why the Hash partition is preferred over Auto. If one stream is already partition in previous stages, it may be maintained as Same and the other might be Hash partitioned. Sort might be inserted, but it again doesnt help you much, if the partition are different.
Impossible doesn't mean 'it is not possible' actually means... 'NOBODY HAS DONE IT SO FAR'
Post Reply