Partitioning

Post questions here relative to DataStage Enterprise/PX Edition for such areas as Parallel job design, Parallel datasets, BuildOps, Wrappers, etc.

Moderators: chulett, rschirm, roy

xinhuang66
Participant
Posts: 161
Joined: Wed Aug 02, 2006 4:30 am

Partitioning

Post by xinhuang66 »

I have recently become very confused about partitioning in DS parallel jobs.

Previously, I felt key-based partitioning was always needed in Join, Remove Duplicates, Aggregator, Merge stages, etc. if the job is running on multiple nodes. The reason is simply that key-based partitioning is the way to guarantee that rows with the same key values fall into the same node for processing; otherwise the job will generate wrong results. That seemed to hold true in my earlier tests.
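To make this concrete, here is a minimal sketch in plain Python (made-up data and node count, not DataStage code) of what I mean: when each node joins only its own partition, round-robin partitioning can split matching keys across nodes and lose matches, while key-based (hash) partitioning keeps them together.

Code:

# Minimal sketch (plain Python, not DataStage): why a per-partition join needs
# key-based partitioning. Data, key values and node count are made up.
left  = [(1, "L1"), (2, "L2"), (3, "L3"), (4, "L4")]
right = [(4, "R4"), (3, "R3"), (2, "R2"), (1, "R1")]
NODES = 2

def hash_partition(rows, nodes):
    parts = [[] for _ in range(nodes)]
    for key, val in rows:
        parts[hash(key) % nodes].append((key, val))   # same key -> same partition
    return parts

def round_robin(rows, nodes):
    parts = [[] for _ in range(nodes)]
    for i, row in enumerate(rows):
        parts[i % nodes].append(row)                  # key is ignored
    return parts

def join_per_partition(lparts, rparts):
    out = []
    for lp, rp in zip(lparts, rparts):                # each "node" sees only its own partition
        rindex = dict(rp)
        out += [(k, v, rindex[k]) for k, v in lp if k in rindex]
    return out

print(join_per_partition(hash_partition(left, NODES), hash_partition(right, NODES)))  # all 4 matches found
print(join_per_partition(round_robin(left, NODES), round_robin(right, NODES)))        # matches lost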

Now, I see more and more jobs using Auto partitioning (left as the default) in Join stages, Aggregator stages, etc.

I just called IBM Tech Support; they say it depends, but I am not sure what it depends on. IBM support also said the Join stage actually runs on a single node regardless of what you do.

Can anyone help me clarify these questions?
1) Is key-based partitioning needed in parallel jobs?
2) Does the Join stage run on a single node?
sendmkpk
Premium Member
Posts: 97
Joined: Mon Apr 02, 2007 2:47 am

Post by sendmkpk »

Please search the forum; this has been covered extensively.
Praveen
jerome_rajan
Premium Member
Posts: 376
Joined: Sat Jan 07, 2012 12:25 pm
Location: Piscataway

Post by jerome_rajan »

1. Key-based partitioning is needed.
2. It will run on a single node if 'YOU' want it to; otherwise it can run on multiple nodes.
Jerome
Data Integration Consultant at AWS
Connect With Me On LinkedIn

Life is really simple, but we insist on making it complicated.
ray.wurlod
Participant
Posts: 54607
Joined: Wed Oct 23, 2002 10:52 pm
Location: Sydney, Australia

Post by ray.wurlod »

Key-based partitioning IS needed in parallel jobs. A key-based partitioning algorithm will be selected under appropriate circumstances when (Auto) is left in place. This will always work but may be sub-optimal (for example, key-based algorithms other than Hash exist).
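For anyone wondering what "other key-based algorithms" means, here is a rough plain-Python illustration (not DataStage internals) of two of them: Hash, which works on keys of any type, and Modulus, which applies only to a single integer key. Both are key-based in the sense that equal key values always land in the same partition.

Code:

# Rough illustration (plain Python, not DataStage internals) of two key-based
# partitioning methods. Both send equal key values to the same partition; they
# just compute the partition number differently.
NODES = 4

def hash_part(key):
    return hash(key) % NODES        # works for keys of any type

def modulus_part(key):
    return key % NODES              # only meaningful for a single integer key

for key in (101, 102, 103, 104, 105):
    print(key, "hash ->", hash_part(key), "modulus ->", modulus_part(key))

for key in ("NSW", "VIC", "QLD"):
    print(key, "hash ->", hash_part(key))   # Modulus is not applicable to string keys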

The nodes on which a Join stage runs are determined first by the number of nodes in the configuration file and subsequently by certain stage properties, particularly the execution mode, but also by whether it is constrained to execute in a node pool.
IBM Software Services Group
Any contribution to this forum is my own opinion and does not necessarily reflect any position that IBM may hold.
xinhuang66
Participant
Posts: 161
Joined: Wed Aug 02, 2006 4:30 am

Post by xinhuang66 »

ray.wurlod wrote:Key-based partitioning IS needed in parallel jobs. A key-based partitioning algorithm will be selected under appropriate circumstances when (Auto) is left in place. This will always work but may be ...
To me this means that, in general, explicit key-based partitioning is not needed, since the parallel job will work anyway, with perhaps some sacrifice in performance. Explicit key-based partitioning is only needed for performance reasons... Correct me if I am wrong.

By the way, I have read most of the posts on this subject; however, no post clearly discusses whether it is needed or not...

I saw some posts saying their Aggregator generated duplicates while using Auto partitioning, and key-based partitioning fixed the issue. However, based on what Ray said and on what I see in many blue-chip companies' warehouse systems, an Aggregator with Auto works.
jwiles
Premium Member
Posts: 1274
Joined: Sun Nov 14, 2004 8:50 pm

Post by jwiles »

You are reasonably correct. Auto partitioning will guarantee that the partitioning method chosen (if necessary) meets the needs of the stage requiring said partitioning. As part of job performance tuning, your job may be able to take advantage of specific key partitioning techniques (which auto partitioning won't provide) to increase the throughput of the job. Or, your business rules may have implemented some logic for which auto partitioning may create incorrect results. In both cases, you would typically partition explicitly on fewer columns than auto partitioning would. These are more advanced development topics which are discussed in IBM's Redbooks on parallel job design and performance tuning.

Proper data partitioning has essentially two goals: 1) Meeting the needs of the implemented business rules (most important) while 2) Distributing data as evenly/efficiently as possible to reach goal #1. Auto partitioning works great for probably 80-90% of jobs.
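As a rough plain-Python sketch of goal #2 (illustrative only, with made-up column names and data): hashing on a badly chosen, skewed key can leave most of the data on one node even though the results are still correct, which is one of the things you look at when tuning.

Code:

# Sketch (plain Python, made-up data) of goal #2: even data distribution.
# Hashing on a skewed, low-cardinality key piles most rows onto one partition;
# hashing on a near-unique key spreads them evenly.
from collections import Counter
import random

NODES = 4
random.seed(0)
rows = [{"customer_id": i,
         "country": "US" if random.random() < 0.9 else random.choice(["CA", "UK", "AU"])}
        for i in range(100_000)]

def partition_sizes(rows, key):
    return Counter(hash(r[key]) % NODES for r in rows)

print("hash on country:    ", partition_sizes(rows, "country"))      # one partition holds ~90% of the rows
print("hash on customer_id:", partition_sizes(rows, "customer_id"))  # roughly even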

The Aggregator typically produces duplicates when the developer does not understand the partitioning requirements of the aggregation, and/or does not have a clear understanding of what the partitioning is doing and has been instructed to use Same partitioning (I have seen this before :( ).

Regards,
- james wiles


All generalizations are false, including this one - Mark Twain.
chandra.shekhar@tcs.com
Premium Member
Posts: 353
Joined: Mon Jan 17, 2011 5:03 am
Location: Mumbai, India

Post by chandra.shekhar@tcs.com »

This topic has been discussed so many times, and as xinhuang66 said, it is still unclear to me too.
I'll take my project as an example: we have developed more than 1000 jobs here, which contain at least Joins or Aggregators, etc.
For these stages, we have never used explicit partitioning methods.
We have used Auto partitioning in more than 99.9% of cases, and we have never got any unexpected/incorrect results.
The Aggregator typically produces duplicates when the ...
Even in Aggregators, we have never found any duplicate issue.

To my knowledge, key-based partitioning is good to use but not mandatory for Joins and Aggregators (please correct me if I am wrong).

So I request all the DataStage gurus here to please clear up the confusion.
Thanx and Regards,
ETL User
ray.wurlod
Participant
Posts: 54607
Joined: Wed Oct 23, 2002 10:52 pm
Location: Sydney, Australia

Post by ray.wurlod »

It's mandatory for quite a few stages, including Join. Aggregator is not one of them; however, incorrect results will be obtained if you aggregate incorrectly partitioned data. A deliberately constructed bad example produced a report on four nodes showing all "200 states" of the USA (each of the four nodes reported its own subtotal for all 50 states)!
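A rough plain-Python sketch of that deliberately bad example (illustrative only, not DataStage code): when rows for the same state are spread across partitions and each node aggregates only what it holds, every node emits its own subtotal for every state.

Code:

# Sketch (plain Python) of the "200 states" effect: round-robin partitioning
# followed by a per-partition aggregation emits one group per state per node.
from collections import Counter

NODES = 4
states = ["state_%02d" % n for n in range(50)]
rows = [s for s in states for _ in range(100)]            # 100 made-up fact rows per state

# Round-robin: rows for the same state end up on every node.
round_robin = [rows[i::NODES] for i in range(NODES)]
print(sum(len(Counter(part)) for part in round_robin))    # 200 -> 50 states duplicated on 4 nodes

# Hash (key-based) partitioning keeps each state on exactly one node.
hashed = [[s for s in rows if hash(s) % NODES == n] for n in range(NODES)]
print(sum(len(Counter(part)) for part in hashed))         # 50 -> one group per state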

As noted earlier, (Auto) will always produce correct results, perhaps just not quite as efficiently as a properly partitioned design.
IBM Software Services Group
Any contribution to this forum is my own opinion and does not necessarily reflect any position that IBM may hold.
SURA
Premium Member
Posts: 1229
Joined: Sat Jul 14, 2007 5:16 am
Location: Sydney

Post by SURA »

ray.wurlod wrote:It's mandatory for quite a few stages, including Join
Hi Ray

For a Join (with massive volume), if the user chooses Auto partitioning, what will be the outcome?


Hi Chandra

Consider Ray's comment. We had the same discussion a long time back (here is the link: viewtopic.php?t=148321&highlight=Auto+partition).

See your own comment there!
Thanks
Ram
----------------------------------
Revealing your ignorance is fine, because you get a chance to learn.
SURA
Premium Member
Posts: 1229
Joined: Sat Jul 14, 2007 5:16 am
Location: Sydney

Post by SURA »

xinhuang66 wrote:Explicit key-based partitioning is only needed for performance reasons... Correct me if I am wrong.
If you decide not to use Auto for even a single stage, then plan accordingly and apply your own partitioning approach from start to end; using Same wherever you can may increase performance.
Thanks
Ram
----------------------------------
Revealing your ignorance is fine, because you get a chance to learn.
chandra.shekhar@tcs.com
Premium Member
Posts: 353
Joined: Mon Jan 17, 2011 5:03 am
Location: Mumbai, India

Post by chandra.shekhar@tcs.com »

@Sura,
That's what I said earlier, and I am still saying the same :(
It is not mandatory to use explicit partitioning methods, and there's no harm in doing so either. It all depends upon the developer and the requirement.
ray.wurlod wrote:
It's mandatory for quite a few stages, including Join
Regarding the above comment and your query: I said it earlier and I am saying the same again.
We have some jobs where we join at least 1 billion (100 crore) records in one join or multiple joins. Still, we have never got incorrect output. We have always used Auto partitioning and never used explicit partitioning for joins.

(Can't see Ray's comment as it's premium content) :(
Thanx and Regards,
ETL User
srini.dw
Premium Member
Posts: 186
Joined: Fri Aug 18, 2006 1:59 am
Location: Chennai

Post by srini.dw »

1) Is key-based partitioning needed in parallel jobs?
Keyed partitioning is used when business rules (for example, Remove Duplicates) or stage requirements (for example, Join) require processing on groups of related records.

2) Does the Join stage run on a single node?
If the configuration file has 2 nodes, how can the Join stage run on a single node?

Thanks,
jwiles
Premium Member
Posts: 1274
Joined: Sun Nov 14, 2004 8:50 pm

Post by jwiles »

To all who are questioning: the "Auto" partitioning method simply instructs the parallel engine (i.e. Orchestrate) to choose an appropriate partitioning method at runtime, provided that Auto partitioning has not been overridden or disabled by "Preserve Partitioning", $APT_NO_PART_INSERTION, and so on. You can examine the job score and OSH in the job log (if you have set the required environment variables to display them) to see what partitioning method has been chosen. Auto partitioning effectively takes the partitioning decision out of the developer's hands and places it in the engine's hands.

@SURA:
For a Join (with massive volume), if the user chooses Auto partitioning, what will be the outcome?
In front of a Join, Auto will choose a keyed partition method (I believe it will always choose Hash) because that's what Join requires (see the product docs). It will partition on all keys defined in the Join options. A sort will be inserted if necessary (if sort insertion hasn't been disabled).
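Conceptually (a rough plain-Python sketch, not the engine's actual implementation), what that amounts to in front of a Join is: hash both inputs on the join key, sort each partition on that key, then merge within each partition.

Code:

# Conceptual sketch (plain Python, not the parallel engine) of what gets
# arranged in front of a Join: hash partition both inputs on the join key,
# sort each partition, then merge-join partition by partition.
NODES = 2

def hash_partition(rows, nodes):
    parts = [[] for _ in range(nodes)]
    for row in rows:
        parts[hash(row[0]) % nodes].append(row)       # row[0] is the join key
    return parts

def merge_join(lp, rp):
    # Merge join over two partitions already sorted on the key (unique keys assumed).
    out, i, j = [], 0, 0
    while i < len(lp) and j < len(rp):
        if lp[i][0] == rp[j][0]:
            out.append((lp[i][0], lp[i][1], rp[j][1]))
            i += 1; j += 1
        elif lp[i][0] < rp[j][0]:
            i += 1
        else:
            j += 1
    return out

def parallel_join(left, right, nodes=NODES):
    lparts = [sorted(p) for p in hash_partition(left, nodes)]   # the inserted sort
    rparts = [sorted(p) for p in hash_partition(right, nodes)]
    return [row for lp, rp in zip(lparts, rparts) for row in merge_join(lp, rp)]

print(parallel_join([(1, "a"), (2, "b"), (3, "c")], [(2, "x"), (3, "y"), (4, "z")]))
# [(2, 'b', 'x'), (3, 'c', 'y')] -- correct however the input rows were originally ordered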

@srini.dw:
If the configuration file has 2 nodes, how can the Join stage run on a single node?
Any parallel stage can be forced to run on a single node, regardless of the number of nodes defined in the configuration file, by setting its Execution Mode to Sequential or by using node constraints (very rare). Documented in the product docs.

@chandra:
We have used Auto partitioning in more than 99.9% of cases, and we have never got any unexpected/incorrect results
Great! That means that you are using auto partitioning--and the options that affect it--correctly.
Key-based partitioning is good to use but not mandatory for Joins and Aggregators
Incorrect! It is required for Join, Merge, Slowly Changing Dimension and maybe a handful of others, as is documented. Aggregator doesn't require keyed partitioning to run, but if you're using the Sort option in Aggregator and Auto partitioning, I bet that most, if not all, of the time Hash partitioning (i.e. keyed) is being chosen for you :)

Remember that even though the JOB is running in parallel, the individual instances of the operators (one on each node) are each operating sequentially on their own stream of data. That is the essence of partition parallelism... data divided (partitioned) into multiple streams being processed concurrently.
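As a tiny plain-Python sketch of that idea (illustrative only): the data set is divided into partitions and each "operator instance" works through its own partition sequentially, with the instances running concurrently.

Code:

# Tiny sketch (plain Python) of partition parallelism: one stream of data per
# "node", each processed sequentially by its own operator instance, with the
# instances running concurrently.
from concurrent.futures import ThreadPoolExecutor

NODES = 4
rows = list(range(1_000))
partitions = [rows[i::NODES] for i in range(NODES)]    # one partition per "node"

def operator_instance(partition):
    # Each instance sees only its own partition and works through it in order.
    return sum(row * 2 for row in partition)

with ThreadPoolExecutor(max_workers=NODES) as pool:
    partial_results = list(pool.map(operator_instance, partitions))

print(partial_results, sum(partial_results))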

End of line.

Regards,
- james wiles


All generalizations are false, including this one - Mark Twain.
xinhuang66
Participant
Posts: 161
Joined: Wed Aug 02, 2006 4:30 am

Post by xinhuang66 »

See, I have read all the posts here and am still confused, even after calling IBM.

1) The claim that the Join stage actually runs on a single node came from the IBM support guy; he mentioned this is a bug and also suggested I use Merge instead of Join.

2) If the Aggregator doesn't need explicit key-based partitioning, I don't understand why the Remove Duplicates and Join stages need it. DataStage could do the same for all of them: basically, read the keys specified in the stage properties, then hash partition, since the developer's choice is Auto (or whatever).

3) If explicit key-based partitioning is mandatory for the Join stage, how do you explain Chandra's case? There are 1000+ jobs, with large data volumes, all using Auto without any problem. Does any super guru here want to do an ETL health check for them?

4) In my experience, I have seen a few large warehouses whose DataStage jobs use Auto for almost everything, at least always for Join. They have run like this for many years. Is that wrong??

By the way, can someone come up with a simple example of two flat files joined with Auto partitioning that generates wrong results??
chandra.shekhar@tcs.com
Premium Member
Posts: 353
Joined: Mon Jan 17, 2011 5:03 am
Location: Mumbai, India

Post by chandra.shekhar@tcs.com »

@James,
That's what I am saying: if I use the Auto partitioning method in Join, Merge, etc. stages, then DataStage by itself will use Hash partitioning.
That's why we don't have to explicitly use the Hash partitioning method.
The same is documented too; pasting here a part of the documentation:
The data sets input to the Join stage must be key partitioned and sorted in ascending order. This ensures that rows with the same key column values are located in the same partition and will be processed by the same node. It also minimizes memory requirements because fewer rows need to be in memory at any one time. Choosing the auto partitioning method will ensure that partitioning and sorting is done.

For stages like Join, Merge, Change Capture, Remove Duplicates, etc., data needs to be partitioned and sorted, and if we use Auto partitioning then by default the above will happen (unless it is overridden in the job or disabled at the project level, as you said).
Thanx and Regards,
ETL User