Partitioning
Moderators: chulett, rschirm, roy
-
- Participant
- Posts: 161
- Joined: Wed Aug 02, 2006 4:30 am
Partitioning
I have recently become very confused about partitioning in DataStage parallel jobs.
Previously, I believed key-based partitioning was always needed in the Join, Remove Duplicates, Aggregator, Merge stages, etc. if the job runs on multiple nodes. The reason is simply that key-based partitioning is the way to guarantee that rows with the same key values land on the same node for processing; otherwise the job will generate wrong results. This seemed to hold true in my earlier tests.
Now I see more and more jobs using Auto partitioning (left as the default) in Join stages, Aggregator stages, etc.
I just called IBM Tech Support; they say "it depends", but are not sure what it depends on. They also said the Join stage actually runs on a single node regardless of what you do.
Can anyone help me clarify these questions?
1) Is key-based partitioning needed in parallel jobs?
2) Does the Join stage run on a single node?
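The routing argument in the question can be sketched in plain Python (a toy model of hash partitioning, not DataStage code; all names here are invented for illustration):

```python
# Toy model of key-based (hash) partitioning: every row with the same
# key value is routed to the same partition, so grouping performed
# independently on each partition remains correct.

def hash_partition(rows, key, num_nodes):
    """Route each row to a partition based on the hash of its key."""
    partitions = [[] for _ in range(num_nodes)]
    for row in rows:
        partitions[hash(row[key]) % num_nodes].append(row)
    return partitions

rows = [{"cust": c, "amt": a}
        for c, a in [("A", 10), ("B", 5), ("A", 7), ("C", 2), ("B", 1)]]

parts = hash_partition(rows, "cust", num_nodes=2)

# Every customer's rows land in exactly one partition, so a per-node
# aggregation would produce exactly one total per customer.
for cust in ("A", "B", "C"):
    homes = {i for i, p in enumerate(parts) for r in p if r["cust"] == cust}
    assert len(homes) == 1
```

With a non-keyed method (round robin, for instance) rows for one customer could land on several nodes, which is the "wrong result" case described above.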
-
- Premium Member
- Posts: 376
- Joined: Sat Jan 07, 2012 12:25 pm
- Location: Piscataway
1. Key-based partitioning is needed.
2. It will run on a single node if 'YOU' want it to; otherwise it can run on multiple nodes.
Jerome
Data Integration Consultant at AWS
Connect With Me On LinkedIn
Life is really simple, but we insist on making it complicated.
-
- Participant
- Posts: 54607
- Joined: Wed Oct 23, 2002 10:52 pm
- Location: Sydney, Australia
- Contact:
Key-based partitioning IS needed in parallel jobs. A key-based partitioning algorithm will be selected under appropriate circumstances when (Auto) is left in place. This will always work but may be sub-optimal (for example, key-based algorithms other than Hash exist).
The nodes on which a Join stage runs are determined first by the number of nodes in the configuration file, and subsequently by certain stage properties: particularly the execution mode, but also whether it is constrained to execute in a node pool.
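For context, the node count Ray refers to comes from the configuration file named by $APT_CONFIG_FILE. A minimal two-node file looks roughly like this (the fastname, paths, and pool names are placeholders, not taken from this thread):

```
{
  node "node1"
  {
    fastname "etl_server"
    pools ""
    resource disk "/data/datasets" {pools ""}
    resource scratchdisk "/data/scratch" {pools ""}
  }
  node "node2"
  {
    fastname "etl_server"
    pools ""
    resource disk "/data/datasets" {pools ""}
    resource scratchdisk "/data/scratch" {pools ""}
  }
}
```

Constraining a stage to a named node pool, or setting its execution mode to Sequential, is what pins it to fewer nodes than the file defines.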
IBM Software Services Group
Any contribution to this forum is my own opinion and does not necessarily reflect any position that IBM may hold.
-
- Participant
- Posts: 161
- Joined: Wed Aug 02, 2006 4:30 am
ray.wurlod wrote: Key-based partitioning IS needed in parallel jobs. A key-based partitioning algorithm will be selected under appropriate circumstances when (Auto) is left in place. This will always work but may be ...
This tells me that, in general, explicit key-based partitioning is not needed, since the parallel job will work anyway, perhaps with some performance sacrifice; explicit key-based partitioning is only needed for performance reasons. Correct me if I am wrong.
By the way, I have read most of the posts on this topic; however, no post clearly discusses whether it is needed or not.
I saw some posts saying their Aggregator generated duplicates while using Auto partitioning, and that key-based partitioning fixed the issue. However, based on what Ray said, and on what I see in many blue-chip companies' warehouse systems, the Aggregator works with Auto.
You are reasonably correct. Auto partitioning guarantees that the partitioning method chosen (if necessary) will meet the needs of the stage requiring that partitioning. As part of job performance tuning, your job may be able to take advantage of specific key partitioning techniques (which auto partitioning won't provide) to increase its throughput. Or your business rules may have implemented some logic for which auto partitioning creates incorrect results. In both cases, you would typically partition explicitly on fewer columns than auto partitioning would. These are more advanced development topics, discussed in IBM's Redbooks on parallel job design and performance tuning.
Proper data partitioning has essentially two goals: 1) Meeting the needs of the implemented business rules (most important) while 2) Distributing data as evenly/efficiently as possible to reach goal #1. Auto partitioning works great for probably 80-90% of jobs.
Aggregator produces duplicates typically when the developer does not understand the partitioning requirements of the aggregations and/or does not have a clear understanding of what partitioning is doing and has been instructed to use same partitioning (I have seen this before ).
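James's point about mis-partitioned aggregation can be demonstrated with a toy Python model (not DataStage code; the helpers are invented): per-partition aggregation over round-robin data splits one key's rows across nodes, which is exactly the "Aggregator produces duplicates" symptom, while keyed (hash) partitioning avoids it.

```python
# Toy illustration: each "node" aggregates only the rows it holds,
# so the partitioning method decides whether group totals are correct.
from collections import Counter

rows = [("A", 10), ("A", 7), ("B", 5), ("A", 3), ("B", 1)]

def round_robin(rows, n):
    return [rows[i::n] for i in range(n)]

def hash_part(rows, n):
    parts = [[] for _ in range(n)]
    for k, v in rows:
        parts[hash(k) % n].append((k, v))
    return parts

def aggregate(parts):
    out = []
    for p in parts:                 # one independent Counter per "node"
        totals = Counter()
        for k, v in p:
            totals[k] += v
        out.extend(totals.items())
    return out

bad = aggregate(round_robin(rows, 2))   # "A" rows split across partitions
good = aggregate(hash_part(rows, 2))    # each key lives on one partition

keys_bad = [k for k, _ in bad]
keys_good = [k for k, _ in good]
assert len(keys_bad) != len(set(keys_bad))    # duplicate group keys
assert len(keys_good) == len(set(keys_good))  # one output row per key
```

The totals themselves are preserved either way; what the wrong partitioning breaks is the "one output row per group" expectation.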
Regards,
- james wiles
All generalizations are false, including this one - Mark Twain.
-
- Premium Member
- Posts: 353
- Joined: Mon Jan 17, 2011 5:03 am
- Location: Mumbai, India
This topic has been discussed so many times, and as xinhuang66 said, it is still unclear to me too.
Take my project as an example: we have developed more than 1000 jobs here, which contain at least Joins or Aggregators, etc.
For these stages, we have never used explicit partitioning methods.
We have used Auto partitioning in more than 99.9% of cases, and we never got any unexpected/incorrect results.
To my knowledge, key-based partitioning is good to use, but it is not mandatory for Joins and Aggregators. (Please correct me if I am wrong.)
So I request all the DataStage gurus here: please clear up the confusion.
james wiles wrote: Aggregator produces duplicates typically when the ...
Even in Aggregators, we have never found any duplicate issue.
Thanx and Regards,
ETL User
-
- Participant
- Posts: 54607
- Joined: Wed Oct 23, 2002 10:52 pm
- Location: Sydney, Australia
- Contact:
It's mandatory for quite a few stages, including Join. Aggregator is not one of them; however, incorrect results will obtain if you aggregate on incorrectly partitioned data. A deliberately constructed bad example produced a report on four nodes showing all 200 states of the USA!
As noted earlier, (Auto) will always produce correct results. Maybe just not quite as efficiently as a properly partitioned design.
IBM Software Services Group
Any contribution to this forum is my own opinion and does not necessarily reflect any position that IBM may hold.
ray.wurlod wrote: It's mandatory for quite a few stages, including Join
Hi Ray,
For a Join with massive volume, if the user chooses Auto partitioning, what will be the outcome?
Hi Chandra,
Consider Ray's comment. We had the same discussion a long time back (here is the link: viewtopic.php?t=148321&highlight=Auto+partition).
See your comment!
Thanks
Ram
----------------------------------
Revealing your ignorance is fine, because you get a chance to learn.
xinhuang66 wrote: Explicit key based partitioning is only needed for better performance purpose... Correct me if I am wrong.
If you decide not to use Auto even for a single stage, then plan it accordingly and apply your own partitioning approach from start to end; using the same partitioning wherever you can may increase performance.
Thanks
Ram
----------------------------------
Revealing your ignorance is fine, because you get a chance to learn.
-
- Premium Member
- Posts: 353
- Joined: Mon Jan 17, 2011 5:03 am
- Location: Mumbai, India
@Sura,
That's what I said earlier, and I am still saying the same.
It is not mandatory to use explicit partitioning methods, and there's no harm in doing so either; it all depends on the developer and the requirement.
We have some jobs where we join at least 1 billion (100 crore) records in one join or multiple joins. Still we have never gotten incorrect output. We have always used Auto partitioning and never used explicit partitioning for joins.
(Can't see Ray's comment as it's premium content)
ray.wurlod wrote: It's mandatory for quite a few stages, including Join
For the above comment and for your query: I said it earlier, and I am saying the same again.
Thanx and Regards,
ETL User
1) Is key-based partitioning needed in parallel jobs?
Keyed partitioning is used when business rules (for example, Remove Duplicates) or stage requirements (for example, Join) require processing on groups of related records.
2) Does the Join stage run on a single node?
If the configuration file has 2 nodes, how can the Join stage run on a single node?
Thanks,
To all who are questioning: the "Auto" partitioning method simply instructs the parallel engine (i.e. Orchestrate) to choose an appropriate partition method at runtime, provided that Auto partitioning has not been overridden or disabled by "Preserve Partitioning", $APT_NO_PART_INSERTION, and so on. You can examine the Job Score and OSH in the job log (if you have set the required environment variables to display them) to see what partition method has been chosen. Auto partitioning effectively takes the partitioning decision out of the developer's hands and places it in the engine's.
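The Job Score and OSH mentioned here are exposed through reporting environment variables. A hedged sketch of the commonly cited ones (check the docs for your version; behavior can vary):

```shell
# Reporting/diagnostic variables, typically set as job or project
# parameters rather than in a shell; shown here as exports for brevity.
export APT_DUMP_SCORE=1         # write the job score (operators, nodes,
                                # inserted tsorts/partitioners) to the log
export OSH_DUMP=1               # echo the generated OSH script to the log
export APT_NO_PART_INSERTION=1  # diagnostic: suppress automatic
                                # partitioner insertion
export APT_NO_SORT_INSERTION=1  # diagnostic: suppress automatic sorts
```

The last two are the "overridden or disabled" switches referred to above; leaving them unset keeps Auto free to insert what the stages need.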
@SURA:
SURA wrote: For join (massive volume), if the user chooses auto partition, what will be the outcome?
In front of a Join, Auto will choose a keyed partition method (I believe it will always choose Hash) because that's what Join requires (see the product docs). It will partition on all keys defined in the Join options. A sort will be inserted if necessary (if sort insertion hasn't been disabled).
@srini.dw:
srini.dw wrote: If the configuration file has 2 nodes, how can the Join stage run on a single node?
Any parallel stage can be forced to run on a single node, regardless of the number of nodes defined in the configuration file, by setting its Execution Mode to Sequential or by using node constraints (very rare). Documented in the product docs.
@chandra:
chandra wrote: We have used Auto partitioning for more than 99.9% cases. And we never got any unexpected/incorrect results
Great! That means you are using Auto partitioning, and the options that affect it, correctly.
chandra wrote: Key based partitioning is good to use but it is not mandatory for joins and aggregators
Incorrect! It is required for Join, Merge, Slowly Changing Dimension, and maybe a handful of others, as documented. The Aggregator doesn't require keyed partitioning to run, but if you're using the Sort option in the Aggregator with Auto partitioning, I bet that most, if not all, of the time Hash partitioning (i.e. keyed) is being chosen for you.
Remember that even though the JOB is running in parallel, the individual instances of the operators (one in each node) are operating sequentially on a sequential stream of data (one in each node). That is the essence of partition parallelism...data divided (partitioned) into multiple streams being processed concurrently.
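That "parallel job, sequential operators" model can be sketched in plain Python (a toy, not the engine): each worker processes its own partition as an ordinary sequential stream, and the partitions run concurrently.

```python
# Toy model of partition parallelism: N workers, each consuming its own
# sequential stream of rows; parallelism comes only from having N streams.
from concurrent.futures import ThreadPoolExecutor

def node_work(partition):
    # One "node": a plain sequential pass over its own rows.
    return sum(partition)

data = list(range(100))
num_nodes = 4
partitions = [data[i::num_nodes] for i in range(num_nodes)]  # round robin

with ThreadPoolExecutor(max_workers=num_nodes) as pool:
    results = list(pool.map(node_work, partitions))

assert sum(results) == sum(data)  # partitioning preserved every row
```

Round robin is fine here because summing everything at the end needs no key grouping; the moment the per-node work is key-sensitive, the partitioning method starts to matter.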
End of line.
Regards,
- james wiles
All generalizations are false, including this one - Mark Twain.
-
- Participant
- Posts: 161
- Joined: Wed Aug 02, 2006 4:30 am
See, I have read all the posts here and am still confused, even after I called IBM.
1) The claim that the JOIN stage actually runs on a single node came from the IBM support guy; he mentioned this is a bug and also suggested I use Merge instead of Join.
2) If the Aggregator doesn't need explicit key-based partitioning, I don't understand why the Remove Duplicates and Join stages need it. DS could do the same for all of them: read the keys specified in the stage properties, then hash partition, since the developer's choice is Auto (whatever).
3) If explicit key-based partitioning is mandatory for the Join stage, how do we explain Chandra's case? There are 1000+ jobs with large data volumes, all using Auto without any problem. Does any super guru here want to do an ETL health check for them?
4) My experience is that I have seen a few large warehouses whose DataStage uses Auto for almost everything, at least always for Join. They have run this way for many years. Is it wrong?
By the way, can someone come up with a simple example where two flat files joined with Auto partitioning generate wrong results?
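No one in the thread posted the requested failing example, but the failure mode is easy to simulate in plain Python (a toy per-partition join, not DataStage; a deterministic stand-in hash is used because Python's built-in hash() is salted per process): if the two inputs are partitioned on different bases, matching keys land on different partitions and the join silently drops them.

```python
# Toy model: a "join" executed independently on each pair of partitions.
# If both inputs are keyed-partitioned the same way, all matches are
# found; if one side is round-robin partitioned, matches are lost.

def key_hash(k):
    return sum(ord(c) for c in k)  # deterministic stand-in for hash()

def hash_part(rows, n):
    parts = [[] for _ in range(n)]
    for row in rows:
        parts[key_hash(row[0]) % n].append(row)
    return parts

def round_robin(rows, n):
    return [rows[i::n] for i in range(n)]

def per_partition_join(left_parts, right_parts):
    out = []
    for lp, rp in zip(left_parts, right_parts):
        rindex = dict(rp)           # this "node" sees only its own rows
        out.extend((k, v, rindex[k]) for k, v in lp if k in rindex)
    return out

left = [("A", 1), ("B", 2), ("C", 3), ("D", 4)]
right = [("A", "x"), ("B", "y"), ("C", "z"), ("D", "w")]

good = per_partition_join(hash_part(left, 2), hash_part(right, 2))
bad = per_partition_join(hash_part(left, 2), round_robin(right, 2))

assert len(good) == 4   # identically keyed inputs: every key matches
assert len(bad) == 0    # mismatched partitioning: every match is lost
```

With Auto on both Join inputs, the engine inserts matching keyed partitioners (and sorts) on both links, which is presumably why the Auto-partitioned Joins described in this thread return correct results.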
-
- Premium Member
- Posts: 353
- Joined: Mon Jan 17, 2011 5:03 am
- Location: Mumbai, India
@James,
That's what I am saying: if I use the Auto partitioning method in Join, Merge, etc. stages, then DataStage by itself will use Hash partitioning.
That's why we don't have to explicitly use the Hash partitioning method.
The same is documented too; pasting here a part of the document:
The data sets input to the Join stage must be key partitioned and sorted in ascending order. This ensures that rows with the same key column values are located in the same partition and will be processed by the same node. It also minimizes memory requirements because fewer rows need to be in memory at any one time. Choosing the auto partitioning method will ensure that partitioning and sorting is done.
For stages like Join, Merge, Change Capture, Remove Duplicates, etc., data needs to be partitioned and sorted, and if we use Auto partitioning then by default this will happen (unless it is overridden in the job or disabled at the project level, as you said).
Thanx and Regards,
ETL User