Partitioning

Post questions here relative to DataStage Enterprise/PX Edition for such areas as Parallel job design, Parallel datasets, BuildOps, Wrappers, etc.

Moderators: chulett, rschirm, roy

xinhuang66
Participant
Posts: 161
Joined: Wed Aug 02, 2006 4:30 am

Partitioning

Post by xinhuang66 »

I have recently become very confused about partitioning in DS parallel jobs.

Previously, I felt key-based partitioning was always needed in Join, Remove Duplicates, Aggregator, Merge stages, etc. if the job is running on multiple nodes. The reason is simply that key-based partitioning is the way to guarantee that rows with the same key values fall into the same node for processing; otherwise the job will generate wrong results. That seemed to hold true in my earlier tests.
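To make this concrete, here is a minimal sketch in plain Python (made-up data and node count, not DataStage code) of what I mean: when each node joins only its own partition, round-robin partitioning can split matching keys across nodes and lose matches, while key-based (hash) partitioning keeps them together.

Code:

# Minimal sketch (plain Python, not DataStage): why a per-partition join needs
# key-based partitioning. Data, key values and node count are made up.
left  = [(1, "L1"), (2, "L2"), (3, "L3"), (4, "L4")]
right = [(4, "R4"), (3, "R3"), (2, "R2"), (1, "R1")]
NODES = 2

def hash_partition(rows, nodes):
    parts = [[] for _ in range(nodes)]
    for key, val in rows:
        parts[hash(key) % nodes].append((key, val))   # same key -> same partition
    return parts

def round_robin(rows, nodes):
    parts = [[] for _ in range(nodes)]
    for i, row in enumerate(rows):
        parts[i % nodes].append(row)                  # key is ignored
    return parts

def join_per_partition(lparts, rparts):
    out = []
    for lp, rp in zip(lparts, rparts):                # each "node" sees only its own partition
        rindex = dict(rp)
        out += [(k, v, rindex[k]) for k, v in lp if k in rindex]
    return out

print(join_per_partition(hash_partition(left, NODES), hash_partition(right, NODES)))  # all 4 matches found
print(join_per_partition(round_robin(left, NODES), round_robin(right, NODES)))        # matches lost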

Now, I see more and more jobs using Auto partitioning (left as the default) in Join stages, Aggregator stages, etc.

I just called IBM Tech Support; they say it depends, but I am not sure what it depends on. IBM support also said the Join stage actually runs on a single node regardless of what you do.

Can anyone help me clarify these questions?
1) Is key-based partitioning needed in parallel jobs?
2) Does the Join stage run on a single node?
sendmkpk
Premium Member
Posts: 97
Joined: Mon Apr 02, 2007 2:47 am

Post by sendmkpk »

Please search the forum; this has been covered extensively.
Praveen
jerome_rajan
Premium Member
Posts: 376
Joined: Sat Jan 07, 2012 12:25 pm
Location: Piscataway

Post by jerome_rajan »

1. Key-based partitioning is needed.
2. It will run on a single node if 'YOU' want it to; otherwise it can run on multiple nodes.
Jerome
Data Integration Consultant at AWS
Connect With Me On LinkedIn

Life is really simple, but we insist on making it complicated.
ray.wurlod
Participant
Posts: 54607
Joined: Wed Oct 23, 2002 10:52 pm
Location: Sydney, Australia

Post by ray.wurlod »

Key-based partitioning IS needed in parallel jobs. A key-based partitioning algorithm will be selected under appropriate circumstances when (Auto) is left in place. This will always work but may be sub-optimal (for example, key-based algorithms other than Hash exist).
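For anyone wondering what "other key-based algorithms" means, here is a rough plain-Python illustration (not DataStage internals) of two of them: Hash, which works on keys of any type, and Modulus, which applies only to a single integer key. Both are key-based in the sense that equal key values always land in the same partition.

Code:

# Rough illustration (plain Python, not DataStage internals) of two key-based
# partitioning methods. Both send equal key values to the same partition; they
# just compute the partition number differently.
NODES = 4

def hash_part(key):
    return hash(key) % NODES        # works for keys of any type

def modulus_part(key):
    return key % NODES              # only meaningful for a single integer key

for key in (101, 102, 103, 104, 105):
    print(key, "hash ->", hash_part(key), "modulus ->", modulus_part(key))

for key in ("NSW", "VIC", "QLD"):
    print(key, "hash ->", hash_part(key))   # Modulus is not applicable to string keys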

The nodes on which a Join stage runs are determined first by the number of nodes in the configuration file and subsequently by certain stage properties, particularly the execution mode, but also by whether it is constrained to execute in a node pool.
IBM Software Services Group
Any contribution to this forum is my own opinion and does not necessarily reflect any position that IBM may hold.
xinhuang66
Participant
Posts: 161
Joined: Wed Aug 02, 2006 4:30 am

Post by xinhuang66 »

ray.wurlod wrote:Key-based partitioning IS needed in parallel jobs. A key-based partitioning algorithm will be selected under appropriate circumstances when (Auto) is left in place. This will always work but may be ...
To me this means that, in general, explicit key-based partitioning is not needed, since the parallel job will work anyway, with perhaps some sacrifice in performance. Explicit key-based partitioning is only needed for performance reasons... Correct me if I am wrong.

By the way, I have read most of the posts on this subject; however, no post clearly discusses whether it is needed or not...

I saw some posts saying their Aggregator generated duplicates while using Auto partitioning, and key-based partitioning fixed the issue. However, based on what Ray said and on what I see in many blue-chip companies' warehouse systems, an Aggregator with Auto works.
jwiles
Premium Member
Posts: 1274
Joined: Sun Nov 14, 2004 8:50 pm

Post by jwiles »

You are reasonably correct. Auto partitioning will guarantee that the partitioning method chosen (if necessary) meets the needs of the stage requiring said partitioning. As part of job performance tuning, your job may be able to take advantage of specific key partitioning techniques (which auto partitioning won't provide) to increase the throughput of the job. Or, your business rules may have implemented some logic for which auto partitioning may create incorrect results. In both cases, you would typically partition explicitly on fewer columns than auto partitioning would. These are more advanced development topics which are discussed in IBM's Redbooks on parallel job design and performance tuning.

Proper data partitioning has essentially two goals: 1) Meeting the needs of the implemented business rules (most important) while 2) Distributing data as evenly/efficiently as possible to reach goal #1. Auto partitioning works great for probably 80-90% of jobs.
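As a rough plain-Python sketch of goal #2 (illustrative only, with made-up column names and data): hashing on a badly chosen, skewed key can leave most of the data on one node even though the results are still correct, which is one of the things you look at when tuning.

Code:

# Sketch (plain Python, made-up data) of goal #2: even data distribution.
# Hashing on a skewed, low-cardinality key piles most rows onto one partition;
# hashing on a near-unique key spreads them evenly.
from collections import Counter
import random

NODES = 4
random.seed(0)
rows = [{"customer_id": i,
         "country": "US" if random.random() < 0.9 else random.choice(["CA", "UK", "AU"])}
        for i in range(100_000)]

def partition_sizes(rows, key):
    return Counter(hash(r[key]) % NODES for r in rows)

print("hash on country:    ", partition_sizes(rows, "country"))      # one partition holds ~90% of the rows
print("hash on customer_id:", partition_sizes(rows, "customer_id"))  # roughly even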

The Aggregator typically produces duplicates when the developer does not understand the partitioning requirements of the aggregation, and/or does not have a clear understanding of what the partitioning is doing and has been instructed to use Same partitioning (I have seen this before :( ).

Regards,
- james wiles


All generalizations are false, including this one - Mark Twain.
chandra.shekhar@tcs.com
Premium Member
Posts: 353
Joined: Mon Jan 17, 2011 5:03 am
Location: Mumbai, India

Post by chandra.shekhar@tcs.com »

This topic has been discussed so many times, and as xinhuang66 said, it is still unclear to me too.
I'll take my project as an example: we have developed more than 1000 jobs here, which contain at least Joins or Aggregators, etc.
For these stages, we have never used explicit partitioning methods.
We have used Auto partitioning in more than 99.9% of cases, and we have never got any unexpected/incorrect results.
The Aggregator typically produces duplicates when the ...
Even in Aggregators, we have never found any duplicate issue.

To my knowledge, key-based partitioning is good to use but not mandatory for Joins and Aggregators (please correct me if I am wrong).

So I request all the DataStage gurus here to please clear up the confusion.
Thanx and Regards,
ETL User
ray.wurlod
Participant
Posts: 54607
Joined: Wed Oct 23, 2002 10:52 pm
Location: Sydney, Australia

Post by ray.wurlod »

It's mandatory for quite a few stages, including Join. Aggregator is not one of them; however, incorrect results will be obtained if you aggregate incorrectly partitioned data. A deliberately constructed bad example produced a report on four nodes showing all "200 states" of the USA (each of the four nodes reported its own subtotal for all 50 states)!
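A rough plain-Python sketch of that deliberately bad example (illustrative only, not DataStage code): when rows for the same state are spread across partitions and each node aggregates only what it holds, every node emits its own subtotal for every state.

Code:

# Sketch (plain Python) of the "200 states" effect: round-robin partitioning
# followed by a per-partition aggregation emits one group per state per node.
from collections import Counter

NODES = 4
states = ["state_%02d" % n for n in range(50)]
rows = [s for s in states for _ in range(100)]            # 100 made-up fact rows per state

# Round-robin: rows for the same state end up on every node.
round_robin = [rows[i::NODES] for i in range(NODES)]
print(sum(len(Counter(part)) for part in round_robin))    # 200 -> 50 states duplicated on 4 nodes

# Hash (key-based) partitioning keeps each state on exactly one node.
hashed = [[s for s in rows if hash(s) % NODES == n] for n in range(NODES)]
print(sum(len(Counter(part)) for part in hashed))         # 50 -> one group per state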

As noted earlier, (Auto) will always produce correct results, perhaps just not quite as efficiently as a properly partitioned design.
IBM Software Services Group
Any contribution to this forum is my own opinion and does not necessarily reflect any position that IBM may hold.
SURA
Premium Member
Posts: 1229
Joined: Sat Jul 14, 2007 5:16 am
Location: Sydney

Post by SURA »

ray.wurlod wrote:It's mandatory for quite a few stages, including Join
Hi Ray

For a Join (with massive volume), if the user chooses Auto partitioning, what will be the outcome?


Hi Chandra

Consider Ray's comment. We had the same discussion a long time back (here is the link: viewtopic.php?t=148321&highlight=Auto+partition).

See your own comment there!
Thanks
Ram
----------------------------------
Revealing your ignorance is fine, because you get a chance to learn.
SURA
Premium Member
Posts: 1229
Joined: Sat Jul 14, 2007 5:16 am
Location: Sydney

Post by SURA »

xinhuang66 wrote:Explicit key-based partitioning is only needed for performance reasons... Correct me if I am wrong.
If you decide not to use Auto for even a single stage, then plan accordingly and apply your own partitioning approach from start to end; using Same wherever you can may increase performance.
Thanks
Ram
----------------------------------
Revealing your ignorance is fine, because you get a chance to learn.
chandra.shekhar@tcs.com
Premium Member
Posts: 353
Joined: Mon Jan 17, 2011 5:03 am
Location: Mumbai, India

Post by chandra.shekhar@tcs.com »

@Sura,
That's what I said earlier, and I am still saying the same :(
It is not mandatory to use explicit partitioning methods, and there's no harm in doing so either. It all depends upon the developer and the requirement.
ray.wurlod wrote:
It's mandatory for quite a few stages, including Join
Regarding the above comment and your query: I said it earlier and I am saying the same again.
We have some jobs where we join at least 1 billion (100 crore) records in one join or multiple joins. Still, we have never got incorrect output. We have always used Auto partitioning and never used explicit partitioning for joins.

(Can't see Ray's comment as it's premium content) :(
Thanx and Regards,
ETL User
srini.dw
Premium Member
Posts: 186
Joined: Fri Aug 18, 2006 1:59 am
Location: Chennai

Post by srini.dw »

1) Is key-based partitioning needed in parallel jobs?
Keyed partitioning is used when business rules (for example, Remove Duplicates) or stage requirements (for example, Join) require processing on groups of related records.

2) Does the Join stage run on a single node?
If the configuration file has 2 nodes, how can the Join stage run on a single node?

Thanks,
jwiles
Premium Member
Posts: 1274
Joined: Sun Nov 14, 2004 8:50 pm

Post by jwiles »

To all who are questioning: the "Auto" partitioning method simply instructs the parallel engine (i.e. Orchestrate) to choose an appropriate partitioning method at runtime, provided that Auto partitioning has not been overridden or disabled by "Preserve Partitioning", $APT_NO_PART_INSERTION, and so on. You can examine the job score and OSH in the job log (if you have set the required environment variables to display them) to see what partitioning method has been chosen. Auto partitioning effectively takes the partitioning decision out of the developer's hands and places it in the engine's hands.

@SURA:
For a Join (with massive volume), if the user chooses Auto partitioning, what will be the outcome?
In front of a Join, Auto will choose a keyed partition method (I believe it will always choose Hash) because that's what Join requires (see the product docs). It will partition on all keys defined in the Join options. A sort will be inserted if necessary (if sort insertion hasn't been disabled).
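Conceptually (a rough plain-Python sketch, not the engine's actual implementation), what that amounts to in front of a Join is: hash both inputs on the join key, sort each partition on that key, then merge within each partition.

Code:

# Conceptual sketch (plain Python, not the parallel engine) of what gets
# arranged in front of a Join: hash partition both inputs on the join key,
# sort each partition, then merge-join partition by partition.
NODES = 2

def hash_partition(rows, nodes):
    parts = [[] for _ in range(nodes)]
    for row in rows:
        parts[hash(row[0]) % nodes].append(row)       # row[0] is the join key
    return parts

def merge_join(lp, rp):
    # Merge join over two partitions already sorted on the key (unique keys assumed).
    out, i, j = [], 0, 0
    while i < len(lp) and j < len(rp):
        if lp[i][0] == rp[j][0]:
            out.append((lp[i][0], lp[i][1], rp[j][1]))
            i += 1; j += 1
        elif lp[i][0] < rp[j][0]:
            i += 1
        else:
            j += 1
    return out

def parallel_join(left, right, nodes=NODES):
    lparts = [sorted(p) for p in hash_partition(left, nodes)]   # the inserted sort
    rparts = [sorted(p) for p in hash_partition(right, nodes)]
    return [row for lp, rp in zip(lparts, rparts) for row in merge_join(lp, rp)]

print(parallel_join([(1, "a"), (2, "b"), (3, "c")], [(2, "x"), (3, "y"), (4, "z")]))
# [(2, 'b', 'x'), (3, 'c', 'y')] -- correct however the input rows were originally ordered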

@srini.dw:
If the configuration file has 2 nodes, how can the Join stage run on a single node?
Any parallel stage can be forced to run on a single node, regardless of the number of nodes defined in the configuration file, by setting its Execution Mode to Sequential or by using node constraints (very rare). Documented in the product docs.

@chandra:
We have used Auto partitioning in more than 99.9% of cases, and we have never got any unexpected/incorrect results
Great! That means that you are using auto partitioning--and the options that affect it--correctly.
Key-based partitioning is good to use but not mandatory for Joins and Aggregators
Incorrect! It is required for Join, Merge, Slowly Changing Dimension and maybe a handful of others, as is documented. Aggregator doesn't require keyed partitioning to run, but if you're using the Sort option in Aggregator and Auto partitioning, I bet that most, if not all, of the time Hash partitioning (i.e. keyed) is being chosen for you :)

Remember that even though the JOB is running in parallel, the individual instances of the operators (one on each node) are each operating sequentially on their own stream of data. That is the essence of partition parallelism... data divided (partitioned) into multiple streams being processed concurrently.
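As a tiny plain-Python sketch of that idea (illustrative only): the data set is divided into partitions and each "operator instance" works through its own partition sequentially, with the instances running concurrently.

Code:

# Tiny sketch (plain Python) of partition parallelism: one stream of data per
# "node", each processed sequentially by its own operator instance, with the
# instances running concurrently.
from concurrent.futures import ThreadPoolExecutor

NODES = 4
rows = list(range(1_000))
partitions = [rows[i::NODES] for i in range(NODES)]    # one partition per "node"

def operator_instance(partition):
    # Each instance sees only its own partition and works through it in order.
    return sum(row * 2 for row in partition)

with ThreadPoolExecutor(max_workers=NODES) as pool:
    partial_results = list(pool.map(operator_instance, partitions))

print(partial_results, sum(partial_results))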

End of line.

Regards,
- james wiles


All generalizations are false, including this one - Mark Twain.
xinhuang66
Participant
Posts: 161
Joined: Wed Aug 02, 2006 4:30 am

Post by xinhuang66 »

See, I have read all the posts here and am still confused, even after calling IBM.

1) The claim that the Join stage actually runs on a single node came from the IBM support guy; he mentioned this is a bug and also suggested I use Merge instead of Join.

2) If the Aggregator doesn't need explicit key-based partitioning, I don't understand why the Remove Duplicates and Join stages need it. DataStage could do the same for all of them: basically, read the keys specified in the stage properties, then hash partition, since the developer's choice is Auto (or whatever).

3) If explicit key-based partitioning is mandatory for the Join stage, how do you explain Chandra's case? There are 1000+ jobs, with large data volumes, all using Auto without any problem. Does any super guru here want to do an ETL health check for them?

4) In my experience, I have seen a few large warehouses whose DataStage jobs use Auto for almost everything, at least always for Join. They have run like this for many years. Is that wrong??

By the way, can someone come up with a simple example of two flat files joined with Auto partitioning that generates wrong results??
chandra.shekhar@tcs.com
Premium Member
Posts: 353
Joined: Mon Jan 17, 2011 5:03 am
Location: Mumbai, India

Post by chandra.shekhar@tcs.com »

@James,
That's what I am saying: if I use the Auto partitioning method in Join, Merge, etc. stages, then DataStage by itself will use Hash partitioning.
That's why we don't have to explicitly use the Hash partitioning method.
The same is documented too; pasting here a part of the documentation:
The data sets input to the Join stage must be key partitioned and sorted in ascending order. This ensures that rows with the same key column values are located in the same partition and will be processed by the same node. It also minimizes memory requirements because fewer rows need to be in memory at any one time. Choosing the auto partitioning method will ensure that partitioning and sorting is done.

For stages like Join, Merge, Change Capture, Remove Duplicates, etc., data needs to be partitioned and sorted, and if we use Auto partitioning then by default the above will happen (unless it is overridden in the job or disabled at the project level, as you said).
Thanx and Regards,
ETL User