Sorting and sending to Entire Partition

Post questions here relative to DataStage Enterprise/PX Edition for such areas as Parallel job design, Parallel datasets, BuildOps, Wrappers, etc.

asorrell
Posts: 1707
Joined: Fri Apr 04, 2003 2:00 pm
Location: Colleyville, Texas

Post by asorrell »

We are sorting a moderately large dataset on about 100 nodes, and we want to send the sorted output to the left leg of a right outer join using "Entire" partitioning.

-YES- I know this sounds odd, but trust me, there are some extreme variants in the data, and this is getting us around them in a very performant manner (so far).

The question is: if we sort on 100 nodes and send the output straight to the join with "Entire" selected, will DataStage keep all the records in sorted order as it consolidates the data from all 100 nodes and copies it back out to all 100 nodes?

As an alternative, I know we can consolidate from the sort down to a single-threaded copy stage, which will keep the results in sorted order. Then sending that out to the join as "Entire" will keep the order in place.
Andy Sorrell
Certified DataStage Consultant
IBM Analytics Champion 2009 - 2020
asorrell
Posts: 1707
Joined: Fri Apr 04, 2003 2:00 pm
Location: Colleyville, Texas

Post by asorrell »

And the answer seems to be "no". We ran a test and it looks like it dropped about 5 million records, so I'm assuming that DataStage doesn't keep the records in sorted order as it goes from many (Hash) to many (Entire).

Inserting a single-threaded copy stage works, but is slow (run time goes from 20 minutes to 1.2 hours).

/sigh....

The original problem is that the partitioned data is too "clumpy": some partitions get a massive amount of data and others get very little. We found that the "right" leg of the join could add an additional partitioning key to get a finer grain and spread the data out. However, the left leg doesn't have an equivalent key, so we need to use "Entire" on that leg of the join to get the elements to match up.
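A rough Python sketch of the idea described above, for anyone following along. Everything in it is hypothetical: the row layout, the `partition_of` stand-in for a hash partitioner, and the four-partition setup are illustrative assumptions, not anything taken from the actual job or from DataStage itself.

```python
from collections import Counter, defaultdict

NUM_PARTITIONS = 4  # assumed for illustration; the real job uses ~100 nodes


def partition_of(*key_fields):
    # Stand-in for a hash partitioner: any deterministic function of the keys.
    return sum(key_fields) % NUM_PARTITIONS


# Hypothetical data. Left leg: one row per join key. Right leg: key 7 is "clumpy".
left_rows = [(7, "left-7"), (8, "left-8")]
right_rows = [(7, extra) for extra in range(1000)] + [(8, 0), (9, 0)]


def right_outer_join(left, right):
    """Right outer join on the first field; unmatched right rows get None."""
    lookup = defaultdict(list)
    for key, payload in left:
        lookup[key].append(payload)
    return [(key, extra, payload)
            for key, extra in right
            for payload in (lookup.get(key) or [None])]


# Partitioning on the join key alone: every key-7 row lands in one partition.
sizes_key_only = Counter(partition_of(key) for key, _ in right_rows)

# Partitioning on (join key, extra key): the key-7 rows spread out.
right_parts = defaultdict(list)
for key, extra in right_rows:
    right_parts[partition_of(key, extra)].append((key, extra))
sizes_two_keys = {p: len(rows) for p, rows in right_parts.items()}

# "Entire" on the left leg: every partition joins against a full copy of left_rows.
partitioned_result = []
for p in range(NUM_PARTITIONS):
    partitioned_result.extend(right_outer_join(left_rows, right_parts.get(p, [])))

single_node_result = right_outer_join(left_rows, right_rows)

print(max(sizes_key_only.values()), max(sizes_two_keys.values()))
```

Because each right-leg row lives in exactly one partition and every partition sees the whole left leg, the partitioned right outer join yields the same rows as a single-node join, at the cost of holding a full copy of the left leg per partition.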
UCDI
Premium Member
Posts: 383
Joined: Mon Mar 21, 2016 2:00 pm

Post by UCDI »

I can't speak to solving the issue as you describe it, but if you did something robust to the keys yourself, such as a SHA hash of the keys that you have on both sides, it should redistribute clumpy keys evenly and perform solidly.

I don't think there is anything usable built in; you may need to download and compile a program to generate workable hashes. CRC won't cut it.
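For illustration only, here is a minimal Python sketch of deriving a partition number from a SHA digest of the key fields. The function name `sha_partition`, the field separator, and the sample keys are all assumptions made up for this example; in a real job the same derivation would have to be applied to the keys on both inputs.

```python
import hashlib
from collections import Counter


def sha_partition(key_fields, num_partitions):
    """Return a partition number derived from a SHA-256 digest of the key fields."""
    # Join fields with a separator so ("ab", "c") and ("a", "bc") hash differently.
    data = "\x1f".join(str(f) for f in key_fields).encode("utf-8")
    digest = hashlib.sha256(data).digest()
    # The first 8 bytes, read as an unsigned integer, are plenty for a modulus.
    return int.from_bytes(digest[:8], "big") % num_partitions


# Hypothetical keys: 100,000 distinct customer/order pairs over 100 partitions.
counts = Counter(sha_partition((f"CUST{i % 5000}", i), 100) for i in range(100_000))
print(min(counts.values()), max(counts.values()))
```

One thing to keep in mind: identical key values always hash to the same partition, so this evens out a large number of distinct keys, but a single key value that carries a big share of the rows will still land on one partition.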
PaulVL
Premium Member
Posts: 1315
Joined: Fri Dec 17, 2010 4:36 pm

Post by PaulVL »

Couldn't you just do a join with it being set to sequential rather than parallel?
asorrell
Posts: 1707
Joined: Fri Apr 04, 2003 2:00 pm
Location: Colleyville, Texas

Post by asorrell »

PaulVL wrote: Couldn't you just do a join with it being set to sequential rather than parallel?
Nope - it's processing in excess of a billion rows, sometimes up to double-digit billions!

Found a solution, by the way. I used a partitioned Sort stage and fed it to a sequential Copy stage using "Sort-Merge" Collection (without "Sort" checked), then sent the output to the left side of the Join stage with "Entire".

The Sort processed quickly because it ran on 100+ nodes in parallel. The Copy stage did a decent job of consolidating the data in order and feeding it out to the Join partitions as quickly as it came in.
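To show why a sort-merge style collection keeps the order while a plain concatenation of partitions does not, here is a small Python sketch. It is only an analogy: `sort_partitions` and `sort_merge_collect` are hypothetical helper names, and `heapq.merge` stands in for the collector, so nothing here claims to reflect how DataStage implements it internally.

```python
import heapq


def sort_partitions(rows, num_partitions, key):
    """Hash-partition rows on the key, then sort each partition independently."""
    parts = [[] for _ in range(num_partitions)]
    for row in rows:
        parts[hash(key(row)) % num_partitions].append(row)
    return [sorted(p, key=key) for p in parts]


def sort_merge_collect(sorted_parts, key):
    """Merge already-sorted partitions into one ordered stream without re-sorting."""
    return heapq.merge(*sorted_parts, key=key)


def by_key(row):
    return row[0]


rows = [(k, f"row-{k}") for k in (42, 7, 19, 3, 88, 61, 5, 30, 12, 77)]
parts = sort_partitions(rows, num_partitions=3, key=by_key)

# Reading the partitions one after another loses the global order...
concatenated = [r for p in parts for r in p]
# ...while merging on the sort key preserves it.
merged = list(sort_merge_collect(parts, key=by_key))

print([k for k, _ in concatenated])
print([k for k, _ in merged])
```

The merge only ever compares the head row of each partition, so it can stream rows out as they arrive instead of sorting the full dataset again on one node.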

Having to single-thread the Copy slowed the job down a bit, but overall performance has improved dramatically.

All the changes have reduced runtimes by 95%. Job was running in hours and is now completing in minutes.

Some of the redesign I recommended also reduced memory and scratch usage dramatically. The job used to have a problem with Sorts running out of scratch. Now they almost all run in memory. That allows the job to use the "Entire" strategy to trade off increased memory usage for excellent data distribution and 10x+ throughput in the Join stage.
chulett
Charter Member
Posts: 43085
Joined: Tue Nov 12, 2002 4:34 pm
Location: Denver, CO

Post by chulett »

Nice!
-craig

"You can never have too many knives" -- Logan Nine Fingers