Aggregator performance

Tigrou
Participant
Posts: 3
Joined: Mon Jan 26, 2009 10:05 am

Aggregator performance

Post by Tigrou »

Hi everyone,

First of all, thanks to all of you for this great forum. :D

Maybe you could help me with a "problem" I have run into in one of my jobs.

I extract around 200 million rows from an unsorted flat file, and I want to "group by" them with an Aggregator stage (hash method on the keys).

It works well, but it takes around 1h30min on four nodes.
(I don't know the architecture of the Unix server.)

My question is: do you think it would be possible to improve performance and reduce execution time, and if so, how?
I have already tried sorting the input data before the Aggregator, but the time saved was negligible.

Any help would be greatly appreciated. :wink:

Thanks a lot
ray.wurlod
Participant
Posts: 54607
Joined: Wed Oct 23, 2002 10:52 pm
Location: Sydney, Australia
Contact:

Post by ray.wurlod »

There's nothing you can tune with Hash aggregation method. Sort will give better performance (measured as elapsed time) but does require sorted input. Use an explicit Sort stage and use as much memory for sorting as you can afford. Don't forget to partition and sort on the grouping keys - do this as far upstream in your job design as possible.
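To make the difference between the two methods concrete, here is a minimal sketch in plain Python (not DataStage code). The function names, the sample rows and the use of CRC32 as the partitioning hash are all assumptions made for illustration. The hash method keeps a running total for every distinct key in memory, whereas the sort method relies on key-sorted input and only holds one group at a time. Partitioning on the grouping key first ensures all rows for a key end up on the same node.

```python
from collections import defaultdict
from itertools import groupby
from operator import itemgetter
import zlib


def hash_aggregate(rows):
    """Hash method: one running total per distinct key is held in memory."""
    totals = defaultdict(int)
    for key, value in rows:
        totals[key] += value
    return dict(totals)


def sort_aggregate(sorted_rows):
    """Sort method: input must already be sorted on the key,
    so each group can be emitted as soon as the key changes."""
    for key, group in groupby(sorted_rows, key=itemgetter(0)):
        yield key, sum(value for _, value in group)


def partition_by_key(rows, nodes):
    """Hash-partition rows so every row for a given key lands on the same node."""
    parts = [[] for _ in range(nodes)]
    for key, value in rows:
        parts[zlib.crc32(key.encode()) % nodes].append((key, value))
    return parts


# Hypothetical sample data: (grouping key, value to sum).
rows = [("b", 2), ("a", 1), ("c", 5), ("a", 3), ("b", 4)]

by_hash = hash_aggregate(rows)

by_sort = {}
for part in partition_by_key(rows, nodes=4):
    part.sort(key=itemgetter(0))          # sort each partition on the grouping key
    by_sort.update(sort_aggregate(part))  # then stream-aggregate it

print(by_hash == by_sort)
```

Both paths produce the same totals; what differs is how much state the aggregation step has to hold while it runs.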

But 200 million rows in 90 minutes still represents nearly 40,000 rows per second, which is not too bad. That's a large volume of data you have.
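The throughput figure can be checked with a few lines of Python. The per-node number assumes the rows are spread evenly over the four nodes mentioned in the original post, which may not hold in practice.

```python
rows = 200_000_000          # row count from the original post
elapsed_seconds = 90 * 60   # 1h30min
nodes = 4                   # four-node configuration from the original post

rows_per_second = rows / elapsed_seconds
# Assumption: rows are evenly distributed across nodes.
rows_per_second_per_node = rows_per_second / nodes

print(f"{rows_per_second:,.0f} rows/s overall")
print(f"{rows_per_second_per_node:,.0f} rows/s per node")
```

This gives roughly 37,000 rows per second overall, which is the "nearly 40,000" quoted above.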
IBM Software Services Group
Any contribution to this forum is my own opinion and does not necessarily reflect any position that IBM may hold.
Tigrou
Participant
Posts: 3
Joined: Mon Jan 26, 2009 10:05 am

Post by Tigrou »

ray.wurlod wrote:There's nothing you can tune with Hash aggregation method. Sort will give better performance (measured as elapsed time) but does require sorted input. Use an explicit Sort stage and use as much memo ...
Thanks for your answer.
So I'll try this and let you know how it goes... :wink:

But in your opinion, is the elapsed time I get (1h30min) "normal" for that many rows, or is it too long?
ray.wurlod
Participant
Posts: 54607
Joined: Wed Oct 23, 2002 10:52 pm
Location: Sydney, Australia
Contact:

Post by ray.wurlod »

Without knowing more about your servers, particularly their performance data (CPU speed and the like) it's really impossible to say. But nearly 40,000 rows per second sustained over 90 minutes would keep most people here happy.
Tigrou
Participant
Posts: 3
Joined: Mon Jan 26, 2009 10:05 am

Post by Tigrou »

Thanks a lot for your help.
I'll check and let you know whether my problem is resolved.

Kind regards.
kcbland
Participant
Posts: 5208
Joined: Wed Jan 15, 2003 8:56 am
Location: Lutz, FL
Contact:

Post by kcbland »

Look at your CPU utilization. If it's at 100%, you are CPU bound. If it isn't, there's a chance you're paging or swapping memory to disk, in which case performance could be improved. Sorting first will help aggregation, but you have to try both ways to see which gives the better runtime.
Kenneth Bland
