Hi everyone,
First of all, thanks to all of you for this great forum :D
Maybe you could help me with a "problem" I've met in one of my jobs.
I extract around 200 million rows from an unsorted flat file, and I want to "group by" them with an Aggregator stage (hash method on the keys).
It works well, but it takes around 1h30min on four nodes.
(I don't know the architecture of the Unix server.)
My question is: do you think it would be possible to improve performance and execution time, and how?
I already tried sorting the input data before the Aggregator, but the time saved was insignificant.
Any help would be appreciated.
Thanks a lot
Aggregator performance
Moderators: chulett, rschirm, roy
-
- Participant
- Posts: 54607
- Joined: Wed Oct 23, 2002 10:52 pm
- Location: Sydney, Australia
- Contact:
There's nothing you can tune with Hash aggregation method. Sort will give better performance (measured as elapsed time) but does require sorted input. Use an explicit Sort stage and use as much memory for sorting as you can afford. Don't forget to partition and sort on the grouping keys - do this as far upstream in your job design as possible.
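The trade-off between the two aggregation methods can be sketched outside DataStage. A minimal Python illustration with hypothetical (key, value) rows, not DataStage's actual API: the hash method needs one pass but holds every distinct group in memory at once, while the sort method needs key-sorted input but holds only one group at a time.

```python
from collections import defaultdict
from itertools import groupby

# Hypothetical (key, value) rows standing in for the flat-file data.
rows = [("FR", 10), ("DE", 5), ("FR", 7), ("DE", 1), ("UK", 3)]

def hash_group_sum(rows):
    """Hash method: one pass, but all distinct groups stay in memory."""
    totals = defaultdict(int)
    for key, value in rows:
        totals[key] += value
    return dict(totals)

def sort_group_sum(sorted_rows):
    """Sort method: input must be sorted on the key; only one group is held at a time."""
    return {key: sum(v for _, v in grp)
            for key, grp in groupby(sorted_rows, key=lambda r: r[0])}

# Both methods agree; the sort method just needs the explicit sort first.
assert hash_group_sum(rows) == sort_group_sum(sorted(rows)) == {"FR": 17, "DE": 6, "UK": 3}
```

This is why the advice is to partition and sort on the grouping keys as far upstream as possible: once the data arrives sorted, the aggregator's memory footprint stays small regardless of how many groups there are.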
IBM Software Services Group
Any contribution to this forum is my own opinion and does not necessarily reflect any position that IBM may hold.
-
- Participant
- Posts: 54607
- Joined: Wed Oct 23, 2002 10:52 pm
- Location: Sydney, Australia
- Contact:
But 200 million rows in 90 minutes still represents nearly 40,000 rows per second, which is not too bad. That's a large volume of data you have.
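The throughput arithmetic behind that figure is straightforward:

```python
rows = 200_000_000
seconds = 90 * 60        # 1 h 30 min of elapsed time
rate = rows / seconds    # sustained rows per second
print(round(rate))       # prints 37037, i.e. "nearly 40,000 rows per second"
```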
IBM Software Services Group
Any contribution to this forum is my own opinion and does not necessarily reflect any position that IBM may hold.
Thanks for your answer.
ray.wurlod wrote: There's nothing you can tune with Hash aggregation method. Sort will give better performance (measured as elapsed time) but does require sorted input. Use an explicit Sort stage and use as much memo ...
So I'll try this and let you know how it goes...
But in your opinion, is the elapsed time I get (1h30min) "normal" for such a quantity of rows, or is it too long?
-
- Participant
- Posts: 54607
- Joined: Wed Oct 23, 2002 10:52 pm
- Location: Sydney, Australia
- Contact:
Without knowing more about your servers, particularly their performance data (CPU speed and the like) it's really impossible to say. But nearly 40,000 rows per second sustained over 90 minutes would keep most people here happy.
IBM Software Services Group
Any contribution to this forum is my own opinion and does not necessarily reflect any position that IBM may hold.
Look at your CPU utilization: if it's at 100%, you are CPU bound. If it isn't, there's a chance you're paging/swapping between memory and disk and could improve performance. Sorting first will help aggregation, but you have to try it both ways to see which gives the better runtime.
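On the server itself this is usually a job for `vmstat` or `sar`, but the underlying check can be sketched in plain Python: compare process CPU time to wall-clock time while a workload runs. A ratio near 1.0 per core means CPU bound; much lower means the process is waiting on I/O, paging, or something else. (This is an illustrative sketch, not how you would instrument a DataStage job.)

```python
import time

def cpu_bound_fraction(fn):
    """Return the ratio of process CPU time to wall time while fn runs.
    Near 1.0 => the work is CPU bound; much lower => waiting on I/O or paging."""
    t0_wall = time.perf_counter()
    t0_cpu = time.process_time()
    fn()
    wall = time.perf_counter() - t0_wall
    cpu = time.process_time() - t0_cpu
    return cpu / wall if wall > 0 else 0.0

# A busy loop is almost entirely CPU bound; a sleep is almost entirely waiting.
busy = cpu_bound_fraction(lambda: sum(i * i for i in range(2_000_000)))
idle = cpu_bound_fraction(lambda: time.sleep(0.2))
```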
Kenneth Bland
Rank: Sempai
Belt: First degree black
Fight name: Captain Hook
Signature knockout: right upper cut followed by left hook
Signature submission: Crucifix combined with leg triangle