Hi everyone,
First of all, thanks to all of you for this great forum :D
Maybe you could help me with a "problem" I've met in one of my jobs.
I extract around 200 million rows from an unsorted flat file, and I want to "group by" them with an Aggregator stage (hash method on the keys).
It works well, but it takes around 1h30min on four nodes.
(I don't know the architecture of the Unix server.)
My question is: do you think it would be possible to improve performance and execution time, and how?
I already tried sorting the input data before the Aggregator, but the time saved was insignificant.
Any help would be appreciated.
Thanks a lot
Aggregator performance
Moderators: chulett, rschirm, roy
-
- Participant
- Posts: 54607
- Joined: Wed Oct 23, 2002 10:52 pm
- Location: Sydney, Australia
- Contact:
There's nothing you can tune with Hash aggregation method. Sort will give better performance (measured as elapsed time) but does require sorted input. Use an explicit Sort stage and use as much memory for sorting as you can afford. Don't forget to partition and sort on the grouping keys - do this as far upstream in your job design as possible.
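The trade-off between the two aggregation methods can be sketched outside DataStage. A minimal Python illustration with hypothetical (key, value) rows, not DataStage's actual API: the hash method needs one pass but holds every distinct group in memory at once, while the sort method needs key-sorted input but holds only one group at a time.

```python
from collections import defaultdict
from itertools import groupby

# Hypothetical (key, value) rows standing in for the flat-file data.
rows = [("FR", 10), ("DE", 5), ("FR", 7), ("DE", 1), ("UK", 3)]

def hash_group_sum(rows):
    """Hash method: one pass, but all distinct groups stay in memory."""
    totals = defaultdict(int)
    for key, value in rows:
        totals[key] += value
    return dict(totals)

def sort_group_sum(sorted_rows):
    """Sort method: input must be sorted on the key; only one group is held at a time."""
    return {key: sum(v for _, v in grp)
            for key, grp in groupby(sorted_rows, key=lambda r: r[0])}

# Both methods agree; the sort method just needs the explicit sort first.
assert hash_group_sum(rows) == sort_group_sum(sorted(rows)) == {"FR": 17, "DE": 6, "UK": 3}
```

This is why the advice is to partition and sort on the grouping keys as far upstream as possible: once the data arrives sorted, the aggregator's memory footprint stays small regardless of how many groups there are.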
IBM Software Services Group
Any contribution to this forum is my own opinion and does not necessarily reflect any position that IBM may hold.
-
- Participant
- Posts: 54607
- Joined: Wed Oct 23, 2002 10:52 pm
- Location: Sydney, Australia
- Contact:
But 200 million rows in 90 minutes still represents nearly 40,000 rows per second, which is not too bad. That's a large volume of data you have.
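The throughput arithmetic behind that figure is straightforward:

```python
rows = 200_000_000
seconds = 90 * 60        # 1 h 30 min of elapsed time
rate = rows / seconds    # sustained rows per second
print(round(rate))       # prints 37037, i.e. "nearly 40,000 rows per second"
```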
IBM Software Services Group
Any contribution to this forum is my own opinion and does not necessarily reflect any position that IBM may hold.
Thanks for your answer.
ray.wurlod wrote: There's nothing you can tune with Hash aggregation method. Sort will give better performance (measured as elapsed time) but does require sorted input. Use an explicit Sort stage and use as much memo ...
So I'll try this and let you know how it goes...
But in your opinion, is the elapsed time I get (1h30min) "normal" for such a quantity of rows, or is it too long?
-
- Participant
- Posts: 54607
- Joined: Wed Oct 23, 2002 10:52 pm
- Location: Sydney, Australia
- Contact:
Without knowing more about your servers, particularly their performance data (CPU speed and the like) it's really impossible to say. But nearly 40,000 rows per second sustained over 90 minutes would keep most people here happy.
IBM Software Services Group
Any contribution to this forum is my own opinion and does not necessarily reflect any position that IBM may hold.
Look at your CPU utilization: if it's at 100%, you are CPU bound. If it isn't, there's a chance you're paging/swapping between memory and disk and could improve performance. Sorting first will help aggregation, but you have to try it both ways to see which gives the better runtime.
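On the server itself this is usually a job for `vmstat` or `sar`, but the underlying check can be sketched in plain Python: compare process CPU time to wall-clock time while a workload runs. A ratio near 1.0 per core means CPU bound; much lower means the process is waiting on I/O, paging, or something else. (This is an illustrative sketch, not how you would instrument a DataStage job.)

```python
import time

def cpu_bound_fraction(fn):
    """Return the ratio of process CPU time to wall time while fn runs.
    Near 1.0 => the work is CPU bound; much lower => waiting on I/O or paging."""
    t0_wall = time.perf_counter()
    t0_cpu = time.process_time()
    fn()
    wall = time.perf_counter() - t0_wall
    cpu = time.process_time() - t0_cpu
    return cpu / wall if wall > 0 else 0.0

# A busy loop is almost entirely CPU bound; a sleep is almost entirely waiting.
busy = cpu_bound_fraction(lambda: sum(i * i for i in range(2_000_000)))
idle = cpu_bound_fraction(lambda: time.sleep(0.2))
```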
Kenneth Bland
Rank: Sempai
Belt: First degree black
Fight name: Captain Hook
Signature knockout: right upper cut followed by left hook
Signature submission: Crucifix combined with leg triangle