aggregator performance

dnat · Post by **dnat** » Thu Mar 12, 2009 4:00 am

I am using an aggregator stage just to count the number of rows from a particular link.

The design is like this

Seq file-->transformer-->aggregator-->seq file

Here I need the aggregator to count the total rows coming from the transformer. The key is the same for all the records, so everything would pass through only one partition.

I am dealing with millions of records. We are still in development, but I wanted to know how this would affect performance. Is there any other way to do this?

bkumar103 · Post by **bkumar103** » Thu Mar 12, 2009 4:52 am

Do you need just the record count in the output Sequential file?
If so, you can use wc -l < inputfilename > outputfilename to get the count.
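For anyone who wants the same count from a script rather than the shell, here is a minimal Python sketch of what `wc -l` does. The function name `count_lines` and the chunk size are assumptions made for illustration. Like `wc -l`, it counts newline characters, so a last record with no trailing newline is not counted.

```python
def count_lines(path, chunk_size=1 << 20):
    """Count newline characters in a file, similar to `wc -l`.

    Reads the file in fixed-size binary chunks so that a file with
    millions of records never has to fit in memory at once.
    """
    count = 0
    with open(path, "rb") as f:
        while True:
            chunk = f.read(chunk_size)
            if not chunk:  # end of file
                break
            count += chunk.count(b"\n")
    return count
```

Calling `count_lines("inputfilename")` returns the count as an integer, which can then be written to the output file.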

ray.wurlod · Post by **ray.wurlod** » Thu Mar 12, 2009 3:34 pm

Do you really need the count as a separate operation? Why not calculate it as you are processing the actual file?

sjaladurgam · Post by **sjaladurgam** » Thu Mar 12, 2009 7:49 pm

I experienced the same issue too. I tried using two Aggregator stages, the first with hash partitioning and the second in sequential mode, and that works fantastically.

Just try this.

Thanks.

sima79 · Post by **sima79** » Thu Mar 12, 2009 11:40 pm

Use one aggregator stage (execution mode parallel) to count the rows in parallel, then another aggregator stage (execution mode sequential) to sum up the counts from each partition. There is no need to use hash partitioning; round robin would be better in this case.
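The two-step idea can be illustrated outside DataStage with a small Python sketch. This is not DataStage code; the function names and the in-memory lists are assumptions used only to show the logic: spread the rows round robin, count each partition independently, then sum the partial counts in a single final step.

```python
def round_robin_partition(rows, num_partitions):
    """Deal rows out to partitions in turn, like round robin partitioning."""
    partitions = [[] for _ in range(num_partitions)]
    for i, row in enumerate(rows):
        partitions[i % num_partitions].append(row)
    return partitions


def count_per_partition(partitions):
    """Step 1 (parallel in a real job): one row count per partition."""
    return [len(p) for p in partitions]


def total_count(partial_counts):
    """Step 2 (sequential): sum the partial counts, do not count them again."""
    return sum(partial_counts)


if __name__ == "__main__":
    rows = list(range(10))                       # stand-in for the input records
    partials = count_per_partition(round_robin_partition(rows, 4))
    print(partials)                              # [3, 3, 2, 2]
    print(total_count(partials))                 # 10
```

In this sketch the second step receives only one number per partition, so it stays cheap however many rows the first step handles. Note that the second step sums the partial counts; counting them instead would return the number of partitions (4 here) rather than the number of rows.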

dnat · Post by **dnat** » Fri Mar 13, 2009 12:49 am

sima and sjaladurgam,

So the two aggregator stages would not hinder performance when processing millions of records? I am just worried since the data volume is very large. Anyway, thanks for your input.

Ray, I am not sure how we can calculate the count during the actual processing, because I would still have to calculate it without partitioning to get the total count.

Sainath.Srinivasan · Post by **Sainath.Srinivasan** » Fri Mar 13, 2009 3:19 am

sjaladurgam wrote:...and second one with sequential ...

dnat · Post by **dnat** » Fri Mar 13, 2009 5:32 am

I set the first aggregator to round robin and the next one to sequential mode, but the output is not correct.

The first aggregator shows as a collection type.

dnat · Post by **dnat** » Fri Mar 13, 2009 6:03 am

The first aggregator was showing as a collection type because it was in sequential mode. I changed it to parallel mode with round robin partitioning. The second aggregator is in sequential mode, but it is still not giving the correct output.

Sainath.Srinivasan · Post by **Sainath.Srinivasan** » Fri Mar 13, 2009 7:13 am

What do you mean by "not giving correct data"?

Unless you share the results, it is not even possible to guess what is happening differently.
