Counts and sums in Datastage with out 'Aggregator' stage

Post questions here relative to DataStage Enterprise/PX Edition for such areas as Parallel job design, Parallel datasets, BuildOps, Wrappers, etc.

Moderators: chulett, rschirm, roy

anu123
Premium Member
Premium Member
Posts: 143
Joined: Sun Feb 05, 2006 1:05 pm
Location: Columbus, OH, USA

Counts and sums in Datastage with out 'Aggregator' stage

Post by anu123 »

Hi all

The scenario is as below.


key1,key2,desc
-------------------
a,b,emergency_room
a,b,emergency_room
a,b,emergency_room
m,n,emergency_room
m,n,emergency_room
x,y,emergency_room

output

key1,key2,count
----------------
a,b,3
m,n,2
x,y,1


Can this be done in datastage?
I was asked not use Aggregator stage and and to use 'group by' in my sql.
as the data vol. is very huge.
Thank you,
Anu
DSguru2B
Charter Member
Charter Member
Posts: 6854
Joined: Wed Feb 09, 2005 3:44 pm
Location: Houston, TX

Post by DSguru2B »

I did not get that. You can use group by or you cannot. You can do the counts in transformer using stage varialbes. Discussed zillions of times. Search is your best friend :wink:
Like this for example.
Creativity is allowing yourself to make mistakes. Art is knowing which ones to keep.
kumar_s
Charter Member
Charter Member
Posts: 5245
Joined: Thu Jun 16, 2005 11:00 pm

Post by kumar_s »

If you need to output in this format,
a,b,3
m,n,2
x,y,1
The better way is to group by from you soruce SQL. If key1 and key2 are indexed, SQL should me much more effecient.
Impossible doesn't mean 'it is not possible' actually means... 'NOBODY HAS DONE IT SO FAR'
anu123
Premium Member
Premium Member
Posts: 143
Joined: Sun Feb 05, 2006 1:05 pm
Location: Columbus, OH, USA

Post by anu123 »

thanks DSguru2B and Kumar.

In real scenario, I have 8 columns to group by. And the data is like 500 mil. per each run.

I have been trying with stage variables. But wondering how to retain previous key col values to check with incoming val. every time.

I created 2 stage variables like svPrevRow and svCount with initial values as 0

each incoming row is being checked against svPrevRow and if they are equal then increase the count.

do it till you get a different set of keys. that time we have to write the svPrevRow and svCount as output and do the same for the new set of keys.

But I am doing some thing wrong here.. I am not getting correct out put. could some one help me out with this...please.

thanks a lot for your valuable time.


kumar_s wrote:If you need to output in this format,
a,b,3
m,n,2
x,y,1
The better way is to group by from you soruce SQL. If key1 and key2 are indexed, SQL should me much more effecient.
Thank you,
Anu
DSguru2B
Charter Member
Charter Member
Posts: 6854
Joined: Wed Feb 09, 2005 3:44 pm
Location: Houston, TX

Post by DSguru2B »

Did you check the embeded post in my previous reply? Vincent has given the complete code on how to get it done.
Creativity is allowing yourself to make mistakes. Art is knowing which ones to keep.
anu123
Premium Member
Premium Member
Posts: 143
Joined: Sun Feb 05, 2006 1:05 pm
Location: Columbus, OH, USA

Post by anu123 »

DSguru2B wrote:Did you check the embeded post in my previous reply? Vincent has given the complete code on how to get it done.
thanks Guru. I checked the embedded post and tried to implement that logic. But in my case, I should not be writing each row to output but count them in a loop and write to output along with count when you get new set of keys.

I used same logic. But i got all the rows in output with a count '1' against each row.
Thank you,
Anu
DSguru2B
Charter Member
Charter Member
Posts: 6854
Joined: Wed Feb 09, 2005 3:44 pm
Location: Houston, TX

Post by DSguru2B »

There must be something you are doing wrong. Do us a favour and post a few lines of your actual file with exact number of columns and post your stage variable derivations. Lets diagnose where your slipping away, shall we.
Creativity is allowing yourself to make mistakes. Art is knowing which ones to keep.
anu123
Premium Member
Premium Member
Posts: 143
Joined: Sun Feb 05, 2006 1:05 pm
Location: Columbus, OH, USA

Post by anu123 »

thanks again for your time guru..

Input

key1,key2,key3,desc
-------------------
100,200,300,er
100,200,300,er
100,200,300,er
111,222,333,er
111,222,333,er


Output should be

key1,key2,key3,count
--------------------
100,200,300,3
111,222,333,2

Input is coming from a table.There are a few other keys along with above three (i.e 100,200.300 ) to make each row unique in the source table.But i need the count of each set of(i.e 100,200.300 ) records where desc = 'er'. Just assume we have 'er' as desc for all incoming rows.

and my stage variables are as below...

sv_name sv_derivation
------------------------------------------------------------------------------------
svCnt --- if input.key1:input:key2:input.key3 <> svPrevRow Then 1 Else svCnt + 1

svPrevRow -- input.key1:input:key2:input.key3






DSguru2B wrote:There must be something you are doing wrong. Do us a favour and post a few lines of your actual file with exact number of columns and post your stage variable derivations. Lets diagnose where your slipping away, shall we.
Thank you,
Anu
DSguru2B
Charter Member
Charter Member
Posts: 6854
Joined: Wed Feb 09, 2005 3:44 pm
Location: Houston, TX

Post by DSguru2B »

What you are doing is correct but you need to add another step. Currently if you feed your input (the one you mentioned in your prev. post) to the above logic, your output will look like

Code: Select all

key1,key2,key3,desc, Count
-------------------------
100,200,300,er,1 
100,200,300,er,2 
100,200,300,er,3 
111,222,333,er,1 
111,222,333,er,2 
You need to pass it through remove duplicate stage to retain the Last duplicate.

Then again you will have to process in sequential mode or partition on keys. Also make sure your input is sorted, very important. The easiest way, use the aggregator. If I were you, i would revert back to whom ever put that limit on me and demand an explanation.
Creativity is allowing yourself to make mistakes. Art is knowing which ones to keep.
anu123
Premium Member
Premium Member
Posts: 143
Joined: Sun Feb 05, 2006 1:05 pm
Location: Columbus, OH, USA

Post by anu123 »

I am getting an output as below..

key1,key2,key3,desc, Count
-------------------------
100,200,300,er,1
100,200,300,er,1
100,200,300,er,1
111,222,333,er,1
111,222,333,er,1

They had bad experiences with 'Agg' stage in 7.5.1 PX especially when dealing with 500 millions (or more) of data in each run.I was told the output were not accurate OR not consistant.

No way of using 'Agg' stage here.


DSguru2B wrote:What you are doing is correct but you need to add another step. Currently if you feed your input (the one you mentioned in your prev. post) to the above logic, your output will look like

Code: Select all

key1,key2,key3,desc, Count
-------------------------
100,200,300,er,1 
100,200,300,er,2 
100,200,300,er,3 
111,222,333,er,1 
111,222,333,er,2 
You need to pass it through remove duplicate stage to retain the Last duplicate.

Then again you will have to process in sequential mode or partition on keys. Also make sure your input is sorted, very important. The easiest way, use the aggregator. If I were you, i would revert back to whom ever put that limit on me and demand an explanation.
Thank you,
Anu
Madhav_M
Participant
Posts: 43
Joined: Sat Jul 10, 2004 5:47 am

Post by Madhav_M »

I would also suggest to go with Aggregator stage. Usage staging variable will almost takes same time.
If no other go you can go ahead with DSguru suggestion, partition on key1, key2 and key3 and sort in the input link of the transformer. Introduce a RDUP stage after transformer to retain the last record.
Cheers - Madhav
anu123
Premium Member
Premium Member
Posts: 143
Joined: Sun Feb 05, 2006 1:05 pm
Location: Columbus, OH, USA

Post by anu123 »

Thank you.

Is there any way that we can SUM the $amounts too in transformer.for example in the same above example we have 5th col. $amt. We need to SUM($amt) from keys '100,200,300' where Desc = 'er'.


Madhav_M wrote:I would also suggest to go with Aggregator stage. Usage staging variable will almost takes same time.
If no other go you can go ahead with DSguru suggestion, partition on key1, key2 and key3 and sort in the input link of the transformer. Introduce a RDUP stage after transformer to retain the last record.
Cheers - Madhav
Thank you,
Anu
DSguru2B
Charter Member
Charter Member
Posts: 6854
Joined: Wed Feb 09, 2005 3:44 pm
Location: Houston, TX

Post by DSguru2B »

Yes you can. Add another stage variable say svSum. Put it after svCnt and before svPrevRow. Its derivation will be

Code: Select all

if input.key1:input:key2:input.key3 <> svPrevRow Then svSum Else svCnt + in.Amount
Then again, if your running in sequential mode it will work, otherwise you need to partition on your keys.
Creativity is allowing yourself to make mistakes. Art is knowing which ones to keep.
anu123
Premium Member
Premium Member
Posts: 143
Joined: Sun Feb 05, 2006 1:05 pm
Location: Columbus, OH, USA

Post by anu123 »

Guru:
could you please check my previous to previous post. I am not getting the output as you mentioned.

thanks in advance.
DSguru2B wrote:Yes you can. Add another stage variable say svSum. Put it after svCnt and before svPrevRow. Its derivation will be

Code: Select all

if input.key1:input:key2:input.key3 <> svPrevRow Then svSum Else svCnt + in.Amount
Then again, if your running in sequential mode it will work, otherwise you need to partition on your keys.
Thank you,
Anu
DSguru2B
Charter Member
Charter Member
Posts: 6854
Joined: Wed Feb 09, 2005 3:44 pm
Location: Houston, TX

Post by DSguru2B »

In the transformer, go to the stage properties, what partitioning are you providing?
Creativity is allowing yourself to make mistakes. Art is knowing which ones to keep.
Post Reply