Counts and sums in Datastage with out 'Aggregator' stage

anu123 · Post by **anu123** » Thu Jan 18, 2007 4:04 pm

Hi all

The scenario is as below.

key1,key2,desc
-------------------
a,b,emergency_room
a,b,emergency_room
a,b,emergency_room
m,n,emergency_room
m,n,emergency_room
x,y,emergency_room

output

key1,key2,count
----------------
a,b,3
m,n,2
x,y,1

Can this be done in datastage?
I was asked not use Aggregator stage and and to use 'group by' in my sql.
as the data vol. is very huge.

DSguru2B · Post by **DSguru2B** » Thu Jan 18, 2007 4:07 pm

I did not get that. You can use group by or you cannot. You can do the counts in transformer using stage varialbes. Discussed zillions of times. Search is your best friend

Like this for example.

kumar_s · Post by **kumar_s** » Thu Jan 18, 2007 5:09 pm

If you need to output in this format,
a,b,3
m,n,2
x,y,1
The better way is to group by from you soruce SQL. If key1 and key2 are indexed, SQL should me much more effecient.

anu123 · Post by **anu123** » Thu Jan 18, 2007 9:16 pm

thanks DSguru2B and Kumar.

In real scenario, I have 8 columns to group by. And the data is like 500 mil. per each run.

I have been trying with stage variables. But wondering how to retain previous key col values to check with incoming val. every time.

I created 2 stage variables like svPrevRow and svCount with initial values as 0

each incoming row is being checked against svPrevRow and if they are equal then increase the count.

do it till you get a different set of keys. that time we have to write the svPrevRow and svCount as output and do the same for the new set of keys.

But I am doing some thing wrong here.. I am not getting correct out put. could some one help me out with this...please.

thanks a lot for your valuable time.

kumar_s wrote:If you need to output in this format,
a,b,3
m,n,2
x,y,1
The better way is to group by from you soruce SQL. If key1 and key2 are indexed, SQL should me much more effecient.

DSguru2B · Post by **DSguru2B** » Thu Jan 18, 2007 9:51 pm

Did you check the embeded post in my previous reply? Vincent has given the complete code on how to get it done.

anu123 · Post by **anu123** » Fri Jan 19, 2007 8:27 am

DSguru2B wrote:Did you check the embeded post in my previous reply? Vincent has given the complete code on how to get it done.

thanks Guru. I checked the embedded post and tried to implement that logic. But in my case, I should not be writing each row to output but count them in a loop and write to output along with count when you get new set of keys.

I used same logic. But i got all the rows in output with a count '1' against each row.

DSguru2B · Post by **DSguru2B** » Fri Jan 19, 2007 8:39 am

There must be something you are doing wrong. Do us a favour and post a few lines of your actual file with exact number of columns and post your stage variable derivations. Lets diagnose where your slipping away, shall we.

anu123 · Post by **anu123** » Fri Jan 19, 2007 9:11 am

thanks again for your time guru..

Input

key1,key2,key3,desc
-------------------
100,200,300,er
100,200,300,er
100,200,300,er
111,222,333,er
111,222,333,er

Output should be

key1,key2,key3,count
--------------------
100,200,300,3
111,222,333,2

Input is coming from a table.There are a few other keys along with above three (i.e 100,200.300 ) to make each row unique in the source table.But i need the count of each set of(i.e 100,200.300 ) records where desc = 'er'. Just assume we have 'er' as desc for all incoming rows.

and my stage variables are as below...

sv_name sv_derivation
------------------------------------------------------------------------------------
svCnt --- if input.key1:input

input.key3 <> svPrevRow Then 1 Else svCnt + 1

svPrevRow -- input.key1:input

input.key3

DSguru2B wrote:There must be something you are doing wrong. Do us a favour and post a few lines of your actual file with exact number of columns and post your stage variable derivations. Lets diagnose where your slipping away, shall we.

DSguru2B · Post by **DSguru2B** » Fri Jan 19, 2007 9:35 am

What you are doing is correct but you need to add another step. Currently if you feed your input (the one you mentioned in your prev. post) to the above logic, your output will look like

Code: Select all

key1,key2,key3,desc, Count
-------------------------
100,200,300,er,1 
100,200,300,er,2 
100,200,300,er,3 
111,222,333,er,1 
111,222,333,er,2

You need to pass it through remove duplicate stage to retain the Last duplicate.

Then again you will have to process in sequential mode or partition on keys. Also make sure your input is sorted, very important. The easiest way, use the aggregator. If I were you, i would revert back to whom ever put that limit on me and demand an explanation.

anu123 · Post by **anu123** » Fri Jan 19, 2007 9:51 am

I am getting an output as below..

key1,key2,key3,desc, Count
-------------------------
100,200,300,er,1
100,200,300,er,1
100,200,300,er,1
111,222,333,er,1
111,222,333,er,1

They had bad experiences with 'Agg' stage in 7.5.1 PX especially when dealing with 500 millions (or more) of data in each run.I was told the output were not accurate OR not consistant.

No way of using 'Agg' stage here.

DSguru2B wrote:What you are doing is correct but you need to add another step. Currently if you feed your input (the one you mentioned in your prev. post) to the above logic, your output will look like
Code: Select all
key1,key2,key3,desc, Count
-------------------------
100,200,300,er,1 
100,200,300,er,2 
100,200,300,er,3 
111,222,333,er,1 
111,222,333,er,2 
You need to pass it through remove duplicate stage to retain the Last duplicate.

Then again you will have to process in sequential mode or partition on keys. Also make sure your input is sorted, very important. The easiest way, use the aggregator. If I were you, i would revert back to whom ever put that limit on me and demand an explanation.

Madhav_M · Post by **Madhav_M** » Fri Jan 19, 2007 10:03 am

I would also suggest to go with Aggregator stage. Usage staging variable will almost takes same time.
If no other go you can go ahead with DSguru suggestion, partition on key1, key2 and key3 and sort in the input link of the transformer. Introduce a RDUP stage after transformer to retain the last record.
Cheers - Madhav

anu123 · Post by **anu123** » Fri Jan 19, 2007 12:25 pm

Thank you.

Is there any way that we can SUM the $amounts too in transformer.for example in the same above example we have 5th col. $amt. We need to SUM($amt) from keys '100,200,300' where Desc = 'er'.

Madhav_M wrote:I would also suggest to go with Aggregator stage. Usage staging variable will almost takes same time.
If no other go you can go ahead with DSguru suggestion, partition on key1, key2 and key3 and sort in the input link of the transformer. Introduce a RDUP stage after transformer to retain the last record.
Cheers - Madhav

DSguru2B · Post by **DSguru2B** » Fri Jan 19, 2007 1:31 pm

Yes you can. Add another stage variable say svSum. Put it after svCnt and before svPrevRow. Its derivation will be

Code: Select all

if input.key1:input:key2:input.key3 <> svPrevRow Then svSum Else svCnt + in.Amount

Then again, if your running in sequential mode it will work, otherwise you need to partition on your keys.

anu123 · Post by **anu123** » Fri Jan 19, 2007 2:34 pm

Guru:
could you please check my previous to previous post. I am not getting the output as you mentioned.

thanks in advance.

DSguru2B wrote:Yes you can. Add another stage variable say svSum. Put it after svCnt and before svPrevRow. Its derivation will be
Code: Select all
if input.key1:input:key2:input.key3 <> svPrevRow Then svSum Else svCnt + in.Amount
Then again, if your running in sequential mode it will work, otherwise you need to partition on your keys.

DSguru2B · Post by **DSguru2B** » Fri Jan 19, 2007 2:40 pm

In the transformer, go to the stage properties, what partitioning are you providing?