
Removing Duplicates - Uncommon Scenario

Posted: Sat Sep 10, 2011 7:01 pm
by kumar444
I need suggestions for solving this uncommon requirement:

Code: Select all

Id    Role       Duty
100   *Athlete   Athlete
100   *Athlete   Running
100   *Athlete   Swimming
If the IDs are equal, then I need to check the Role and Duty columns.

The rule: when IDs are equal, output only the row whose Role and Duty are equal after trimming the * off the Role.
So in this case the first row is the output.

Appreciate your invaluable suggestions

Posted: Sat Sep 10, 2011 8:08 pm
by jwiles
Partition/Sort by ID.

If you're using IS 8.5, you can use the transformer looping functionality to solve this pretty easily.

If you're using 8.0 or 8.1, one option is to use a fork-join structure to add a rows-per-id count to the records, then use a transformer to drop those duplicates.

Regards,

Posted: Sat Sep 10, 2011 8:52 pm
by kumar444
Thanks Jwiles.

Since we are using 8.1 here, I can't see how rows-per-id solves this. Can you please elaborate?

Posted: Sat Sep 10, 2011 10:41 pm
by qt_ky
I think what jwiles is suggesting is this: if rows-per-id is 1, pass the row through. When rows-per-id is greater than 1, in the transformer apply your logic to trim off the * and compare Role with Duty; if they are equal, pass the row through, otherwise drop it. You can do all of that with stage variables and a constraint.

Eric
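To make the above concrete, here is a minimal Python sketch of that logic (not DataStage code; the field order id/role/duty and the pre-sorted input are assumptions based on the sample data):

```python
from itertools import groupby

def dedupe(rows):
    """Keep lone rows per id; otherwise keep rows whose role (minus the
    leading '*') equals duty. rows must already be sorted by id."""
    out = []
    for _, grp in groupby(rows, key=lambda r: r[0]):
        grp = list(grp)              # len(grp) is the rows-per-id count
        if len(grp) == 1:
            out.extend(grp)          # only one row for this id: pass through
        else:
            # trim the leading '*' from role and compare with duty
            out.extend(r for r in grp if r[1].lstrip("*") == r[2])
    return out

rows = [(100, "*Athlete", "Athlete"),
        (100, "*Athlete", "Running"),
        (100, "*Athlete", "Swimming")]
print(dedupe(rows))   # only the row whose trimmed role equals duty survives
```

In the actual job, the stage variables carry the same state the `groupby` provides here: the current id's row count and the trim-and-compare result, with the constraint doing the final pass/drop.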

Posted: Sat Sep 10, 2011 11:33 pm
by jwiles
My reasoning behind using rows-per-id is based upon your requirements description:
The rule is if ids are equal then output only the row whose role and duty are equal after trimming off the * from the role.
To me, this means that if you have more than one row for an ID, keep only the row(s?) which have equal values for role and duty (minus the '*'). To do this, you need to calculate the number of rows present for each ID ("rows-per-id"), as I don't see that value in the example data you provided. The logic qt_ky suggests would be appropriate for the transformer.

Another method that might work involves sorting on ID, then re-sorting on ID plus a keychange field assigned in the first sort. The following transformer could then use keychange much as it would use rows-per-id. Either method will provide the required results.

Regards,
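The keychange variant can be sketched in Python as well; this emulates the Sort stage's keyChange column (the function name and tuple layout are illustrative):

```python
def add_keychange(rows):
    """Emulate the Sort stage's keyChange column: 1 on the first row of each
    ID group, 0 on the rest. rows must already be sorted by ID (field 0)."""
    out, prev = [], object()   # sentinel that never equals a real ID
    for r in rows:
        out.append(r + (1 if r[0] != prev else 0,))
        prev = r[0]
    return out

tagged = add_keychange([(100, "*Athlete", "Athlete"),
                        (100, "*Athlete", "Running"),
                        (200, "*Manager", "Manager")])
# keyChange is 1 on the first 100-row and the first 200-row, 0 elsewhere
```

A downstream transformer can then detect group boundaries from the flag instead of needing a precomputed per-ID count.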

Posted: Mon Sep 12, 2011 12:24 pm
by kumar444

Code: Select all

Input

Id    Role       Duty
100   *Athlete   Athlete
100   *Athlete   Running
100   *Athlete   Swimming
200   *Manager   Accounting
200   *Manager   Manager
300   Clerk      Supervision
300   Clerk      Clerk

Output

Id    Role       Duty
100   *Athlete   Athlete
200   *Manager   Manager
300   Clerk      Clerk

Sorry, I understood your explanation but I don't see how it applies to this case. I have extended my sample to make it more meaningful.

Thanks jwiles and qt_ky.

Posted: Mon Sep 12, 2011 1:02 pm
by jwiles
There are three conditions I do not see represented in your sample data:

1) Only a single row is present for an ID (i.e. no duplicates on ID)
2) There are duplicates present for an ID, but none that meet the keep criteria you mention
3) There are duplicates present for an ID and multiple rows meet the keep criteria you mention

It may be that none of these conditions exist in your data, but as a seasoned developer and data processor, I can't safely assume that. You don't have those conditions listed in the business rules you have provided so far.

#1 is the situation where I see rows-per-id being used (you have a new ID and there's only one row for it... how do you know that in a pre-8.5 transformer, and what do you do?)

#2 Do you delete all rows, or keep one of them?

#3 Do you keep all rows that meet the criteria, or only one of them?

Regards,

Posted: Mon Sep 12, 2011 6:29 pm
by SURA
I am not sure whether I have understood your need correctly, but based on my understanding it seems straightforward.

In the Transformer constraint, use link.Role = link.Duty, so that you get only one row as output on that link.

1) But I am not sure what you will do if you get duplicate values in Duty for the same ID.

2) What if the Role does not match the Duty?

Let me know if I misunderstood your need.

DS User

Posted: Tue Sep 13, 2011 8:57 pm
by kumar444
@Jwiles,

I am lost with the requirement. Also, I am trying to run a query against the DB but it is taking forever. I need to split this query and design a DS job. I only have some sample data with me. Let me get the actual data from the query first, analyse it, and then post the complete requirement. Sorry if I have wasted your time.

Thanks jwiles; with your suggestions above I will clarify things and come back here in a while.

Posted: Tue Sep 13, 2011 9:08 pm
by kumar444
Thanks SURA. Your understanding matches my requirement. As you said, it looked simple to me: I can first check the condition Role = Duty and then remove duplicates based on ID downstream. If Role does not match Duty, I need to reject those rows. These are my assumptions based on the sample data and requirements I have. I really think something is missing here; that is why I want to get the actual data from the whole query, analyse it, and come back.

Solution for the Data Given by Kumar444

Posted: Tue Sep 13, 2011 11:19 pm
by naveen_1985in
This can be accomplished with the following design:

SeqFile --> Sort Stage --> Transformer --> Dataset (target)

In the Sort Stage, just sort the data on the ID column.

In the Transformer, use three stage variables:

sv1 = Trim(Trim(DSLink4.name, "*", "A"), " ", "A") .... removes the * and all spaces from name
sv2 = Trim(DSLink4.prof, " ", "A") .... removes all spaces from prof
sv3 = Compare(sv1, sv2) .... if the result is 0 they are the same

Add a constraint in the Transformer stage and pass only those records to the target.
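For anyone reading along, the three stage variables boil down to this Python sketch (the column names name/prof and the function name are illustrative; DataStage's Compare() returns 0 on equality, which here becomes a plain equality test feeding the constraint):

```python
def keep_row(name, prof):
    """Emulate the transformer's stage variables and constraint."""
    sv1 = name.replace("*", "").replace(" ", "")  # strip '*' and all spaces
    sv2 = prof.replace(" ", "")                   # strip all spaces
    # Compare(sv1, sv2) == 0 in DataStage <=> sv1 == sv2 here;
    # the boolean plays the role of the output constraint.
    return sv1 == sv2

print(keep_row("*Athlete", "Athlete"))   # True: row passes the constraint
print(keep_row("*Athlete", "Running"))   # False: row is dropped
```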

Posted: Wed Sep 14, 2011 12:43 am
by ray.wurlod
Your first requirement could be met by a Remove Duplicates stage preserving the First record from each group, assuming the data are sorted correctly.

You specified removal of the asterisk, but the output in your example preserved the asterisk. Removal is most easily effected by a Convert() function.

If you have to have a Transformer stage, then you can effect the removal of duplicates in that stage also, by setting up stage variables to detect a change in ID - one of these becomes your output constraint.
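For clarity, DataStage's Convert(fromlist, tolist, string) maps or deletes individual characters, not substrings; a Python equivalent of Convert("*", "", link.Role) looks like this (the function name is illustrative):

```python
def convert(from_chars, to_chars, s):
    """Emulate DataStage Convert(): map each char in from_chars to the
    corresponding char in to_chars, deleting it when to_chars is shorter."""
    table = {ord(f): (to_chars[i] if i < len(to_chars) else None)
             for i, f in enumerate(from_chars)}
    return s.translate(table)

print(convert("*", "", "*Athlete"))   # -> Athlete
```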

Posted: Tue Sep 20, 2011 4:08 pm
by kumar444
Thanks Naveen and Ray..

The actual requirement has been changed and is as below.


Input

Code: Select all

id    name   bid_no   bname   inum    custname
100   *sam   AC100    sam     64546   Assoc
200   tom    AC345    tom     45645   Build
200   tom    AC345    tom     45666   Build
200   tom    AC345    tom     null    Build
500   jam    AC888    jam     98080   facet
300   can    BC399    ban     52344   counter
300   can    AC377    can     52344   counter
288   *rom   AC623    rom     12345   Adam inc
656   gem    BC143    len     65434   kroger llc
656   gem    BC143    len     65434   kroger



Output

id    name   bid_no   bname   inum    custname
100   *sam   AC100    sam     64546   Assoc
200   tom    AC345    tom     45645   Build
200   tom    AC345    tom     45666   Build
200   tom    AC345    tom     null    Build
500   jam    AC888    jam     98080   facet
300   can    AC377    can     52344   counter
288   *rom   AC623    rom     12345   Adam inc
656   gem    BC143    len     65434   kroger llc

First look for duplicates in ids.

CASE 1: Took the records with id=100 and id=500 because each occurs only once with respect to id.
CASE 2: Took all three records with id=200 and bid_no=AC345 because their ids and bid_nos are the same.
CASE 3: Took one record with id=300 because its bid_no starts with AC (not BC).
CASE 4: Took one record with id=288 because its name equals bname when compared after trimming the * from name.
CASE 5: Took one record with id=656 because both records are the same, and retained only one.

Please help me with suggestions to get these in datastage.
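One reading of the five cases can be sketched in Python to check the rules against the sample data before building the job. The rule precedence below (single row, then bid_no mismatch, then bname mismatch, then near-duplicate removal) is my assumption, not a stated requirement; in particular, CASE 5's two rows differ in custname, so the "same record" test here deliberately ignores that column:

```python
def pick(grp):
    if len(grp) == 1:
        return grp                                   # CASE 1: single row
    if len({r["bid_no"] for r in grp}) > 1:
        # CASE 3: bid_nos differ; keep the rows starting with 'AC'
        return [r for r in grp if r["bid_no"].startswith("AC")]
    if len({r["bname"] for r in grp}) > 1:
        # CASE 4: keep the row whose name (minus '*') equals bname
        return [r for r in grp if r["name"].lstrip("*") == r["bname"]]
    # CASES 2 and 5: keep rows distinct on name/bname/inum (custname ignored)
    seen, out = set(), []
    for r in grp:
        k = (r["name"], r["bname"], r["inum"])
        if k not in seen:
            seen.add(k)
            out.append(r)
    return out

def dedupe(rows):
    groups = {}
    for r in rows:                                   # group by id, keep order
        groups.setdefault(r["id"], []).append(r)
    return [r for grp in groups.values() for r in pick(grp)]

cols = ("id", "name", "bid_no", "bname", "inum", "custname")
data = [dict(zip(cols, v)) for v in [
    (100, "*sam", "AC100", "sam", "64546", "Assoc"),
    (200, "tom",  "AC345", "tom", "45645", "Build"),
    (200, "tom",  "AC345", "tom", "45666", "Build"),
    (200, "tom",  "AC345", "tom", None,    "Build"),
    (500, "jam",  "AC888", "jam", "98080", "facet"),
    (300, "can",  "BC399", "ban", "52344", "counter"),
    (300, "can",  "AC377", "can", "52344", "counter"),
    (288, "*rom", "AC623", "rom", "12345", "Adam inc"),
    (288, "rom",  "AC623", "fom", "12345", "Adam inc"),
    (656, "gem",  "BC143", "len", "65434", "kroger llc"),
    (656, "gem",  "BC143", "len", "65434", "kroger"),
]]
out = dedupe(data)
```

Running this reproduces the eight output rows in the sample, which at least shows the five cases can be ordered consistently; in the job each branch would become a stage-variable test or a separate filter pass.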

Posted: Tue Sep 20, 2011 5:56 pm
by SURA
Hi

Sort the data and use stage variables for the different scenarios. Use a variable as a flag too. You can achieve it.

My personal request is not to expect more help from others. Split up your needs (for example: if you have more than one record and the values differ, then pass it / drop it), apply the filters one by one, and write the results to a file. That will give you more control.

If you can't get it working by yourself, then post which condition you tried and what you were trying to achieve, so that someone can correct it.

I understand what you need and could post the complete code, and it would work, but it would not help you!

Best of luck.

DS User

Posted: Tue Sep 20, 2011 7:35 pm
by ray.wurlod
I'd probably use a double fork-join design to get counts by key1 and by key1 & key2. The rest of the logic will flow nicely downstream of that.
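The double fork-join pattern (Copy stage forks the stream, two Aggregators count, two Joins merge the counts back) can be emulated in a few lines of Python; the key and output column names below are illustrative:

```python
from collections import Counter

def add_counts(rows, key1, key2):
    """Attach a rows-per-key1 count and a rows-per-(key1, key2) count
    to every input row, as the two fork-join branches would."""
    per_k1 = Counter(r[key1] for r in rows)
    per_k1k2 = Counter((r[key1], r[key2]) for r in rows)
    return [dict(r,
                 k1_count=per_k1[r[key1]],
                 k1k2_count=per_k1k2[(r[key1], r[key2])])
            for r in rows]

rows = [{"id": 300, "bid_no": "BC399"},
        {"id": 300, "bid_no": "AC377"},
        {"id": 100, "bid_no": "AC100"}]
tagged = add_counts(rows, "id", "bid_no")
# id 300 rows carry k1_count=2, k1k2_count=1; the id 100 row carries 1 and 1
```

With both counts on every row, a downstream transformer can tell single-occurrence IDs, duplicated IDs with one bid_no, and duplicated IDs with several bid_nos apart without any looping.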