Removing Duplicates - Uncommon Scenario

Post questions here relating to DataStage Enterprise/PX Edition, for areas such as Parallel job design, Parallel datasets, BuildOps, Wrappers, etc.

Moderators: chulett, rschirm, roy

kumar444
Premium Member
Posts: 62
Joined: Wed Jan 12, 2011 10:01 am

Removing Duplicates - Uncommon Scenario

Post by kumar444 »

I need suggestions to solve this uncommon requirement:


Id    Role       Duty
100   *Athlete   Athlete
100   *Athlete   Running
100   *Athlete   Swimming
If the Ids are equal, then I should check the Role and Duty columns.

The rule is: if the Ids are equal, output only the row whose Role and Duty are equal after trimming the * off the Role.
So in this case, the first row is the output.

Appreciate your invaluable suggestions
-------------------------------------------
Kumar
jwiles
Premium Member
Posts: 1274
Joined: Sun Nov 14, 2004 8:50 pm
Contact:

Post by jwiles »

Partition/Sort by ID.

If you're using IS 8.5, you can use the transformer looping functionality to solve this pretty easily.

If you're using 8.0 or 8.1, one option is to use a fork-join structure to add a rows-per-id count to the records, then use a transformer to drop those duplicates.
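To illustrate the fork-join idea outside DataStage, here is a minimal Python sketch using the sample rows from this thread (in the actual job, the "fork" would be a Copy stage feeding an Aggregator, and the "join" a Join stage back onto the main stream):

```python
from collections import Counter

# Sample rows from the thread: (Id, Role, Duty)
rows = [
    (100, "*Athlete", "Athlete"),
    (100, "*Athlete", "Running"),
    (100, "*Athlete", "Swimming"),
]

# "Fork": aggregate a rows-per-id count (what the Aggregator branch produces).
rows_per_id = Counter(r[0] for r in rows)

# "Join": attach the count back onto each row, keyed on Id.
with_counts = [(*r, rows_per_id[r[0]]) for r in rows]
print(with_counts)  # each row now carries its group size
```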

Regards,
- james wiles


All generalizations are false, including this one - Mark Twain.
kumar444
Premium Member
Posts: 62
Joined: Wed Jan 12, 2011 10:01 am

Post by kumar444 »

Thanks Jwiles.

Since we are using 8.1 here, I can't see how rows-per-id solves this. Can you please elaborate?
-------------------------------------------
Kumar
qt_ky
Premium Member
Posts: 2895
Joined: Wed Aug 03, 2011 6:16 am
Location: USA

Post by qt_ky »

I think what jwiles is suggesting is this: if rows-per-id is 1, pass the row through. When rows-per-id is greater than 1, apply your logic in the transformer: trim off the * and compare Role with Duty; if they are equal, pass the row through, otherwise drop it. You can do all that with stage variables and a constraint.
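A hedged Python sketch of that constraint logic (the `keep` helper and its arguments are made up for illustration; in the job this would be stage variables plus an output constraint):

```python
def keep(row, rows_per_id):
    """Decide whether an (id, role, duty) row passes the constraint."""
    id_, role, duty = row
    if rows_per_id[id_] == 1:
        return True  # only one row for this id: pass it through
    # duplicates: trim the '*' off role and compare with duty
    return role.lstrip("*").strip() == duty.strip()

counts = {100: 3}
print(keep((100, "*Athlete", "Athlete"), counts))  # True
print(keep((100, "*Athlete", "Running"), counts))  # False
```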

Eric
jwiles
Premium Member
Posts: 1274
Joined: Sun Nov 14, 2004 8:50 pm
Contact:

Post by jwiles »

My reasoning behind using rows-per-id is based upon your requirements description:
The rule is if ids are equal then output only the row whose role and duty are equal after trimming off the * from the role.
To me, this means that if you have more than one row for an ID, keep only the row(s?) which have equal values for Role and Duty (minus the '*'). To do this, you need to calculate the number of rows present for each ID ("rows-per-id"), since I don't see that value in the example data you provided. The logic qt_ky suggests would be appropriate for the transformer.

Another method that might work involves sorting on ID, then re-sorting on ID with a key-change field assigned in the first sort. The following transformer could then use the key-change column in much the same way as it would use rows-per-id. Either method will provide the results required.
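The key-change variant, sketched in Python under the same caveats (in the job, the Sort stage's Create Key Change Column option produces this flag; the helper name here is made up):

```python
def add_key_change(sorted_rows):
    """Tag each row with 1 on the first row of its Id group, else 0."""
    out, prev_id = [], object()  # sentinel that never equals a real Id
    for row in sorted_rows:
        out.append((*row, 1 if row[0] != prev_id else 0))
        prev_id = row[0]
    return out

flagged = add_key_change([(100, "a"), (100, "b"), (200, "c")])
print(flagged)  # [(100, 'a', 1), (100, 'b', 0), (200, 'c', 1)]
```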

Regards,
- james wiles


All generalizations are false, including this one - Mark Twain.
kumar444
Premium Member
Posts: 62
Joined: Wed Jan 12, 2011 10:01 am

Post by kumar444 »


Input

Id    Role       Duty
100   *Athlete   Athlete
100   *Athlete   Running
100   *Athlete   Swimming
200   *Manager   Accounting
200   *Manager   Manager
300   Clerk      Supervision
300   Clerk      Clerk

Output

Id    Role       Duty
100   *Athlete   Athlete
200   *Manager   Manager
300   Clerk      Clerk




Sorry, I understood your explanation, but I don't see how it applies to this case. I have extended my sample to make it more meaningful.

Thanks jwiles and qt_ky.
-------------------------------------------
Kumar
jwiles
Premium Member
Posts: 1274
Joined: Sun Nov 14, 2004 8:50 pm
Contact:

Post by jwiles »

There are three conditions I do not see represented in your sample data:

1) Only a single row is present for an ID (i.e. no duplicates on ID)
2) There are duplicates present for an ID, but none that meet the keep criteria you mention
3) There are duplicates present for an ID and multiple rows meet the keep criteria you mention

It may be that none of these conditions exist in your data, but as a seasoned developer and data processor, I can't safely assume that. You don't have those conditions listed in the business rules you have provided so far.

#1 is the situation where I see the rows-per-id being used (you have a new ID and there's only 1 row for it... how do you know that in a pre-8.5 transformer, and what do you do?)

#2 Do you delete all rows, or keep one of them?

#3 Do you keep all rows that meet the criteria, or only one of them?

Regards,
- james wiles


All generalizations are false, including this one - Mark Twain.
SURA
Premium Member
Posts: 1229
Joined: Sat Jul 14, 2007 5:16 am
Location: Sydney

Post by SURA »

I am not sure I have understood your need correctly, but based on my understanding it seems straightforward.

In the TFM constraint, use link.Role = link.Duty, so that you get only one row as output on that link.

1) But I am not sure what you will do if you get duplicate values in Duty for the same Id!

2) What if Role does not match Duty?

Let me know if I misunderstood your need.

DS User
kumar444
Premium Member
Posts: 62
Joined: Wed Jan 12, 2011 10:01 am

Post by kumar444 »

@Jwiles,

I am lost with the requirement. Also, I am trying to run a query against the DB but it's taking forever. I need to split this query and design a DS job. I only have some sample data with me. Let me get the actual data from the query first, analyse it, and then post the complete requirement. Sorry if I have wasted your time.

Thanks jwiles. With the suggestions given above I will clarify things and come back here again in a while.
-------------------------------------------
Kumar
kumar444
Premium Member
Posts: 62
Joined: Wed Jan 12, 2011 10:01 am

Post by kumar444 »

Thanks SURA. Your understanding is right as per my requirement. As you said, it looked simple to me: I can first check the condition Role = Duty, and then remove duplicates based on Id downstream. If Role does not match Duty, I need to reject those rows. These are my assumptions per the sample data and requirements I have so far. I really think something is missing here, which is why I want to get the actual data from the whole query, analyse it, and come back.
-------------------------------------------
Kumar
naveen_1985in
Participant
Posts: 18
Joined: Wed Dec 09, 2009 2:16 am

Solution for the Data Given by Kumar444

Post by naveen_1985in »

This can be accomplished using the following design:

SeqFile --> Sort Stage --> Transformer --> Dataset (Target)

In the Sort stage, sort the data on the ID column.

In the Transformer, use 3 stage variables:

Trim(Trim(DSLink4.name, '*', 'A'), ' ', 'A') .... StageVar1: removes * and spaces

Trim(DSLink4.prof, ' ', 'A') .... StageVar2: removes spaces

Compare(StageVar1, StageVar2) .... if the result is 0, they are the same

Add a constraint in the Transformer stage and pass only those records to the target.
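The same trim-and-compare logic, sketched in Python for clarity (the function name is made up for illustration; DataStage's Trim(string, char, 'A') strips all occurrences of char, which is mimicked here with str.replace):

```python
def role_matches_duty(role, duty):
    """Strip '*' and spaces from role, strip spaces from duty, then compare."""
    sv1 = role.replace("*", "").replace(" ", "")
    sv2 = duty.replace(" ", "")
    return sv1 == sv2

print(role_matches_duty("*Athlete", "Athlete"))  # True
print(role_matches_duty("*Athlete", "Running"))  # False
```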
ray.wurlod
Participant
Posts: 54607
Joined: Wed Oct 23, 2002 10:52 pm
Location: Sydney, Australia
Contact:

Post by ray.wurlod »

Your first requirement could be met by a Remove Duplicates stage preserving the First record from each group, assuming the data are sorted correctly.

You specified removal of the asterisk, but the output in your example preserved the asterisk. Removal is most easily effected by a Convert() function.

If you have to have a Transformer stage, then you can effect the removal of duplicates in that stage also, by setting up stage variables to detect a change in ID - one of these becomes your output constraint.
IBM Software Services Group
Any contribution to this forum is my own opinion and does not necessarily reflect any position that IBM may hold.
kumar444
Premium Member
Posts: 62
Joined: Wed Jan 12, 2011 10:01 am

Post by kumar444 »

Thanks Naveen and Ray.

The actual requirement has changed and is as below.


Input


id    name   bid_no   bname   inum    custname
100   *sam   AC100    sam     64546   Assoc
200   tom    AC345    tom     45645   Build
200   tom    AC345    tom     45666   Build
200   tom    AC345    tom     null    Build
500   jam    AC888    jam     98080   facet
300   can    BC399    ban     52344   counter
300   can    AC377    can     52344   counter
288   *rom   AC623    rom     12345   Adam inc
288   rom    AC623    fom     12345   Adam inc
656   gem    BC143    len     65434   kroger llc
656   gem    BC143    len     65434   kroger



Output:

id    name   bid_no   bname   inum    custname
100   *sam   AC100    sam     64546   Assoc
200   tom    AC345    tom     45645   Build
200   tom    AC345    tom     45666   Build
200   tom    AC345    tom     null    Build
500   jam    AC888    jam     98080   facet
300   can    AC377    can     52344   counter
288   *rom   AC623    rom     12345   Adam inc
656   gem    BC143    len     65434   kroger llc

First, look for duplicates in Ids.

CASE 1: I took the records with id=100 and id=500 because each occurs only once with respect to id.
CASE 2: Took the three records with id=200 and bid_no=AC345 because their ids and bid_nos are the same.
CASE 3: Took one record with id=300 because its bid_no starts with AC (not BC).
CASE 4: Took one record with id=288 because its name equals bname when compared after trimming the * from name.
CASE 5: Took one record with id=656 because both records are the same, and retained only one.
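One possible reading of the five cases, sketched in Python. This particular ordering of the rules reproduces the sample output above, but it is only an assumption about precedence (field order assumed: id, name, bid_no, bname, inum, custname), not confirmed business rules:

```python
def dedupe(rows):
    # Group rows by id, preserving first-seen order (dicts keep insertion order).
    groups = {}
    for r in rows:
        groups.setdefault(r[0], []).append(r)

    out = []
    for grp in groups.values():
        if len(grp) == 1:                          # CASE 1: single occurrence
            out.extend(grp)
            continue
        if len({r[2] for r in grp}) > 1:           # CASE 3: bid_nos differ
            out.extend(r for r in grp if r[2].startswith("AC"))
            continue
        # Same bid_no: keep rows where name (minus '*') equals bname (CASES 2 & 4)
        matched = [r for r in grp if r[1].lstrip("*") == r[3]]
        if matched:
            out.extend(matched)
        else:                                      # CASE 5: treat as duplicates
            out.append(grp[0])
    return out

sample = [
    (100, "*sam", "AC100", "sam", 64546, "Assoc"),
    (200, "tom", "AC345", "tom", 45645, "Build"),
    (200, "tom", "AC345", "tom", 45666, "Build"),
    (200, "tom", "AC345", "tom", None, "Build"),
    (500, "jam", "AC888", "jam", 98080, "facet"),
    (300, "can", "BC399", "ban", 52344, "counter"),
    (300, "can", "AC377", "can", 52344, "counter"),
    (288, "*rom", "AC623", "rom", 12345, "Adam inc"),
    (288, "rom", "AC623", "fom", 12345, "Adam inc"),
    (656, "gem", "BC143", "len", 65434, "kroger llc"),
    (656, "gem", "BC143", "len", 65434, "kroger"),
]
result = dedupe(sample)
print([r[0] for r in result])  # [100, 200, 200, 200, 500, 300, 288, 656]
```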

Please help me with suggestions on how to get this in DataStage.
-------------------------------------------
Kumar
SURA
Premium Member
Posts: 1229
Joined: Sat Jul 14, 2007 5:16 am
Location: Sydney

Post by SURA »

Hi

Sort the data and use stage variable(s) for the different scenarios. Use a variable as a flag too. You can achieve it.

My personal request is not to expect more help from others. Split your needs (for example: if you have more than one record and the values differ, then pass it / drop it), apply the filters one by one, and capture the result in a file. It will give you more control.

If you can't get it by yourself, then post what condition you tried and what you were trying to achieve, so that someone can correct it.

I understood what you need and I could post the complete code you need, and it would work, but it would not help you!

Best of luck.

DS User
ray.wurlod
Participant
Posts: 54607
Joined: Wed Oct 23, 2002 10:52 pm
Location: Sydney, Australia
Contact:

Post by ray.wurlod »

I'd probably use a double fork-join design, to get counts by key1 and by key1&key2. The rest of the logic will flow nicely downstream of that.
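A Python sketch of the double fork-join idea (hypothetical two-level keys, e.g. id as key1 and bid_no as key2 in this thread's data; in the job these would be two Aggregator/Join pairs):

```python
from collections import Counter

# Hypothetical rows of (key1, key2).
data = [("A", "x"), ("A", "x"), ("A", "y"), ("B", "x")]

# Two aggregator branches: counts by key1, and by key1 & key2.
count_by_key1 = Counter(k1 for k1, _ in data)
count_by_key1_key2 = Counter(data)

# Join both counts back onto each row for the downstream logic.
annotated = [(k1, k2, count_by_key1[k1], count_by_key1_key2[(k1, k2)])
             for k1, k2 in data]
print(annotated)
```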
IBM Software Services Group
Any contribution to this forum is my own opinion and does not necessarily reflect any position that IBM may hold.
Post Reply