Removing Duplicates : unique situation

Post questions here relative to DataStage Enterprise/PX Edition for such areas as Parallel job design, Parallel datasets, BuildOps, Wrappers, etc.

Moderators: chulett, rschirm, roy

abc123
Premium Member
Posts: 605
Joined: Fri Aug 25, 2006 8:24 am

Removing Duplicates : unique situation

Post by abc123 »

I have gone through all relevant posts. I have data as follows:

Key: Col2, Col3, Col4

Col1 Col2 Col3 Col4 Col_Date
1,10,A,100,02/10/2011
2,20,A,200,03/10/2011
3,20,A,200,04/10/2011
4,30,B,300,05/10/2011
5,20,A,200,06/10/2011

As you can see, lines 2, 3 and 5 are duplicates in Col2, Col3 and Col4. I would like to remove line 3, which is a duplicate of line 2, but keep line 5 since it appears later (after line 4, which has a different key) even though its 3 key values are the same as lines 2 and 3.

1) I tried the Sort stage with the KeyChange column set to True. However, this sorts the data first, which defeats my purpose.

2) I also tried 'Don't Sort, Previously Sorted' for these 3 columns in the Sort stage, but that gives an error saying 'Record 4 not sorted'.

So what I need is the KeyChange feature of the Sort stage without doing the sort. Any ideas would be appreciated.
abc123
Premium Member
Posts: 605
Joined: Fri Aug 25, 2006 8:24 am

Post by abc123 »

Thank you tbharathkumar for your response.

Unfortunately, this is more complex. Both line 2 and line 5 need to be retained even though they are duplicates. Please read my first post again.

Thanks.
zulfi123786
Premium Member
Posts: 730
Joined: Tue Nov 04, 2008 10:14 am
Location: Bangalore

Post by zulfi123786 »

The KeyChange column of the Sort stage can be implemented very easily in the Transformer stage, using a combination of stage variables to hold the key values of the current and previous records.
tbharathkumar
Participant
Posts: 26
Joined: Mon Aug 27, 2007 6:27 am
Location: Des Moines

Post by tbharathkumar »

The design mentioned will retain both line 2 and line 5. As mentioned by another user, you can also implement this in a Transformer (DS version 8.x).
Regards,
Bharath Tipirisetty
VijayDS
Participant
Posts: 38
Joined: Thu Jun 18, 2009 3:50 am

Re: Removing Duplicates : unique situation

Post by VijayDS »

Hi ABC,

You can do this with the Remove Duplicates stage.
On one link use 'retain first record' and on the other link use 'retain last record'. You will then get records 2 and 5.
Thanks
Vijay
blewip
Participant
Posts: 81
Joined: Wed Nov 10, 2004 10:55 am
Location: London

Re: Removing Duplicates : unique situation

Post by blewip »

From your requirements I don't think any suggestion given so far will work.

I believe you want any duplicates that are together to be removed yet if they come later on (after a new key) they should be kept.

First of all, the processing should all be done in sequential mode; otherwise partitioning could scramble the order of the records. This appears to be important.

Then use stage variables in a Transformer to hold the old values; if they all match, exclude the record, otherwise pass it through.
Modern Life is Rubbish - Blur
abc123
Premium Member
Posts: 605
Joined: Fri Aug 25, 2006 8:24 am

Post by abc123 »

blewip, you are right. You seem to understand the requirement. I know about the sequential mode part.

Can you please give a little more detail about your solution?

Thanks.
samyamkrishna
Premium Member
Posts: 258
Joined: Tue Jul 04, 2006 10:35 pm
Location: Toronto

Post by samyamkrishna »

If the data is coming from a sequential file in the order that you have mentioned, then in the Transformer check whether the current record is the same as the previous record (on the key columns).

If they are the same, drop the second record.

This way you will be able to retain records 2 and 5.
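A minimal sketch of that consecutive-key comparison, in Python for illustration only (Transformer derivations are not directly runnable as text): `itertools.groupby` groups only *adjacent* equal keys, which is exactly the KeyChange semantics without a sort.

```python
from itertools import groupby

# Rows as (Col1, Col2, Col3, Col4, Col_Date); the key is (Col2, Col3, Col4).
rows = [
    (1, 10, 'A', 100, '02/10/2011'),
    (2, 20, 'A', 200, '03/10/2011'),
    (3, 20, 'A', 200, '04/10/2011'),  # consecutive duplicate of row 2 -> drop
    (4, 30, 'B', 300, '05/10/2011'),
    (5, 20, 'A', 200, '06/10/2011'),  # key reappears after a break -> keep
]

# groupby (without sorting first) groups only consecutive equal keys;
# keeping the first row of each run removes adjacent duplicates only.
deduped = [next(group) for _, group in groupby(rows, key=lambda r: r[1:4])]
print([r[0] for r in deduped])  # [1, 2, 4, 5]
```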
paultechm
Participant
Posts: 27
Joined: Wed Jul 25, 2007 2:09 am

Re: Removing Duplicates : unique situation

Post by paultechm »

Sort the input data based on column B and enable the key change column.

Use a Filter to split the input data based on the change value, '1' or '0'.

Left join the '0' stream with the '1' stream based on column B and retrieve column A from the '1' stream.

Filter the data on Stream0.ColumnA - Stream1.ColumnA <> 1 and pass it into a Funnel.

Funnel stream 1 in as well; this will give you the required output.

Note: in case column A is not a numeric value, create a dummy column with a numeric value.
blewip
Participant
Posts: 81
Joined: Wed Nov 10, 2004 10:55 am
Location: London

Re: Removing Duplicates : unique situation

Post by blewip »

paultechm wrote:Sort the input data based on the column B and enable the key change column
This is far too complex, and we cannot sort the data in the first place.

SamYamKrishna is correct, you will need svOldCol1, svOldCol2 etc.

Should be fairly straightforward.
Modern Life is Rubbish - Blur
VijayDS
Participant
Posts: 38
Joined: Thu Jun 18, 2009 3:50 am

Post by VijayDS »

A very simple solution to get only records 2 and 5:

You can do this with the Remove Duplicates stage. On one link use a Remove Duplicates stage with the 'retain first record' option, on the second link use one with the 'retain last record' option, and funnel the data from both links. You will then get records 2 and 5.

I hope you understand the logic.
Thanks
Vijay
blewip
Participant
Posts: 81
Joined: Wed Nov 10, 2004 10:55 am
Location: London

Post by blewip »

VijayDS wrote:I hope you understand the logic.
I understand your logic but it's flawed.

If a copy of record 5 appeared at record 10, you would want records 2, 5 and 10 to be kept. Your logic would keep only 2 and 10.
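The flaw can be shown concretely. A Python sketch (illustrative only; row 10 and the extra key are hypothetical, extending the thread's sample data) comparing the global retain-first/retain-last approach with consecutive de-duplication:

```python
# Rows as (Col1, Col2, Col3, Col4); the key is (Col2, Col3, Col4).
# Row 10 repeats the key of row 5 after an intervening different key.
rows = [
    (1, 10, 'A', 100),
    (2, 20, 'A', 200),
    (3, 20, 'A', 200),   # consecutive duplicate of row 2 -> drop
    (4, 30, 'B', 300),
    (5, 20, 'A', 200),   # key reappears -> keep
    (6, 40, 'C', 400),
    (10, 20, 'A', 200),  # key reappears again -> keep
]
key = lambda r: r[1:]

# Remove Duplicates approach: first and last occurrence per key, globally.
first, last = {}, {}
for r in rows:
    first.setdefault(key(r), r)
    last[key(r)] = r
kept_global = sorted({r[0] for r in first.values()} | {r[0] for r in last.values()})

# Consecutive de-duplication: keep a row only when its key differs
# from the immediately preceding row's key.
kept_consecutive = [r[0] for i, r in enumerate(rows)
                    if i == 0 or key(r) != key(rows[i - 1])]

print(kept_global)       # [1, 2, 4, 6, 10] -- record 5 is lost
print(kept_consecutive)  # [1, 2, 4, 5, 6, 10]
```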
Modern Life is Rubbish - Blur
jwiles
Premium Member
Posts: 1274
Joined: Sun Nov 14, 2004 8:50 pm

Post by jwiles »

blewip and krishna are correct in their suggestions. Do not sort the file. Read it with a Sequential File stage and follow that with a Transformer running in sequential mode. Use stage variables to detect a change in the key column values: if anything changes, keep the current record; if nothing changes, drop the record.

svKeyChange: if svInKey1 = svPrevKey1 and svInKey2 = svPrevKey2 and svInKey3 = svPrevKey3 then 0 else 1

In your output link constraint: svKeyChange = 1

Derive the svPrevKey variables below svKeyChange in the stage variable list, so they still hold the previous record's values when the comparison is evaluated.

You need to keep the data in one partition (sequential) until you have dropped the records, or you will not get the results you need.
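In Python terms (an illustrative sketch, since stage variable derivations are not runnable here; `keep_key_changes` is a hypothetical name), the svKeyChange / svPrevKey pattern looks like this:

```python
def keep_key_changes(rows, key):
    """Yield rows whose key differs from the immediately preceding row's key."""
    prev = object()  # sentinel: never equal to a real key
    for row in rows:
        k = key(row)
        if k != prev:  # svKeyChange = 1 -> record passes the constraint
            yield row
        prev = k       # carry the key forward, like svPrevKey1..3

# Sample data from the thread: (Col1, Col2, Col3, Col4), key = (Col2, Col3, Col4).
rows = [
    (1, 10, 'A', 100),
    (2, 20, 'A', 200),
    (3, 20, 'A', 200),
    (4, 30, 'B', 300),
    (5, 20, 'A', 200),
]
print([r[0] for r in keep_key_changes(rows, key=lambda r: r[1:])])  # [1, 2, 4, 5]
```

Note the order of operations inside the loop mirrors the stage variable ordering: the comparison runs before `prev` is updated.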

Regards,
- james wiles


All generalizations are false, including this one - Mark Twain.
gateleys
Premium Member
Posts: 992
Joined: Mon Aug 08, 2005 5:08 pm
Location: USA

Post by gateleys »

Assuming that your sequential file contains the rows in that order, in the Transformer, create a stage variable where you compare the previous value of the concatenated keys with the current value.
gateleys