To find out duplicate records

Post questions here relative to DataStage Enterprise/PX Edition for such areas as Parallel job design, Parallel datasets, BuildOps, Wrappers, etc.

Moderators: chulett, rschirm, roy

Amarpreet
Participant
Posts: 20
Joined: Tue Apr 04, 2006 11:45 pm
Location: India
Contact:

To find out duplicate records

Post by Amarpreet »

I have some duplicate records in my source, which are getting rejected at the target. I can use the Remove Duplicates stage to remove those records, but I need to capture those records in some other table/file. How can I do this? A reject link from the Transformer is not working, as the records are getting rejected at the target (at the end) itself.

Would appreciate any suggestions....

Thanks,
jayantthakore
Participant
Posts: 42
Joined: Tue Jan 18, 2005 5:48 am

Re: To find out duplicate records

Post by jayantthakore »

Hi Amarpreet,
In the final Transformer stage you can concatenate all key columns into one long string and store it in a stage variable named STG1.
In a second stage variable, STG2, keep the previous row's value (STG2 = STG1, evaluated after the comparison).
Compare the two: when they match, the current row is a duplicate of the previous one. Set the constraints on the Transformer output links accordingly.
But it is necessary to sort the data on the key columns before doing this.
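The stage-variable logic above can be roughed out in Python (illustrative only; in the real job this lives in Transformer stage variables and link constraints, and the column names here are made up for the example):

```python
# Sketch of the stage-variable technique: over data sorted on the key
# columns, compare each row's concatenated key to the previous row's key.
def split_duplicates(rows, key_cols):
    """Return (keep, dupes): first row per key goes to keep, the rest to dupes."""
    keep, dupes = [], []
    stg2 = None  # holds the previous row's concatenated key (like STG2)
    for row in rows:
        stg1 = "|".join(str(row[c]) for c in key_cols)  # like STG1
        if stg1 == stg2:
            dupes.append(row)   # same key as previous row: a duplicate
        else:
            keep.append(row)    # first occurrence of this key
        stg2 = stg1             # shift current key into the "previous" slot
    return keep, dupes

rows = sorted(
    [{"id": 1, "val": "a"}, {"id": 1, "val": "b"}, {"id": 2, "val": "c"}],
    key=lambda r: r["id"],
)
keep, dupes = split_duplicates(rows, ["id"])
```

As in the post, this only works if the input is already sorted on the key columns, since only adjacent rows are compared.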
regards,
Jayant S Thakore
dssiddu
Participant
Posts: 66
Joined: Mon Nov 07, 2005 10:28 pm
Contact:

Re: To find out duplicate records

Post by dssiddu »

Amarpreet wrote:I can use remove duplicate stage to remove those records. But I need to capture those records in some other table/file.

As you said, you want a complete copy of the records and also want to remove the duplicates, so use a Copy stage before the Remove Duplicates stage.


file ---> Copy stage ---> 1) File or table
                     ---> 2) Remove Duplicates stage
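A sketch of that Copy-stage flow in Python (illustrative only, not DataStage; note the set-based dedup here does not need sorted input, whereas the real Remove Duplicates stage does):

```python
# Sketch of the Copy-stage flow: one input stream feeds two outputs,
# a complete copy and a de-duplicated copy keyed on the chosen columns.
def copy_and_dedup(rows, key_cols):
    full_copy = list(rows)            # output 1: every record, duplicates included
    seen, deduped = set(), []
    for row in rows:                  # output 2: first record seen per key
        key = tuple(row[c] for c in key_cols)
        if key not in seen:
            seen.add(key)
            deduped.append(row)
    return full_copy, deduped

rows = [{"id": 1}, {"id": 1}, {"id": 2}]
full_copy, deduped = copy_and_dedup(rows, ["id"])
```

Note this gives you the full file and the de-duplicated file; to get a file containing only the duplicates, you would still need to compare the two outputs or use one of the key-change approaches described in this thread.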
sreeni
Participant
Posts: 17
Joined: Thu Jun 03, 2004 5:08 am

Re: To find out duplicate records

Post by sreeni »

Hi

If you are using the Oracle Enterprise stage as the target, it has a reject option that can route duplicate records (based on the key columns of the target table) to a dataset or sequential file.


SOURCE ---> ORACLE ENTERPRISE STAGE ---> SEQUENTIAL FILE

I think this would meet your requirement.

Thanks
thompsonp
Premium Member
Posts: 205
Joined: Tue Mar 01, 2005 8:41 am

Post by thompsonp »

Using the Oracle stage limits you to upserts and the rejected data is for rows that could not be inserted or updated. This means the duplicate could already reside in the database and is not quite the same as the problem described in this post.

A remove duplicates stage requires your data to be sorted so presumably you have a sort stage prior to it? In the sort stage generate a key change on the columns that define the key. If you are sorting by more than the key columns, for instance to ensure the first record is the one you want to keep, you can achieve the same thing using 2 sorts and generating a cluster key column on the key fields.

After the sort simply filter on the key change or cluster key change to identify duplicates (values of 0).
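An illustrative Python sketch of the key change flag (in the real job the Sort stage generates this column for you; the column and key names here are made up for the example):

```python
# Sketch of the "key change" technique: over sorted input, the first row of
# each key group is flagged 1 and every subsequent row in the group is 0.
def add_key_change(rows, key_cols):
    flagged, prev = [], None
    for row in rows:                  # rows must already be sorted on key_cols
        key = tuple(row[c] for c in key_cols)
        flagged.append({**row, "keyChange": 1 if key != prev else 0})
        prev = key
    return flagged

rows = sorted([{"k": "A"}, {"k": "A"}, {"k": "B"}], key=lambda r: r["k"])
flagged = add_key_change(rows, ["k"])
uniques = [r for r in flagged if r["keyChange"] == 1]     # first row per key
duplicates = [r for r in flagged if r["keyChange"] == 0]  # values of 0
```

The downstream filter then simply routes the two flag values to separate links, which is exactly the split the original poster needs.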
mujeebur
Participant
Posts: 46
Joined: Sun Mar 06, 2005 3:02 pm
Location: Philly,USA

Post by mujeebur »

It's a roundabout way, but you can achieve what you want. From the Transformer, output your data to both a dataset (hashed file) and a sequential file. The dataset will automatically remove the duplicates based on the key column.

Now you can compare/diff the dataset and the sequential file to output the duplicate records into a file. From the dataset you can load into the Oracle stage.
vinaymanchinila
Premium Member
Posts: 353
Joined: Wed Apr 06, 2005 8:45 am

Post by vinaymanchinila »

Using two sorts gave the results with good performance.
Thanks,
Vinay
s_boyapati
Premium Member
Posts: 70
Joined: Thu Aug 14, 2003 6:24 am
Contact:

Re: To find out duplicate records

Post by s_boyapati »

Use the Sort stage with the Hash method and set the "Create Key Change Column" property to true, so that the first row of each key group gets a value of 1 in the new column and the duplicates get 0. Then use a Switch or Filter stage to filter the rows with 1 and 0. It is a simple approach, shows good performance too, and I have used it.

Sree

Amarpreet wrote:I have some duplicate records in my source, which are getting rejected at the target. I can use remove duplicate stage to remove those records. But I need to capture those records in some other table/file. How can I do this? Reject link from transformer is not working, as the records are getting rejected at the target (at end) itself.

Would appreciate any suggestions....

Thanks,
Amarpreet
Participant
Posts: 20
Joined: Tue Apr 04, 2006 11:45 pm
Location: India
Contact:

Post by Amarpreet »

Thanks all.
It's working fine.
thompsonp
Premium Member
Posts: 205
Joined: Tue Mar 01, 2005 8:41 am

Post by thompsonp »

mujeebur wrote:It's a roundabout way, but you can achieve what you want. From the Transformer, output your data to both a dataset (hashed file) and a sequential file. The dataset will automatically remove the duplicates based on the key column.

Now you can compare/diff the dataset and the sequential file to output the duplicate records into a file. From the dataset you can load into the Oracle stage.

This is a PX job, so it is using datasets, not hashed files. Datasets do not automatically remove duplicates.