
To find out duplicate records

Posted: Thu Jun 01, 2006 3:55 am
by Amarpreet
I have some duplicate records in my source, which are getting rejected at the target. I can use the Remove Duplicates stage to remove those records, but I need to capture those records in some other table/file. How can I do this? The reject link from the Transformer is not working, as the records are getting rejected at the target (at the end) itself.

Would appreciate any suggestions....

Thanks,

Re: To find out duplicate records

Posted: Thu Jun 01, 2006 5:25 am
by jayantthakore
Hi Amarpreet,
Hi Amarpreet,
In the final Transformer stage you can concatenate all the key columns into one long string and store it in a stage variable named STG1.
Define another stage variable STG2 = STG1 so that STG2 holds the previous row's key.
Compare the two; that way you can detect the duplicate rows, and then set the constraints on the Transformer's output links accordingly.
But it is necessary to sort the data on the key columns before doing this.
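
For anyone who wants to see the logic outside DataStage, here is a minimal Python sketch of that stage-variable comparison, assuming the input is already sorted on the key columns; the column names key1/key2 and the file paths are just placeholders:

[code]
import csv

# Sketch of the stage-variable approach: on key-sorted input, concatenate the
# key columns (STG1), compare with the previous row's value (STG2) and route
# the row to the "unique" or "duplicates" output. Column and file names are
# illustrative only.
def split_duplicates(in_path, unique_path, dup_path, key_cols=("key1", "key2")):
    with open(in_path, newline="") as src, \
         open(unique_path, "w", newline="") as uniq, \
         open(dup_path, "w", newline="") as dup:
        reader = csv.DictReader(src)
        uniq_writer = csv.DictWriter(uniq, fieldnames=reader.fieldnames)
        dup_writer = csv.DictWriter(dup, fieldnames=reader.fieldnames)
        uniq_writer.writeheader()
        dup_writer.writeheader()

        prev_key = None                                    # plays the role of STG2
        for row in reader:
            curr_key = "|".join(row[c] for c in key_cols)  # STG1: concatenated key
            if curr_key == prev_key:                       # same key as previous row
                dup_writer.writerow(row)                   # -> capture as duplicate
            else:
                uniq_writer.writerow(row)                  # -> first occurrence, keep
            prev_key = curr_key
[/code]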

Re: To find out duplicate records

Posted: Thu Jun 01, 2006 5:55 am
by dssiddu
Amarpreet wrote: I can use the Remove Duplicates stage to remove those records. But I need to capture those records in some other table/file.
As you said, you want a whole copy of the records and also want to remove the duplicates, so use a Copy stage before the Remove Duplicates stage.


file --> Copy stage --> 1) File or Table
                    --> 2) Remove Duplicates stage
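
If it helps, the effect of that Copy stage fork can be sketched in Python like this; the single key column "key" and the file names are assumptions for the example:

[code]
import csv

# Rough equivalent of forking the stream with a Copy stage: one branch keeps
# every record as-is, the other branch is de-duplicated on the key column.
# The column name "key" and the file names are placeholders.
def fork_and_dedup(in_path, full_copy_path, dedup_path, key_col="key"):
    seen = set()
    with open(in_path, newline="") as src, \
         open(full_copy_path, "w", newline="") as full, \
         open(dedup_path, "w", newline="") as dedup:
        reader = csv.DictReader(src)
        full_writer = csv.DictWriter(full, fieldnames=reader.fieldnames)
        dedup_writer = csv.DictWriter(dedup, fieldnames=reader.fieldnames)
        full_writer.writeheader()
        dedup_writer.writeheader()
        for row in reader:
            full_writer.writerow(row)      # branch 1: full copy of every record
            if row[key_col] not in seen:   # branch 2: first occurrence only
                seen.add(row[key_col])
                dedup_writer.writerow(row)
[/code]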

Re: To find out duplicate records

Posted: Tue Jun 06, 2006 2:33 am
by sreeni
Hi

If you are using the Oracle Enterprise stage as the target, it has a reject option that captures the rejected duplicate records, and you can load them into a dataset or sequential file based on the key columns of the target table.


SOURCE ---> ORACLE ENTERPRISE STAGE ---(reject link)---> SEQUENTIAL FILE

I think this would meet your requirement.

Thanks

Posted: Tue Jun 06, 2006 7:52 am
by thompsonp
Using the Oracle stage limits you to upserts and the rejected data is for rows that could not be inserted or updated. This means the duplicate could already reside in the database and is not quite the same as the problem described in this post.

A Remove Duplicates stage requires your data to be sorted, so presumably you have a Sort stage prior to it? In the Sort stage, generate a key change column on the columns that define the key. If you are sorting by more than the key columns, for instance to ensure the first record is the one you want to keep, you can achieve the same thing using two sorts and generating a cluster key change column on the key fields.

After the sort, simply filter on the key change or cluster key change column to identify the duplicates (rows where the value is 0).
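
Purely as an illustration (not DataStage code), the key change flag and the filter amount to something like this in Python, assuming the rows are already sorted on the key columns and using placeholder column names key1/key2:

[code]
# Sketch of the Key Change column approach on key-sorted rows.
# "key1"/"key2" and the flag name "keyChange" are placeholders.
def tag_key_change(rows, key_cols=("key1", "key2")):
    tagged = []
    prev_key = None
    for row in rows:
        curr_key = tuple(row[c] for c in key_cols)
        row["keyChange"] = 1 if curr_key != prev_key else 0  # 1 = first row of a key group
        prev_key = curr_key
        tagged.append(row)
    return tagged

def split_on_key_change(rows):
    keepers, duplicates = [], []
    for row in rows:
        (keepers if row["keyChange"] == 1 else duplicates).append(row)
    return keepers, duplicates   # duplicates are the rows flagged 0
[/code]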

Posted: Tue Jun 06, 2006 11:51 am
by mujeebur
It's a roundabout way, but you can achieve what you want. From the Transformer, output your data to both a dataset (hashed file) and a sequential file. The dataset will automatically remove the duplicates based on the key column.

Now you can compare/diff the dataset and the sequential file to output the duplicated records into a file. From the dataset (hashed file) you can load into the Oracle stage.

Posted: Tue Jun 06, 2006 12:21 pm
by vinaymanchinila
Using two sorts gave the results with good performance.

Re: To find out duplicate records

Posted: Tue Jun 06, 2006 4:52 pm
by s_boyapati
Use a Sort stage with the Hash method and set the "Create Key Change Column" property to true, so that the first row in each group gets the new column value 1 and the duplicates get 0. Then use a Switch or Filter stage to separate the rows with 1 and 0. It is a simple way and it shows better performance as well. I have used it.

Sree


Posted: Tue Jun 06, 2006 9:32 pm
by Amarpreet
Thanks all.
It's working fine.

Posted: Wed Jun 07, 2006 4:42 am
by thompsonp
mujeebur wrote ...
It's a roundabout way, but you can achieve what you want. From the Transformer, output your data to both a dataset (hashed file) and a sequential file. The dataset will automatically remove the duplicates based on the key column.

Now you can compare/diff the dataset and the sequential file to output the duplicated records into a file. From the dataset (hashed file) you can load into the Oracle stage.
This is a PX job, so it is using datasets, not hashed files. Datasets do not automatically remove duplicates.