How do we capture duplicate records

Post questions here relative to DataStage Server Edition for such areas as Server job design, DS Basic, Routines, Job Sequences, etc.

Moderators: chulett, rschirm, roy

pongal
Participant
Posts: 77
Joined: Thu Mar 04, 2004 4:46 am

How do we capture duplicate records

Post by pongal »

How do we capture duplicate records while extracting data from a database?

I am extracting from a SQL Server table. Is there any function or routine in DataStage to capture duplicate records? I don't want to rely on a user-defined query to capture the duplicates, because the source table has no key fields.
Can anyone suggest the easiest method for tracking duplicates or bad data?
ray.wurlod
Participant
Posts: 54607
Joined: Wed Oct 23, 2002 10:52 pm
Location: Sydney, Australia

Post by ray.wurlod »

Depends.

What is your definition of "duplicate" here? All columns the same or just a selected subset of columns the same? Or even just one column (a candidate key but not a key)?

The techniques exist for all of these, but they're different.

They can be cumbersome or elegant. For example, if a duplicate were "all columns the same", a cumbersome approach would be to compare each column of every pair of rows.

Sorting the extracted result set immediately provides a level of elegance.

After that, the algorithm really does depend on what you're seeking to do.

And what do you mean by "capture"? If you want to prevent duplicate rows from being further processed, it's a simple matter of extracting using SELECT DISTINCT rather than plain SELECT.
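
For example, with a purely illustrative table name:

  -- duplicate rows pass through unchanged
  SELECT * FROM src_table;

  -- identical rows collapse to a single row
  SELECT DISTINCT * FROM src_table;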

If you want to answer the question "have I seen this row before" you need to maintain some kind of reference lookup mechanism that can be checked every row (or every adjacent pair of rows if the input is sorted).

And so on. Be more specific about the requirement and we can be more specific about the advice we offer.

(And welcome aboard the DSXChange!)
IBM Software Services Group
Any contribution to this forum is my own opinion and does not necessarily reflect any position that IBM may hold.
vmcburney
Participant
Posts: 3593
Joined: Thu Jan 23, 2003 5:25 pm
Location: Australia, Melbourne

Post by vmcburney »

Don't you just love database tables without primary keys?

A very simple dedupe method is to extract the table into a hash file that has a primary key defined. The hash file can have a simple key, or a key consisting of every field in the table. Usually there are some fields on your table, such as creation-date fields, that you don't want included in the deduping. This gives you a deduped view of your data.

This won't capture or tell you what the duplicates were. To identify duplicates you will need to do some type of lookup, as Ray has suggested. I'm not sure that identifying duplicates during the ETL process will do you much good: what are you going to do with them then? Why not write a SQL statement that finds the duplicates in the table and produces a report? That way you can find out what the duplicates are, and decide how to handle them, before you design your ETL processes.
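
A hedged sketch of that kind of report in SQL Server, using the column names pongal posts later in the thread and an assumed table name:

  SELECT corpcode, location, account, project, cfis,
         COUNT(*) AS occurrences
  FROM   src_table                                     -- assumed table name
  GROUP BY corpcode, location, account, project, cfis
  HAVING COUNT(*) > 1;

Each row returned is a combination of column values that appears more than once, along with how many times it occurs.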
cnorris
Premium Member
Posts: 3
Joined: Tue Oct 14, 2003 9:23 am

Could use a stage variable

Post by cnorris »

Another way is to create a stage variable to hold the last key value, then compare the old and new key values to set a flag that is evaluated by the constraint of a transform; the duplicate row could be sent to a flat file for reporting.
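
As a rough sketch only (link and column names are illustrative, and the input must arrive sorted on the key columns), the stage variables and constraints might look like this:

  Stage variables, evaluated top to bottom for each row:
    svCurrKey : In.corpcode : "|" : In.location : "|" : In.account
    svIsDup   : If svCurrKey = svLastKey Then @TRUE Else @FALSE
    svLastKey : svCurrKey

  Constraint on the duplicates link (to a flat file): svIsDup
  Constraint on the main output link: Not(svIsDup)

Because svLastKey is declared after svIsDup, the comparison is made against the previous row's key before the value is updated for the current row.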
CN
pongal
Participant
Posts: 77
Joined: Thu Mar 04, 2004 4:46 am

Post by pongal »

I'll be more specific this time: a duplicate is a record where every column has the same value as another record.

For example, the source table being extracted:

  corpcode  location  account  project  cfis
  55020     AB000     7897890  NULL     NULL
  55020     AB000     7897890  NULL     NULL
  44578     NL789     2345566  kpaccc   NULL
  44578     NL789     2345566  kpaccc   NULL

And the destination table definition:

  Column            Constraint
  Loadid            key
  RecrdTransformed  null
  corpcode          not null
  location          not null
  account           not null
  project           not null
  cfis              not null
chulett
Charter Member
Posts: 43085
Joined: Tue Nov 12, 2002 4:34 pm
Location: Denver, CO

Post by chulett »

There's no magic bullet. You will need to create a process to compare one row to another, just like you would with any other 'tool' or database. A couple of suggestions off the top of my head:

Drop the output to a flat file and then use your O/S to remove the duplicates. For example, under UNIX you could use the sort -u command; you'd have to tell me whether there's an equivalent under Windows.
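
A rough sketch of that under UNIX (file names are illustrative):

  sort -u extract.txt > deduped.txt            # unique rows only
  sort extract.txt | uniq -d > duplicates.txt  # one copy of each row that repeats

The second command is handy if you also want to report which rows were duplicated.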

Sort your data. Create a new field which is the concatenated result of all of your fields. Make sure you convert the nulls to something else consistently first - like a single space - then concatenate. Then use a Stage Variable, as others have suggested, to compare row to row and discard the duplicates.
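
A hedged sketch of that derivation, using illustrative link and column names (and so on for the remaining columns); the row-to-row comparison then follows the same stage-variable pattern shown earlier in the thread:

  svCorp   : If IsNull(In.corpcode) Then " " Else In.corpcode
  svProj   : If IsNull(In.project)  Then " " Else In.project
  svCfis   : If IsNull(In.cfis)     Then " " Else In.cfis
  svRowKey : svCorp : "|" : svProj : "|" : svCfis

svRowKey is then compared against the previous row's value to flag and discard duplicates.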
-craig

"You can never have too many knives" -- Logan Nine Fingers
datastage
Participant
Posts: 229
Joined: Wed Oct 23, 2002 10:10 am
Location: Omaha

How do we capture duplicate records

Post by datastage »

Typically I use a method like Craig suggests, where I concatenate the columns into one string and use a stage variable to compare.

I've never used checksum CRCs much, so I'll pose this question to the gang: is using Checksum() another valid way to search for duplicates (especially if the source is a text file and there isn't the option of querying for dups on the db)? Or is this method more valuable for changed data capture and duplicate elimination?
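
Assuming the Server engine's Checksum() function is what would be used here (an assumption, not something stated above), a minimal sketch on sorted input might be:

  svRowSum  : Checksum(In.corpcode : "|" : In.location : "|" : In.account)
  svIsDup   : If svRowSum = svLastSum Then @TRUE Else @FALSE
  svLastSum : svRowSum

Storing the checksum rather than the full concatenated string is mainly a space saving, for example when keeping a compact "have I seen this row?" reference in a hashed file; Ray addresses the collision risk below.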
Byron Paul
WARNING: DO NOT OPERATE DATASTAGE WITHOUT ADULT SUPERVISION.

"Strange things are afoot in the reject links" - from Bill & Ted's DataStage Adventure
ray.wurlod
Participant
Posts: 54607
Joined: Wed Oct 23, 2002 10:52 pm
Location: Sydney, Australia

Post by ray.wurlod »

Checksum is not a perfect way to check for duplicates, because there is a slim chance (approximately one in 2,147,483,648 for CRC32) that a false positive will occur - that is, that different data will generate the same checksum. That's close enough for many folks.
IBM Software Services Group
Any contribution to this forum is my own opinion and does not necessarily reflect any position that IBM may hold.
datastage
Participant
Posts: 229
Joined: Wed Oct 23, 2002 10:10 am
Location: Omaha

Post by datastage »

Thanks, Ray. That's why I've tended to stay away from checksum in the past, but seeing an actual number for the odds should clear up any doubt about the effectiveness of the method.
Byron Paul
WARNING: DO NOT OPERATE DATASTAGE WITHOUT ADULT SUPERVISION.

"Strange things are afoot in the reject links" - from Bill & Ted's DataStage Adventure
ray.wurlod
Participant
Posts: 54607
Joined: Wed Oct 23, 2002 10:52 pm
Location: Sydney, Australia

Post by ray.wurlod »

It's only approximate, remember. The chance of a false positive could be as high as one in 2,147,483,647. :!:
IBM Software Services Group
Any contribution to this forum is my own opinion and does not necessarily reflect any position that IBM may hold.
datastage
Participant
Posts: 229
Joined: Wed Oct 23, 2002 10:10 am
Location: Omaha

Post by datastage »

Okay...let's phrase it this way...forget the exact odds. Here is the scenario that counts:

You are a consultant ETL developer. You use this method, and with your luck you get a false positive causing an error during your ETL process. You explain to the client the method used and what happened. Would you still get blamed or do you get off the hook? :roll: That's what really matters!
Byron Paul
WARNING: DO NOT OPERATE DATASTAGE WITHOUT ADULT SUPERVISION.

"Strange things are afoot in the reject links" - from Bill & Ted's DataStage Adventure
chulett
Charter Member
Posts: 43085
Joined: Tue Nov 12, 2002 4:34 pm
Location: Denver, CO

Post by chulett »

Here's what REALLY matters. By the time this happens:

A) Does anybody notice?
B) Are you still there?

:lol: Kidding, of course. Sorta. :wink:

I'd amend one of your statements to: "You document the method used and what could happen."
-craig

"You can never have too many knives" -- Logan Nine Fingers