
How do we capture duplicate records

Posted: Wed Mar 10, 2004 12:46 am
by pongal
How do we capture duplicate records while extracting data from a database?

Here I am using a SQL Server table as the extraction source.
Is there any function or routine in DataStage to capture duplicate records?
I don't want a user-defined query to capture the duplicates, because the source does not have key fields.
Can anyone suggest the easiest method for tracking duplicates or bad data?

Posted: Wed Mar 10, 2004 2:40 am
by ray.wurlod
Depends.

What is your definition of "duplicate" here? All columns the same or just a selected subset of columns the same? Or even just one column (a candidate key but not a key)?

The techniques exist for all of these, but they're different.

They can be cumbersome or elegant. For example, if a duplicate were "all columns the same", a cumbersome approach would be to compare each column of every pair of rows.

Sorting the extracted result set immediately provides a level of elegance.

After that, the algorithm really does depend on what you're seeking to do.

And what do you mean by "capture"? If you want to prevent duplicate rows from being further processed, it's a simple matter of extracting using SELECT DISTINCT rather than plain SELECT.

If you want to answer the question "have I seen this row before" you need to maintain some kind of reference lookup mechanism that can be checked every row (or every adjacent pair of rows if the input is sorted).
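
For what it's worth, here is a minimal sketch of that lookup in Python - illustrative only, since in a DataStage job the same role would typically be played by a hashed file lookup:

    # Sketch: a "have I seen this row before?" check using an in-memory set.
    # With sorted input, remembering only the previous row would be enough.
    def split_duplicates(rows):
        seen = set()
        unique, dups = [], []
        for row in rows:
            key = tuple(row)         # here "duplicate" means all columns the same
            if key in seen:
                dups.append(row)     # captured rather than silently discarded
            else:
                seen.add(key)
                unique.append(row)
        return unique, dups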

And so on. Be more specific about the requirement and we can be more specific about the advice we offer.

(And welcome aboard the DSXChange!)

Posted: Wed Mar 10, 2004 9:44 pm
by vmcburney
Don't you just love database tables without primary keys?

A very simple dedupe method is to extract the table into a hash file which has a primary key defined. The hash file can have a simple key or it can have a key consisting of every field in the table. Usually there are some fields on your table that you don't want included in the deduping such as creation date fields. This will give you a deduped view of your data.
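
A rough Python analogue of that hashed-file behaviour, purely for illustration - the field names, the sample rows and the choice to leave a creation date out of the key are all made up:

    # Sketch: a dict plays the part of the hashed file. Writing two rows with
    # the same key keeps only one of them, which is exactly how duplicate keys
    # behave when written to a hashed file.
    rows = [
        {"corpcode": "55020", "location": "AB000", "account": "7897890",
         "project": None, "cfis": None, "creation_date": "2004-03-01"},
        {"corpcode": "55020", "location": "AB000", "account": "7897890",
         "project": None, "cfis": None, "creation_date": "2004-03-02"},
    ]
    key_fields = ["corpcode", "location", "account", "project", "cfis"]  # creation_date excluded

    deduped = {}
    for row in rows:
        deduped[tuple(row[f] for f in key_fields)] = row   # last duplicate wins

    print(len(deduped), "distinct row(s)")                  # -> 1 distinct row(s)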

This won't capture or tell you what the duplicates were. To identify duplicates you will need to do some type of lookup, as Ray has suggested. I'm not sure whether identifying duplicates during the ETL process will do you much good - what are you going to do with them then? Why not write a SQL statement that finds the duplicates in the table and produces a report? That way you can find out what the duplicates are and decide how you are going to handle them before you design your ETL processes.
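
The report query is just a GROUP BY over the columns of interest with a HAVING clause. A sketch of running it from Python - the DSN, table name and column names below are placeholders, not anything from your system:

    # Sketch: report duplicate rows straight from SQL Server before designing
    # the ETL job. All connection and object names here are placeholders.
    import pyodbc

    sql = """
        SELECT corpcode, location, account, project, cfis, COUNT(*) AS copies
        FROM   source_table
        GROUP  BY corpcode, location, account, project, cfis
        HAVING COUNT(*) > 1
    """

    conn = pyodbc.connect("DSN=my_sqlserver_dsn")        # hypothetical DSN
    for row in conn.cursor().execute(sql):
        print(row.corpcode, row.location, row.account, row.project, row.cfis, row.copies)

Note that GROUP BY treats NULLs as equal, so rows that match only because both have NULL in a column will still be grouped together.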

could use a stage variable

Posted: Thu Mar 11, 2004 8:37 am
by cnorris
Another way is to create a stage variable to hold the last key value and then compare the old and new key values to set a flag that is evaluated by the constraint part of a Transformer; the duplicate rows can be sent to a flat file for reporting.
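
The same pattern, sketched in Python just to make the flow concrete (the real implementation would be stage variables plus constraints in a Transformer; the link and file names are illustrative):

    # Sketch: input must be sorted on the comparison key. last_key plays the
    # role of the stage variable; is_dup is the flag the constraint would test.
    def route_rows(sorted_rows, key_func):
        last_key = object()                    # sentinel that matches nothing
        for row in sorted_rows:
            key = key_func(row)
            is_dup = (key == last_key)
            last_key = key                     # remember for the next row
            yield ("duplicates.txt" if is_dup else "main_link", row)

    rows = sorted([("55020", "AB000"), ("55020", "AB000"), ("44578", "NL789")])
    for target, row in route_rows(rows, key_func=lambda r: r):
        print(target, row)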
CN

Posted: Fri Mar 12, 2004 3:36 am
by pongal
I'll be more specific this time: all columns have the same values throughout each duplicate record.

For example, the extracted source table:

corpcode   location   account    project   cfis
55020      AB000      7897890    NULL      NULL
55020      AB000      7897890    NULL      NULL
44578      NL789      2345566    kpaccc    NULL
44578      NL789      2345566    kpaccc    NULL

Destination table:

Loadid   RecrdTransformed   corpcode   location   account    project    cfis
key      null               not null   not null   not null   not null   not null

Posted: Fri Mar 12, 2004 7:47 am
by chulett
There's no magic bullet. You will need to create a process to compare one row to another, just like you would with any other 'tool' or database. A couple of suggestions off the top of my head:

Drop the output to a flat file and then use your O/S to remove the duplicates. For example, from UNIX you could use the sort -u command; you'd have to tell me if you have any equivalent under Windows.

Sort your data. Create a new field which is the concatenated result of all of your fields. Make sure you convert the nulls to something else consistently first - like a single space - then concatenate. Then use a Stage Variable, as others have suggested, to compare row to row and discard all duplicates.
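
A small sketch of that recipe, assuming the rows arrive as lists with None for the NULLs; the separator in the concatenation is an extra safeguard, added here so two different rows can never produce the same string:

    # Sketch: convert NULLs consistently, concatenate all fields into one key,
    # then compare each sorted row's key with the previous one.
    def make_key(row, null_replacement=" "):
        return "|".join(null_replacement if f is None else str(f) for f in row)

    rows = [
        ["55020", "AB000", "7897890", None, None],
        ["55020", "AB000", "7897890", None, None],
        ["44578", "NL789", "2345566", "kpaccc", None],
    ]

    rows.sort(key=make_key)
    previous, kept = None, []
    for row in rows:
        key = make_key(row)
        if key != previous:            # stage-variable style comparison
            kept.append(row)
        previous = key
    # kept now holds one copy of each distinct row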


Posted: Fri Mar 12, 2004 11:15 am
by datastage
Typically I use a method like Craig suggests, where I concatenate the columns into one string and use a stage variable to compare.

I've never used checksum CRCs much, so I'll pose this question to the gang: is using checksum() another valid way to search for duplicates (especially if the source is a text file and there isn't the option of querying for dups on the db)? Or is this method more valuable for changed data capture than for duplicate elimination?

Posted: Sat Mar 13, 2004 4:41 pm
by ray.wurlod
Checksum is not a perfect way to check for duplicates, because there is a slim chance (approximately one in 2,147,483,648 for CRC32) that a false positive will occur - that is, that different data will generate the same checksum. That's close enough for many folks.
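
A minimal Python sketch of the checksum approach, using zlib's CRC32 and keeping the full rows around precisely so that a matching checksum can be verified rather than trusted:

    # Sketch: flag probable duplicates by checksum, then confirm against the
    # stored row, since different rows can legitimately share a CRC32 value.
    import zlib

    def row_crc(row, null_replacement=" "):
        text = "|".join(null_replacement if f is None else str(f) for f in row)
        return zlib.crc32(text.encode("utf-8"))

    seen = {}
    for row in [["55020", "AB000", "7897890", None, None],
                ["55020", "AB000", "7897890", None, None]]:
        crc = row_crc(row)
        if crc in seen and seen[crc] == row:
            print("duplicate:", row)
        elif crc in seen:
            print("collision - same checksum, different rows:", row)
        else:
            seen[crc] = row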

Posted: Sun Mar 14, 2004 12:04 pm
by datastage
Thanks Ray. That's why I've tended to stay away from checksum in the past, but seeing an actual number for the odds should clear any doubt as to the effectiveness of the method.

Posted: Sun Mar 14, 2004 6:04 pm
by ray.wurlod
It's only approximate, remember. The chance of a false positive could be as high as one in 2,147,483,647. :!:

Posted: Mon Mar 15, 2004 8:30 am
by datastage
Okay...let's phrase it this way...forget the exact odds. Here is the scenario that counts:

You are a consultant ETL developer. You use this method, and with your luck you get a false positive causing an error during your ETL process. You explain to the client the method used and what happened. Would you still get blamed or do you get off the hook? :roll: That's what really matters!

Posted: Mon Mar 15, 2004 8:35 am
by chulett
Here's what REALLY matters. By the time this happens:

A) Does anybody notice?
B) Are you still there?

:lol: Kidding, of course. Sorta. :wink:

I'd amend one of your statements to: "You document the method used and what could happen."