to get duplicate records

abhay10 · Post by **abhay10** » Tue Nov 27, 2007 3:57 am

i hav 2 databases first includes duplicates data and second one after removing duplicate data
eg first one contain 100 records and second one 90 records
and 10 are the duplicate entries, i want to find these 10 entries, how to do that?

koolnitz · Post by **koolnitz** » Tue Nov 27, 2007 4:17 am

Assuming that 'databases' mean 'tables' from your statement "i hav 2 databases...".
You can find the duplicate records without actually relating (e.g. join) the two tables, by firing the appropriate SQL statement against the table that has duplicate rows.

Karine · Post by **Karine** » Tue Nov 27, 2007 4:17 am

It would be easier to run a query against your first database with a query like:

SELECT COL1 FROM DBASE GROUP BY COL1 HAVING COUNT(*) > 1

assuming col1 is your unique key

devidotcom · Post by **devidotcom** » Tue Nov 27, 2007 4:23 am

Take the first dataset with 100 records and use the sort stage to sort and remove duplicate records.

Set Allow duplicates to true in the sort stage
We can capture the duplicate records using the create change key column option under the stage-> properties tab
This new key would have the value of 1 for the first record and the duplicate records as 0's. Use a filter stage next and filter the records which are 0's

abhay10 · Post by **abhay10** » Tue Nov 27, 2007 4:43 am

Karine wrote:It would be easier to run a query against your first database with a query like:

SELECT COL1 FROM DBASE GROUP BY COL1 HAVING COUNT(*) > 1

assuming col1 is your unique key

COUNT DOESNT WORK IN DATASTAGE...

abhay10 · Post by **abhay10** » Tue Nov 27, 2007 4:46 am

koolnitz wrote:Assuming that 'databases' mean 'tables' from your statement "i hav 2 databases...".
You can find the duplicate records without actually relating (e.g. join) the two tables, by firing the appropriate SQL statement against the table that has duplicate rows.

pls post that sql statement that should work in datastage...

koolnitz · Post by **koolnitz** » Tue Nov 27, 2007 4:47 am

Pls post what is your Database.
Btw, you could also google for the command

koolnitz · Post by **koolnitz** » Tue Nov 27, 2007 4:55 am

I'm assuming that your table has 3 columns with name col1, col2 and col3.
Here is a generic query that meets your requirement...

SELECT col1, col2, col3, count(*) as Count
FROM tab_name
GROUP BY col1, col2, col3
HAVING COUNT(*)>1

In the Database stage in your DS job, define four columns - col1, col2, col3 and Count.

ray.wurlod · Post by **ray.wurlod** » Tue Nov 27, 2007 5:48 am

abhay10 wrote:COUNT DOESNT WORK IN DATASTAGE...

Yes it does, when it's done right.

abhay10 · Post by **abhay10** » Tue Nov 27, 2007 6:32 am

koolnitz wrote:I'm assuming that your table has 3 columns with name col1, col2 and col3.
Here is a generic query that meets your requirement...

SELECT col1, col2, col3, count(*) as Count
FROM tab_name
GROUP BY col1, col2, col3
HAVING COUNT(*)>1

In the Database stage in your DS job, define four columns - col1, col2, col3 and Count.

this will work fine but will show duplicate data only once even if more than 2 records are there

abhay10 · Post by **abhay10** » Tue Nov 27, 2007 6:35 am

ray.wurlod wrote:
abhay10 wrote:COUNT DOESNT WORK IN DATASTAGE...
Yes it does, when it's done right.

ya, sry for that...

ray.wurlod · Post by **ray.wurlod** » Tue Nov 27, 2007 5:05 pm

HAVING COUNT(*) = 1 gives you the 90 records and HAVING COUNT(*) > 1 gives you the other 10. Isn't that what you wanted?

You can do this within DataStage, effectively emulating the HAVING clause. Use a "fork join" with an Aggregator stage to get the count for each group, then run one copy of these results through a Filter stage based on the count being greater than 1. These (tgt2) are your duplicate rows. The other copy you run through a Remove Duplicates stage to get your unique rows including one copy of each duplicate row.

Code: Select all

 ----->  Copy  --------------->  Join  ---->  Copy  ---->  Filter  ---->  tgt2
          |                       ^             |
          |                       |             |
          +---->  Aggregator  ----+             +---->  RemDups  ---->  tgt1

Make very sure that your partitioning is key-based on the grouping key(s).

abhay10 · Post by **abhay10** » Tue Nov 27, 2007 9:19 pm

ray.wurlod wrote:HAVING COUNT(*) = 1 gives you the 90 records and HAVING COUNT(*) > 1 gives you the other 10. Isn't that what you wanted?

You can do this within DataStage, effectively emulating the HAVING claus ...

no it will give 10 records only when each of those 10 records have occured only 2 times, suppose the same record has occured eleven times than it will give only one record since all other 9 are same, but i want all the 10 records...

ray.wurlod · Post by **ray.wurlod** » Tue Nov 27, 2007 9:42 pm

The job design I posted will give you what you want.

ray.wurlod · Post by **ray.wurlod** » Tue Nov 27, 2007 9:45 pm

For less than 30c per day (approx Rs12) you can purchase premium membership that will allow you to read the entire posts of the five premium posters. Revenue from premium memberships is 100% allocated to the hosting and bandwidth costs of DSXchange; like all other posters the premium posters are unpaid.

to get duplicate records

to get duplicate records

Pre-emptive post