to get duplicate records

Post questions here relative to DataStage Enterprise/PX Edition for such areas as Parallel job design, Parallel datasets, BuildOps, Wrappers, etc.

Moderators: chulett, rschirm, roy

abhay10
Participant
Posts: 31
Joined: Tue Nov 20, 2007 11:39 pm
Location: Bangalore

to get duplicate records

Post by abhay10 »

i hav 2 databases first includes duplicates data and second one after removing duplicate data
eg first one contain 100 records and second one 90 records
and 10 are the duplicate entries, i want to find these 10 entries, how to do that?
koolnitz
Participant
Posts: 138
Joined: Wed Sep 07, 2005 5:39 am

Post by koolnitz »

Assuming that 'databases' mean 'tables' from your statement "i hav 2 databases...".
You can find the duplicate records without actually relating (e.g. join) the two tables, by firing the appropriate SQL statement against the table that has duplicate rows.
Last edited by koolnitz on Tue Nov 27, 2007 4:32 am, edited 2 times in total.
Nitin Jain | India

If everything seems to be going well, you have obviously overlooked something.
Karine
Participant
Posts: 37
Joined: Sun Feb 18, 2007 3:33 am

Post by Karine »

It would be easier to run a query against your first database with a query like:

SELECT COL1 FROM DBASE GROUP BY COL1 HAVING COUNT(*) > 1

assuming col1 is your unique key
devidotcom
Participant
Posts: 247
Joined: Thu Apr 27, 2006 6:38 am
Location: Hyderabad

Post by devidotcom »

Take the first dataset with 100 records and use the sort stage to sort and remove duplicate records.

Set Allow duplicates to true in the sort stage
We can capture the duplicate records using the create change key column option under the stage-> properties tab
This new key would have the value of 1 for the first record and the duplicate records as 0's. Use a filter stage next and filter the records which are 0's
abhay10
Participant
Posts: 31
Joined: Tue Nov 20, 2007 11:39 pm
Location: Bangalore

Post by abhay10 »

Karine wrote:It would be easier to run a query against your first database with a query like:

SELECT COL1 FROM DBASE GROUP BY COL1 HAVING COUNT(*) > 1

assuming col1 is your unique key

COUNT DOESNT WORK IN DATASTAGE...
abhay10
Participant
Posts: 31
Joined: Tue Nov 20, 2007 11:39 pm
Location: Bangalore

Post by abhay10 »

koolnitz wrote:Assuming that 'databases' mean 'tables' from your statement "i hav 2 databases...".
You can find the duplicate records without actually relating (e.g. join) the two tables, by firing the appropriate SQL statement against the table that has duplicate rows.

pls post that sql statement that should work in datastage...
koolnitz
Participant
Posts: 138
Joined: Wed Sep 07, 2005 5:39 am

Post by koolnitz »

Pls post what is your Database.
Btw, you could also google for the command :idea:
Nitin Jain | India

If everything seems to be going well, you have obviously overlooked something.
koolnitz
Participant
Posts: 138
Joined: Wed Sep 07, 2005 5:39 am

Post by koolnitz »

I'm assuming that your table has 3 columns with name col1, col2 and col3.
Here is a generic query that meets your requirement...

SELECT col1, col2, col3, count(*) as Count
FROM tab_name
GROUP BY col1, col2, col3
HAVING COUNT(*)>1

In the Database stage in your DS job, define four columns - col1, col2, col3 and Count.
Nitin Jain | India

If everything seems to be going well, you have obviously overlooked something.
ray.wurlod
Participant
Posts: 54607
Joined: Wed Oct 23, 2002 10:52 pm
Location: Sydney, Australia
Contact:

Post by ray.wurlod »

abhay10 wrote:COUNT DOESNT WORK IN DATASTAGE...
Yes it does, when it's done right.
IBM Software Services Group
Any contribution to this forum is my own opinion and does not necessarily reflect any position that IBM may hold.
abhay10
Participant
Posts: 31
Joined: Tue Nov 20, 2007 11:39 pm
Location: Bangalore

Post by abhay10 »

koolnitz wrote:I'm assuming that your table has 3 columns with name col1, col2 and col3.
Here is a generic query that meets your requirement...

SELECT col1, col2, col3, count(*) as Count
FROM tab_name
GROUP BY col1, col2, col3
HAVING COUNT(*)>1

In the Database stage in your DS job, define four columns - col1, col2, col3 and Count.
this will work fine but will show duplicate data only once even if more than 2 records are there
abhay10
Participant
Posts: 31
Joined: Tue Nov 20, 2007 11:39 pm
Location: Bangalore

Post by abhay10 »

ray.wurlod wrote:
abhay10 wrote:COUNT DOESNT WORK IN DATASTAGE...
Yes it does, when it's done right.
ya, sry for that...
ray.wurlod
Participant
Posts: 54607
Joined: Wed Oct 23, 2002 10:52 pm
Location: Sydney, Australia
Contact:

Post by ray.wurlod »

HAVING COUNT(*) = 1 gives you the 90 records and HAVING COUNT(*) > 1 gives you the other 10. Isn't that what you wanted?

You can do this within DataStage, effectively emulating the HAVING clause. Use a "fork join" with an Aggregator stage to get the count for each group, then run one copy of these results through a Filter stage based on the count being greater than 1. These (tgt2) are your duplicate rows. The other copy you run through a Remove Duplicates stage to get your unique rows including one copy of each duplicate row.

Code: Select all

 ----->  Copy  --------------->  Join  ---->  Copy  ---->  Filter  ---->  tgt2
          |                       ^             |
          |                       |             |
          +---->  Aggregator  ----+             +---->  RemDups  ---->  tgt1
Make very sure that your partitioning is key-based on the grouping key(s).
IBM Software Services Group
Any contribution to this forum is my own opinion and does not necessarily reflect any position that IBM may hold.
abhay10
Participant
Posts: 31
Joined: Tue Nov 20, 2007 11:39 pm
Location: Bangalore

Post by abhay10 »

ray.wurlod wrote:HAVING COUNT(*) = 1 gives you the 90 records and HAVING COUNT(*) > 1 gives you the other 10. Isn't that what you wanted?

You can do this within DataStage, effectively emulating the HAVING claus ...
no it will give 10 records only when each of those 10 records have occured only 2 times, suppose the same record has occured eleven times than it will give only one record since all other 9 are same, but i want all the 10 records...
ray.wurlod
Participant
Posts: 54607
Joined: Wed Oct 23, 2002 10:52 pm
Location: Sydney, Australia
Contact:

Post by ray.wurlod »

The job design I posted will give you what you want.
IBM Software Services Group
Any contribution to this forum is my own opinion and does not necessarily reflect any position that IBM may hold.
ray.wurlod
Participant
Posts: 54607
Joined: Wed Oct 23, 2002 10:52 pm
Location: Sydney, Australia
Contact:

Pre-emptive post

Post by ray.wurlod »

For less than 30c per day (approx Rs12) you can purchase premium membership that will allow you to read the entire posts of the five premium posters. Revenue from premium memberships is 100% allocated to the hosting and bandwidth costs of DSXchange; like all other posters the premium posters are unpaid.
IBM Software Services Group
Any contribution to this forum is my own opinion and does not necessarily reflect any position that IBM may hold.
Post Reply