remove duplicates stage and sort stage

Post questions here relative to DataStage Enterprise/PX Edition for such areas as Parallel job design, Parallel datasets, BuildOps, Wrappers, etc.

Moderators: chulett, rschirm, roy

rajanikc
Participant
Posts: 15
Joined: Wed Jan 24, 2007 11:11 am

remove duplicates stage and sort stage

Post by rajanikc »

Hi all,
Both the Remove Duplicates stage and the Sort stage perform the following operations:

1. Both sort the data.
2. Both remove duplicates.

So I was wondering what the ideal situation is for using the Sort stage exclusively versus the Remove Duplicates stage. Can you please explain when I have to use the Sort stage and cannot use the Remove Duplicates stage?
I hope I am clear.
Thanks
Rajani
---Raj
dsedi
Participant
Posts: 220
Joined: Wed Jun 02, 2004 12:38 am

Post by dsedi »

There are a lot of unique features available in each stage.

For example, in the Sort stage, if you set Allow Duplicates = False, the first row is retained. In the Remove Duplicates stage you get an option to choose which one to keep.
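
To make the "first row is retained" behaviour concrete, here is a minimal sketch in plain Python. It is not DataStage code; the function name `sort_drop_duplicates` and the sample rows are made up for illustration, and "first" is assumed to mean first in input order among rows with equal keys.

```python
from itertools import groupby
from operator import itemgetter

def sort_drop_duplicates(rows, key):
    """Sort rows on `key`, then keep only the first row of each key group.

    Hypothetical helper that mimics a sort with duplicates disallowed.
    """
    # sorted() is stable, so rows with equal keys keep their input order.
    ordered = sorted(rows, key=itemgetter(key))
    # groupby() yields one group per run of equal keys; take its first row.
    return [next(group) for _, group in groupby(ordered, key=itemgetter(key))]

rows = [
    {"id": 2, "val": "b1"},
    {"id": 1, "val": "a1"},
    {"id": 2, "val": "b2"},
    {"id": 1, "val": "a2"},
]
deduped = sort_drop_duplicates(rows, "id")
print(deduped)
```

There is no switch here for keeping the last row instead, which is the choice the Remove Duplicates stage adds.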

Edi
DSguru2B
Charter Member
Posts: 6854
Joined: Wed Feb 09, 2005 3:44 pm
Location: Houston, TX

Post by DSguru2B »

For more specifics, it's better to read the manual on these stages to get a clear understanding of when to use each one and when one stage takes precedence over the other.
Creativity is allowing yourself to make mistakes. Art is knowing which ones to keep.
rajanikc
Participant
Posts: 15
Joined: Wed Jan 24, 2007 11:11 am

Re: remove duplicates stage and sort stage

Post by rajanikc »

Hi, thanks for the reply. I have gone through the manual. It doesn't give much information on the specific differences. Can you please post those specific differences?
Thanks
Rajani
---Raj
DSguru2B
Charter Member
Posts: 6854
Joined: Wed Feb 09, 2005 3:44 pm
Location: Houston, TX

Post by DSguru2B »

The Remove Duplicates stage requires sorted data; the Sort stage does not.
The Remove Duplicates stage can retain the first or last duplicate, whereas the Sort stage only marks the first duplicate with 1 and the rest with 0. So it's rather easy to retain the last row using the Remove Duplicates stage.

Both these stages do exactly what their names say. It just happens that the Sort stage can also remove duplicates.

PS: Most stages have in-stage sort capabilities. That does not mean they are the same as the Sort stage.
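
The two behaviours described above can be sketched in plain Python. This is only an illustration under stated assumptions, not DataStage code: `key_change_flags`, `remove_duplicates`, the `retain` parameter and the sample rows are all hypothetical names, and the input is assumed to be already sorted on the key.

```python
from itertools import groupby
from operator import itemgetter

def key_change_flags(sorted_rows, key):
    """Return 1 for the first row of each key group and 0 for the rest."""
    flags = []
    previous = object()  # sentinel that never equals a real key value
    for row in sorted_rows:
        flags.append(1 if row[key] != previous else 0)
        previous = row[key]
    return flags

def remove_duplicates(sorted_rows, key, retain="first"):
    """Keep the first or last row of each key group of pre-sorted input."""
    if retain not in ("first", "last"):
        raise ValueError("retain must be 'first' or 'last'")
    pick = 0 if retain == "first" else -1
    return [
        list(group)[pick]
        for _, group in groupby(sorted_rows, key=itemgetter(key))
    ]

# Sample data, already sorted on "id" (and on "ts" within each id).
sorted_rows = [
    {"id": 1, "ts": 10},
    {"id": 1, "ts": 20},
    {"id": 2, "ts": 5},
    {"id": 2, "ts": 7},
    {"id": 2, "ts": 9},
]
flags = key_change_flags(sorted_rows, "id")
first_rows = remove_duplicates(sorted_rows, "id", retain="first")
last_rows = remove_duplicates(sorted_rows, "id", retain="last")
print(flags, first_rows, last_rows)
```

With only the 1/0 flag, the first row of each group is easy to filter on, but the last row is not marked, which is why retaining the last row is simpler with a first/last option.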
Creativity is allowing yourself to make mistakes. Art is knowing which ones to keep.
vijayrc
Participant
Posts: 197
Joined: Sun Apr 02, 2006 10:31 am
Location: NJ

Post by vijayrc »

DSguru2B wrote:
The Remove Duplicates stage requires sorted data; the Sort stage does not.
The Remove Duplicates stage can retain the first or last duplicate, whereas the Sort stage only marks the first duplicate with 1 and the rest with 0. So it's rather easy to retain the last row using the Remove Duplicates stage.

Both these stages do exactly what their names say. It just happens that the Sort stage can also remove duplicates.

PS: Most stages have in-stage sort capabilities. That does not mean they are the same as the Sort stage.

If performance is one of your criteria, and the Sort stage can do what you intended to do with a Remove Duplicates stage, go with the Sort stage, as Remove Duplicates seems to slow things down. (This was also recommended by IBM consultants.)
rajanikc
Participant
Posts: 15
Joined: Wed Jan 24, 2007 11:11 am

Post by rajanikc »

Thanks for the replies. These postings really helped me.
....Rajani
---Raj