remove duplicates stage and sort stage

rajanikc · Post by **rajanikc** » Thu May 24, 2007 12:28 pm

Hi all,
Both the Remove Duplicates stage and the Sort stage perform the following operations:

1. Both Sort the data
2. Both remove duplicates.

So I was wondering what the ideal situation is for using the Sort stage exclusively versus the Remove Duplicates stage. In other words, can you please explain when I have to use the Sort stage and can't use the Remove Duplicates stage instead?
Hope I am clear
Thanks
Rajani

dsedi · Post by **dsedi** » Thu May 24, 2007 12:42 pm

Each stage has a lot of unique features.

For example, in the Sort stage, if you set Allow Duplicates = False, the first row of each set of duplicates is retained. In the Remove Duplicates stage you get an option to choose which one to retain.

Edi
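A rough Python model of the Sort stage behaviour described above, for illustration only. This is not DataStage code: the function name `sort_unique` and the sample rows are hypothetical, and in a real job which row counts as "first" depends on the input order, the sort's stability setting, and partitioning.

```python
from itertools import groupby
from operator import itemgetter


def sort_unique(rows, key):
    """Sort rows by `key` and keep only the first row of each key group.

    Hypothetical stand-in for a sort with duplicates disallowed.
    """
    # sorted() is stable, so rows with equal keys keep their input order.
    ordered = sorted(rows, key=itemgetter(key))
    # groupby() yields one group per run of equal keys; take its first row.
    return [next(group) for _, group in groupby(ordered, key=itemgetter(key))]


rows = [
    {"id": 2, "val": "b1"},
    {"id": 1, "val": "a1"},
    {"id": 2, "val": "b2"},
    {"id": 1, "val": "a2"},
]
result = sort_unique(rows, "id")
```

Note that there is no way to ask this model for the last row of each group; that choice is what the Remove Duplicates stage adds.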

DSguru2B · Post by **DSguru2B** » Thu May 24, 2007 12:47 pm

For more specifics, it's a better idea to read the manual about these stages to get a clear understanding of when to use each one and when one stage takes precedence over the other.

rajanikc · Post by **rajanikc** » Thu May 24, 2007 12:52 pm

Hi, thanks for the reply. I have gone through the manual. It doesn't give much info on specific differences. Can you please post those specific differences?
Thanks
Rajani

DSguru2B · Post by **DSguru2B** » Thu May 24, 2007 1:10 pm

The Remove Duplicates stage requires sorted data; the Sort stage does not.
The Remove Duplicates stage can retain either the first or the last duplicate, whereas the Sort stage only marks the first row of each set of duplicates with 1 and the rest with 0. So it's rather easy to retain the last row using the Remove Duplicates stage.

Both these stages do exactly what they are called. It just happens that sort stage can also remove duplicates.

PS: Most stages have in-stage sort capabilities. That does not mean they are the same as the SORT stage.
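The two behaviours contrasted in this post can be sketched in Python as follows. This is only an illustrative model under stated assumptions, not DataStage code: `mark_key_change` and `remove_duplicates` are hypothetical names, and both functions assume their input is already sorted on the key, as the Remove Duplicates stage requires.

```python
def mark_key_change(sorted_rows, key):
    """Pair each row with a flag: 1 on the first row of a key group, else 0."""
    marked = []
    previous = object()  # sentinel that never equals a real key value
    for row in sorted_rows:
        flag = 1 if row[key] != previous else 0
        marked.append((row, flag))
        previous = row[key]
    return marked


def remove_duplicates(sorted_rows, key, retain="first"):
    """Keep one row per key group from key-sorted input: the first or the last."""
    if retain not in ("first", "last"):
        raise ValueError("retain must be 'first' or 'last'")
    result = []
    for row in sorted_rows:
        if result and result[-1][key] == row[key]:
            # Same key as the row already kept for this group.
            if retain == "last":
                result[-1] = row  # overwrite so the latest row wins
        else:
            result.append(row)  # first row of a new key group
    return result


# Hypothetical sample data, already sorted on "id".
sorted_rows = [
    {"id": 1, "val": "a1"},
    {"id": 1, "val": "a2"},
    {"id": 2, "val": "b1"},
]
flags = [flag for _, flag in mark_key_change(sorted_rows, "id")]
first_kept = remove_duplicates(sorted_rows, "id", retain="first")
last_kept = remove_duplicates(sorted_rows, "id", retain="last")
```

With only the 1/0 flag available, keeping the last row of a group needs extra work downstream, since the flag identifies only the first row. The `retain="last"` option does it in a single pass.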

vijayrc · Post by **vijayrc** » Sat May 26, 2007 7:27 pm

DSguru2B wrote: The Remove Duplicates stage requires sorted data; the Sort stage does not.
The Remove Duplicates stage can retain either the first or the last duplicate, whereas the Sort stage only marks the first row of each set of duplicates with 1 and the rest with 0. So it's rather easy to retain the last row using the Remove Duplicates stage.

Both these stages do exactly what they are called. It just happens that sort stage can also remove duplicates.

PS: Most stages have in-stage sort capabilities. That does not mean they are the same as the SORT stage.

If performance is one of the criteria, and the Sort stage can do what you intend to do with a Remove Duplicates stage, go with the Sort stage, as Remove Duplicates seems to slow things down (this was also recommended by IBM consultants).

rajanikc · Post by **rajanikc** » Sat May 26, 2007 7:41 pm

Thanks for the replies. These postings really helped me.
....Rajani

DSXchange
