Hi all,
Both the Remove Duplicates stage and the Sort stage perform the following operations:
1. Both sort the data.
2. Both remove duplicates.
So I was wondering what the ideal situation is for using exclusively the Sort stage or the Remove Duplicates stage. In other words, can you please explain when I have to use the Sort stage and can't use the Remove Duplicates stage instead?
Hope I am clear
Thanks
Rajani
remove duplicates stage and sort stage
Hi, thanks for the reply. I have gone through the manual, but it doesn't give much info on the specific differences. Can you please post those specific differences?
Thanks
Rajani
The Remove Duplicates stage requires sorted data; the Sort stage does not.
The Remove Duplicates stage can retain either the first or the last duplicate, whereas the Sort stage only marks the first row of each group of duplicates with 1 and the rest with 0 (its key change column). So it's rather easy to retain the last row using the Remove Duplicates stage.
Both of these stages do exactly what their names say. It just happens that the Sort stage can also remove duplicates.
PS: Most stages have in-stage sort capabilities. That does not mean they are the same as the Sort stage.
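To make the difference concrete, here is a rough sketch in plain Python. It is not DataStage code, just an illustration of the behaviour described above. The function names, the sample rows and the `keyChange` field name are made up for the example, and the sort is assumed to be stable.

```python
from itertools import groupby
from operator import itemgetter


def sort_with_key_change(rows, key):
    """Stable-sort rows on `key` and flag the first row of each key group.

    Mimics the idea of a key change column: 1 on the first row of a
    group, 0 on every following row with the same key value.
    """
    out = []
    previous = object()  # sentinel that never equals a real key value
    for row in sorted(rows, key=itemgetter(key)):
        flagged = dict(row)
        flagged["keyChange"] = 1 if row[key] != previous else 0
        previous = row[key]
        out.append(flagged)
    return out


def remove_duplicates(sorted_rows, key, retain="first"):
    """Keep one row per key value. The input must already be sorted on `key`."""
    kept = []
    for _, group in groupby(sorted_rows, key=itemgetter(key)):
        group = list(group)
        kept.append(group[0] if retain == "first" else group[-1])
    return kept


rows = [
    {"id": 2, "val": "a"},
    {"id": 1, "val": "b"},
    {"id": 2, "val": "c"},
    {"id": 1, "val": "d"},
]

# Sort-stage style: only the first row of each group is marked with 1.
flagged = sort_with_key_change(rows, "id")
first_only = [r for r in flagged if r["keyChange"] == 1]

# Remove-Duplicates style: sort first, then pick first or last per key.
last_rows = remove_duplicates(sorted(rows, key=itemgetter("id")), "id", retain="last")
```

Filtering on the flag gives you the first row per key for free, but nothing in the flagged output tells you which row is the last one in its group. That is why retaining the last row is easier with the Remove Duplicates stage.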
If performance is one of the criteria, and the Sort stage can do what you intend to do with a Remove Duplicates stage, go with the Sort stage, as Remove Duplicates seems to slow things down (this is also recommended by IBM consultants).