Hash partitioning problem in Remove Duplicates stage

kpavan2004
Participant
Posts: 9
Joined: Sun Oct 19, 2008 7:09 am

Hash partitioning problem in Remove Duplicates stage

Post by kpavan2004 »

Hi,
Requirement: I need to get the latest record for each key, based on a timestamp field.

Scenario 1: I used a Remove Duplicates stage for this.
On the input link of this stage I hash partitioned on the key column and the timestamp field and selected the "Perform sort" option.
In the Remove Duplicates stage properties, the key is the key column only (excluding the timestamp field). But the output has duplicate rows for the key.
When I checked, I found that records with the same key are not in the same partition, so hash partitioning does not seem to be working correctly. Is anything wrong in the logic?

Scenario 2: I use a Sort stage, hash partitioning on the key column on its input link and sorting on the key column and the timestamp field in the stage properties.
I then use a Remove Duplicates stage on the key column, and the output has no duplicate records. It is working correctly but


Could anyone please tell me why the hash partitioning is not working in scenario 1? I thought it is due to duplicates but even
Pavan
mhester
Participant
Posts: 622
Joined: Tue Mar 04, 2003 5:26 am
Location: Phoenix, AZ
Contact:

Post by mhester »

I believe that in scenario 1 it is working exactly as you have coded it. You are partitioning on key and timestamp, which will not produce the same data distribution as simply partitioning on the key, as in your second scenario.
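
To illustrate the point about data distribution, here is a minimal Python sketch. It is not DataStage code: `toy_hash`, `partition_of`, the four-partition count and the sample rows are all made up for the example, and the real hash partitioner uses a different function. The idea is the same, though: the partition is computed from every column in the hash key, so rows that share a key but differ in timestamp can land in different partitions.

```python
NUM_PARTITIONS = 4  # hypothetical number of partitions (nodes)


def toy_hash(text):
    """Toy stand-in for a hash function: sum of the UTF-8 bytes (assumption)."""
    return sum(text.encode("utf-8"))


def partition_of(values, num_partitions=NUM_PARTITIONS):
    """Pick a partition from ALL the column values used as the hash key."""
    return toy_hash("|".join(values)) % num_partitions


# Three rows with the same key and different timestamps (made-up data).
rows = [
    {"key": "A", "ts": "2008-11-01 10:00:00"},
    {"key": "A", "ts": "2008-11-02 10:00:00"},
    {"key": "A", "ts": "2008-11-03 10:00:00"},
]

# Scenario 2 style: hash on the key column only.
parts_key_only = {partition_of((r["key"],)) for r in rows}

# Scenario 1 style: hash on key column plus timestamp.
parts_key_and_ts = {partition_of((r["key"], r["ts"])) for r in rows}

print("key only       :", sorted(parts_key_only))
print("key + timestamp:", sorted(parts_key_and_ts))
```

With the key alone, all three rows go to one partition. With key plus timestamp, they are spread over several partitions, so a Remove Duplicates stage running per partition never sees them together.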
vinothkumar
Participant
Posts: 342
Joined: Tue Nov 04, 2008 10:38 am
Location: Chennai, India

Post by vinothkumar »

You should not include the timestamp in the hash partitioning keys for scenario 1.
kris007
Charter Member
Posts: 1102
Joined: Tue Jan 24, 2006 5:38 pm
Location: Riverside, RI

Post by kris007 »

You need to partition only on the key field, but sort on the Timestamp field as well within the Remove Duplicates stage to achieve the required results. Depending on which record you want to retain, you can sort ascending or descending on the Timestamp field.
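
As a rough sketch of that logic in Python (not DataStage code; `latest_per_key`, the toy byte-sum hash, the partition count and the sample rows are assumptions made for the example): partition on the key only, sort each partition on key and timestamp ascending, then keep the last row of each key group.

```python
from itertools import groupby
from operator import itemgetter


def latest_per_key(rows, num_partitions=4):
    """Keep the newest row per key: partition on key, sort on key + ts."""
    # Step 1: hash partition on the key column ONLY (toy hash, an assumption).
    partitions = [[] for _ in range(num_partitions)]
    for row in rows:
        index = sum(row["key"].encode("utf-8")) % num_partitions
        partitions[index].append(row)

    result = []
    for part in partitions:
        # Step 2: sort within the partition on key, then timestamp ascending.
        part.sort(key=itemgetter("key", "ts"))
        # Step 3: remove duplicates on key, retaining the last (newest) row.
        for _, group in groupby(part, key=itemgetter("key")):
            result.append(list(group)[-1])
    return result


# Made-up sample data; ISO-style timestamp strings sort chronologically.
sample = [
    {"key": "A", "ts": "2008-11-02 10:00:00"},
    {"key": "B", "ts": "2008-11-01 08:00:00"},
    {"key": "A", "ts": "2008-11-03 10:00:00"},
    {"key": "A", "ts": "2008-11-01 10:00:00"},
    {"key": "B", "ts": "2008-11-05 08:00:00"},
]

latest = sorted(latest_per_key(sample), key=itemgetter("key"))
for row in latest:
    print(row["key"], row["ts"])
```

To retain the oldest row instead, either sort the timestamp descending or keep the first row of each group rather than the last.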

Hope that helps.
Kris

Where's the "Any" key?-Homer Simpson