Hash Partitioning on fields with different data lengths

Post questions here relative to DataStage Enterprise/PX Edition for such areas as Parallel job design, Parallel datasets, BuildOps, Wrappers, etc.

Moderators: chulett, rschirm, roy

Post Reply
arib0nd
Participant
Posts: 6
Joined: Wed Jan 19, 2005 3:35 pm

Hash Partitioning on fields with different data lengths

Post by arib0nd »

Quick question on the suitability of hash partitioning (or a suitable alternative/work around) for the following scenario -

Assuming I am reading two different data sets which have the same key (lets say employee id).
Data Set A
employee# - char 5
example values "A100 " , " A101"

Data Set B
Employee# - char 8
example values " A100" , " A101 "

If I partition both data sets using Hash partitioning and sort them, will the join later downstream be successful?
By the time I do the join, I have converted both to the same data type.. in this case varchar(4), but at the time of partitioning, since I am trying to do it up front, they were both different (char 5 and char 8 ) .
Would values such as "A100 " and " A100" across both data sets end up in the same partition or not?

Also would I need to wait to do the sort till after I have combined the data to the same data type as well , or can I also sort them up front?

Thanks!
ray.wurlod
Participant
Posts: 54595
Joined: Wed Oct 23, 2002 10:52 pm
Location: Sydney, Australia
Contact:

Post by ray.wurlod »

There's no guarantee that "A100" and " A100" will be on the same partition, since they're different values. So you will need to make them the same, possibly via string_trim() in a Modify stage or Trim() in a Transformer stage.

Values that are identical will end up on the same partition if the same partitioning algorithm (based on the same key(s)) is used.
IBM Software Services Group
Any contribution to this forum is my own opinion and does not necessarily reflect any position that IBM may hold.
Post Reply