Hash partition key's datatype changes partition behavior?

djwagner
Premium Member
Posts: 17
Joined: Mon Jul 31, 2006 11:37 am

Post by djwagner »

Hello,

I'm running DataStage 8.1 FP1 on Windows Server 2003.

I'm seeing behavior that the documentation does not mention and that differs from what I expected.

When using the Hash partitioning method in a parallel Transformer stage, record distribution across the nodes appears to depend on the datatype of the selected hash key field.

I created a simple job that strips out all other logic and isolates the odd behavior. The test uses a 4-node config file and the hash partition method, with ColA as the selected partition key. My input data set is 16 records total with the following values:

ColA
"1"
"1"
"1"
"1"
"2"
"2"
"2"
"2"
"3"
"3"
"3"
"3"
"4"
"4"
"4"
"4"

Based on the job monitor in Director:

When ColA's datatype is set as Varchar (works as intended):
node1=4 records
node2=4 records
node3=4 records
node4=4 records

When ColA's datatype is set as Integer (not sure why this occurs):
node1=0 records
node2=4 records
node3=12 records
node4=0 records

When ColA's datatype is set as Decimal (not sure why this occurs):
node1=16 records
node2=0 records
node3=0 records
node4=0 records


Can anyone explain why I'm seeing this behavior?

Thanks,
David
ray.wurlod
Participant
Posts: 54607
Joined: Wed Oct 23, 2002 10:52 pm
Location: Sydney, Australia

Post by ray.wurlod »

Value    Datatype        Internal Storage
1        string[1]       00110001
1        int32           00000000000000000000000000000001
1        decimal[1,0]    00000001

"1" is Char(49)
IBM Software Services Group
Any contribution to this forum is my own opinion and does not necessarily reflect any position that IBM may hold.
djwagner
Premium Member
Posts: 17
Joined: Mon Jul 31, 2006 11:37 am

Post by djwagner »

I understand that the internal representation is different based on datatype, but shouldn't the partitioning be consistent no matter what datatype is chosen?

For this test, I'm reading from a sequential file, and the data converts successfully to whichever data type I choose. But I would think that regardless of the data type chosen, the 1s should be partitioned together, the 2s together, the 3s together, and so on. That should hold whether the values are stored as 00000001, 00000010, 00000011, etc. or as char(49), char(50), char(51), etc.

See what I mean? :)

I apologize if I'm missing your point...
ray.wurlod
Participant
Posts: 54607
Joined: Wed Oct 23, 2002 10:52 pm
Location: Sydney, Australia

Post by ray.wurlod »

My point is that the partitioning is driven by the raw (binary) values.
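
To illustrate the idea, here is a minimal Python sketch. It is not DataStage's partitioner (that algorithm is not published); it assumes CRC32 as a stand-in hash, big-endian int32 storage, and plain ASCII for strings. The function names are made up for this example. It only shows that hashing the raw bytes of a key means the same logical value can land on a different partition once its datatype changes.

```python
import struct
import zlib
from collections import Counter

NUM_PARTITIONS = 4  # matches the 4-node config file in the test job


def encode_as_string(value: int) -> bytes:
    """Raw bytes of the value stored as an ASCII string, e.g. 1 -> b'1' (0x31)."""
    return str(value).encode("ascii")


def encode_as_int32(value: int) -> bytes:
    """Raw bytes of the value stored as a big-endian 32-bit integer (an assumption)."""
    return struct.pack(">i", value)


def partition_of(raw: bytes, num_partitions: int = NUM_PARTITIONS) -> int:
    """Stand-in hash partitioner: CRC32 of the raw bytes, modulo the node count."""
    return zlib.crc32(raw) % num_partitions


def distribution(values, encoder):
    """Count how many records each partition receives for a given encoding."""
    counts = Counter(partition_of(encoder(v)) for v in values)
    return [counts.get(p, 0) for p in range(NUM_PARTITIONS)]


# Same 16 records as the test job: four each of 1, 2, 3 and 4.
keys = [1] * 4 + [2] * 4 + [3] * 4 + [4] * 4

for name, encoder in (("string", encode_as_string), ("int32", encode_as_int32)):
    print(name, distribution(keys, encoder))
```

Under either encoding, rows with the same key value still land on the same partition, so every count is a multiple of four, just as in the job monitor figures above. What changes with the datatype is which partition each key group is sent to, and therefore how evenly a small set of distinct keys spreads across the nodes.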
Mike
Premium Member
Posts: 1021
Joined: Sun Mar 03, 2002 6:01 pm
Location: Tampa, FL

Post by Mike »

Given that the partitioning algorithm is undocumented and subject to change, the only thing that you can reliably assume is that any difference in metadata of any sort *may* result in a different partition assignment.

I always make sure that my partitioning keys have the exact same data type, length and nullability.

To extend Ray's point a bit further, I wouldn't even assume that different data types having the same binary representation would partition the same... based on experience with another tool having a very similar parallel architecture.
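
A rough sketch of that check in Python, for anyone who wants to compare key metadata outside the tool. The KeyColumn class and key_mismatches function are hypothetical names for this example, not part of any DataStage API; the metadata would have to be typed in or exported by hand.

```python
from dataclasses import dataclass
from typing import List, Optional


@dataclass(frozen=True)
class KeyColumn:
    """Hypothetical description of one partitioning key column."""
    name: str
    data_type: str
    length: Optional[int]
    nullable: bool


def key_mismatches(left: List[KeyColumn], right: List[KeyColumn]) -> List[str]:
    """List every difference in type, length or nullability between two key lists."""
    problems = []
    if len(left) != len(right):
        problems.append(f"key count differs ({len(left)} vs {len(right)})")
    # Compare keys position by position; zip stops at the shorter list.
    for l_col, r_col in zip(left, right):
        for attr in ("data_type", "length", "nullable"):
            l_val, r_val = getattr(l_col, attr), getattr(r_col, attr)
            if l_val != r_val:
                problems.append(
                    f"{l_col.name}/{r_col.name}: {attr} differs ({l_val!r} vs {r_val!r})"
                )
    return problems


link_a = [KeyColumn("ColA", "varchar", 10, False)]
link_b = [KeyColumn("ColA", "int32", None, False)]
for problem in key_mismatches(link_a, link_b):
    print(problem)
```

An empty result means the two links agree on the key metadata. Per the caveat above, that is a necessary precaution, not a guarantee of identical partitioning.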

Mike