Hash partition key's datatype changes partition behavior?

djwagner · Post by **djwagner** » Wed Feb 02, 2011 3:32 pm

Hello,

I'm running DataStage 8.1 FP1 on Windows Server 2003.

I'm seeing behavior that the documentation does not mention and that differs from what I expect.

When using the Hash partitioning method in a parallel Transformer stage, record distribution across the nodes appears to depend on the datatype of the selected hash key field.

I created a simple job to strip out all other logic and isolate the odd behavior. The test uses a 4-node config file and the hash partition method, with ColA as the selected partition key. My input data set has 16 records in total, with the following values:

ColA
"1"
"1"
"1"
"1"
"2"
"2"
"2"
"2"
"3"
"3"
"3"
"3"
"4"
"4"
"4"
"4"

Based on the job monitor in Director,
When ColA's datatype is set as Varchar (works as intended)
node1=4 records
node2=4 records
node3=4 records
node4=4 records
When ColA's datatype is set as Integer (not sure why this occurs)
node1=0 records
node2=4 records
node3=12 records
node4=0 records
When ColA's datatype is set as Decimal (not sure why this occurs)
node1=16 records
node2=0 records
node3=0 records
node4=0 records

Any explanation for why I'm seeing this behavior?

Thanks,
David

ray.wurlod · Post by **ray.wurlod** » Wed Feb 02, 2011 3:57 pm

Value    Datatype           Internal Storage
  1        string[1]        00110001
  1        int32            00000000000000000000000000000001
  1        decimal[1,0]     00000001

"1" is Char(49)
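
To make the storage difference concrete, here is a small Python sketch (not DataStage code) that prints the raw bits of the value 1 as a one-character string and as a 32-bit integer. The big-endian byte order is an assumption chosen to match the layout in the table above; the order the engine actually uses in memory may differ. The decimal case is left out because its exact storage format is not something this sketch can confirm.

```python
import struct


def bits(raw: bytes) -> str:
    """Render raw bytes as a string of 0s and 1s."""
    return "".join(f"{b:08b}" for b in raw)


# The same logical value, 1, under two different datatypes.
as_string = "1".encode("ascii")   # one character, code point 49
as_int32 = struct.pack(">i", 1)   # 4-byte signed integer, big-endian (assumed)

print(bits(as_string))  # 00110001
print(bits(as_int32))   # 00000000000000000000000000000001
```

The two byte strings share nothing beyond the final bit, so any function that works on the raw bytes is fed different input for each datatype.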

djwagner · Post by **djwagner** » Mon Feb 07, 2011 11:14 am

I understand that the internal representation is different based on datatype, but shouldn't the partitioning be consistent no matter what datatype is chosen?

For this test, I'm reading from a sequential file and can convert the data to whichever data type I choose. But I would think that regardless of the data type chosen, the 1s should be partitioned together, the 2s should be partitioned together, the 3s should be partitioned together, and so on. That should hold whether the values are stored as 00000001, 00000010, 00000011, etc. or as char(49), char(50), char(51), etc.

See what I mean? :)

I apologize if I'm missing your point...

ray.wurlod · Post by **ray.wurlod** » Mon Feb 07, 2011 12:03 pm

My point is that the partitioning is driven by the raw (binary) values.
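
A rough way to see the effect is the Python sketch below. It is only an illustration: the hash algorithm DataStage uses is not documented in this thread, so `zlib.crc32` stands in for it, and the byte encodings (ASCII for Varchar, big-endian int32 for Integer) are assumptions. What it does show is general to hash partitioning: records with the same key always land on the same node, but nothing guarantees that four distinct keys spread across four different nodes, and the spread can change when the raw bytes change.

```python
import struct
import zlib
from collections import Counter

NODES = 4
values = [1, 2, 3, 4] * 4  # the 16 test records from the original post


def partition(raw: bytes, nodes: int = NODES) -> int:
    """Stand-in hash partitioner: CRC32 of the raw bytes, modulo node count."""
    return zlib.crc32(raw) % nodes


def distribution(encode) -> Counter:
    """Count records per partition for a given key encoding."""
    return Counter(partition(encode(v)) for v in values)


def as_varchar(v: int) -> bytes:
    """Assumed Varchar storage: ASCII digits."""
    return str(v).encode("ascii")


def as_int32(v: int) -> bytes:
    """Assumed Integer storage: 4-byte big-endian signed integer."""
    return struct.pack(">i", v)


print("varchar:", sorted(distribution(as_varchar).items()))
print("int32:  ", sorted(distribution(as_int32).items()))
```

Every per-node count is a multiple of 4, because each key value keeps its four records together; which nodes receive them, and how many keys collide on one node, depends on the bytes being hashed.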

Mike · Post by **Mike** » Mon Feb 07, 2011 7:23 pm

Given that the partitioning algorithm is undocumented and subject to change, the only thing that you can reliably assume is that any difference in metadata of any sort *may* result in a different partition assignment.

I always make sure that my partitioning keys have the exact same data type, length and nullability.
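
Here is a minimal Python sketch of why that matters when two inputs are partitioned on the same logical key, for example ahead of a join. Everything in it is hypothetical: `zlib.crc32` is a stand-in for the undocumented hash, and `encode_key` assumes ASCII storage for varchar and big-endian storage for int32.

```python
import struct
import zlib

NODES = 4


def partition(raw: bytes, nodes: int = NODES) -> int:
    """Stand-in hash partitioner (not the real DataStage algorithm)."""
    return zlib.crc32(raw) % nodes


def encode_key(value: int, datatype: str) -> bytes:
    """Assumed raw storage for a key value under a given datatype."""
    if datatype == "varchar":
        return str(value).encode("ascii")
    if datatype == "int32":
        return struct.pack(">i", value)
    raise ValueError(f"unsupported datatype: {datatype}")


def co_located(value: int, left_type: str, right_type: str) -> bool:
    """True if both inputs send this key value to the same partition."""
    left = partition(encode_key(value, left_type))
    right = partition(encode_key(value, right_type))
    return left == right


for key in (1, 2, 3, 4):
    print(key,
          "same types:", co_located(key, "int32", "int32"),
          "mixed types:", co_located(key, "varchar", "int32"))
```

With identical key metadata on both inputs, matching rows are always co-located. With mixed types they may or may not be, and whether they are is an accident of the hash rather than something to rely on.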

To extend Ray's point a bit further, I wouldn't even assume that different data types having the same binary representation would partition the same... based on experience with another tool having a very similar parallel architecture.

Mike