Hello,
I'm running Datastage 8.1 FP1 on Windows Server 2003.
I'm experiencing behavior that the documentation does not mention and that differs from what I expect.
When using the Hash partitioning method in a parallel transformer stage, record distribution across the nodes appears to depend on the datatype of the selected hash key field.
I created a simple job, stripped of all other logic, to isolate the odd behavior. For example: the test uses a 4-node config file with the Hash partitioning method and ColA as the partition key. My input data set is 16 records total with the following values:
ColA
"1"
"1"
"1"
"1"
"2"
"2"
"2"
"2"
"3"
"3"
"3"
"3"
"4"
"4"
"4"
"4"
Based on the job monitor in Director,
When ColA's datatype is set as Varchar (works as intended)
node1=4 records
node2=4 records
node3=4 records
node4=4 records
When ColA's datatype is set as Integer (not sure why this occurs)
node1=0 records
node2=4 records
node3=12 records
node4=0 records
When ColA's datatype is set as Decimal (not sure why this occurs)
node1=16 records
node2=0 records
node3=0 records
node4=0 records
Can anyone explain why I'm seeing this behavior?
Thanks,
David
Hash partition key's datatype changes partition behavior?
Participant · Posts: 54607 · Joined: Wed Oct 23, 2002 10:52 pm · Location: Sydney, Australia
Code:
Value  Datatype      Internal Storage
1      string[1]     00110001
1      int32         00000000000000000000000000000001
1      decimal[1,0]  00000001
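The table above can be illustrated with a small sketch. The hash function and node count below are stand-ins (DataStage's actual partitioning algorithm is undocumented); the point is only that a hash partitioner consumes the key's internal bytes, so the same logical value in three datatypes presents three different inputs to the hash.

```python
# Illustrative sketch only -- DataStage's real hash function is undocumented.
# A hash partitioner hashes the key's internal byte representation, so the
# logical value 1 stored as string, int32, or packed decimal presents three
# different byte strings to the hash function.
import struct
import zlib

NODES = 4  # matches the 4-node config file in the test

def partition(key_bytes: bytes) -> int:
    # Stand-in hash (CRC32); NOT the algorithm DataStage uses.
    return zlib.crc32(key_bytes) % NODES

value = 1
as_string  = str(value).encode("ascii")   # 1 byte:  00110001
as_int32   = struct.pack(">i", value)     # 4 bytes: 00000000...00000001
as_decimal = bytes([value])               # 1 byte:  00000001

for name, b in [("string[1]", as_string), ("int32", as_int32),
                ("decimal[1,0]", as_decimal)]:
    print(f"{name:12} -> node {partition(b)}")
```

Because the three byte strings differ, nothing forces the hash to assign them to the same node, which is consistent with the skewed distributions seen in Director.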
IBM Software Services Group
Any contribution to this forum is my own opinion and does not necessarily reflect any position that IBM may hold.
I understand that the internal representation is different based on datatype, but shouldn't the partitioning be consistent no matter what datatype is chosen?
For this test I'm reading from a sequential file, so I can successfully convert the data to whichever data type I choose. But I would think that, regardless of the data type chosen, the 1s should be partitioned together, the 2s together, the 3s together, and so on, whether a value is stored as 00000001, 00000010, 00000011, etc. or as char(49), char(50), char(51), etc.
See what I mean? :)
I apologize if I'm missing your point...
Given that the partitioning algorithm is undocumented and subject to change, the only thing that you can reliably assume is that any difference in metadata of any sort *may* result in a different partition assignment.
I always make sure that my partitioning keys have the exact same data type, length and nullability.
To extend Ray's point a bit further, I wouldn't even assume that different data types having the same binary representation would partition the same... based on experience with another tool having a very similar parallel architecture.
Mike
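Ray's and Mike's advice (keep the key's metadata identical) can be sketched as a normalization step. This is a hypothetical illustration, not DataStage behavior: the canonical() helper and the CRC32 stand-in hash are assumptions. The idea is that if every key is reduced to one canonical representation before hashing, logically equal values co-locate no matter what type the source column declares.

```python
# Sketch of the "normalize your key metadata" advice; the canonicalization
# convention and the CRC32 stand-in hash are assumptions, not DataStage API.
import zlib
from decimal import Decimal

NODES = 4

def canonical(key) -> bytes:
    # Assumed convention: render every key as its decimal string form,
    # so "1", 1, and Decimal("1") all become b"1".
    return str(int(key)).encode("ascii")

def partition(key) -> int:
    return zlib.crc32(canonical(key)) % NODES

# The value 1 as varchar, integer, and decimal now maps to one node.
same_value = ["1", 1, Decimal("1")]
print({partition(k) for k in same_value})  # a single-element set
```

In a real job the equivalent step is an upstream conversion (or consistent table definitions) so the hash key reaches the partitioner with one agreed type, length, and nullability.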