Relationship between Nodes and Partitions...

Post questions here relative to DataStage Enterprise/PX Edition for such areas as Parallel job design, Parallel datasets, BuildOps, Wrappers, etc.

Moderators: chulett, rschirm, roy

Post Reply
riyazahmed.d
Participant
Posts: 3
Joined: Sun Nov 18, 2018 11:16 pm
Location: Pune

Relationship between Nodes and Partitions...

Post by riyazahmed.d »

https://www.ibm.com/support/knowledgece ... ioner.html

Question-->

Does Hash Partition depend on the number of Nodes defined in the Configuration file? As per the example given on IBM page, it seems there is no relation between number of Nodes and Partitions, and a single node can have many Partitions.

If possible please explain using the example given on IBM page (URL attached above).
- Riyaz
ray.wurlod
Participant
Posts: 54607
Joined: Wed Oct 23, 2002 10:52 pm
Location: Sydney, Australia
Contact:

Post by ray.wurlod »

Welcome aboard.

Let's consider first the Modulus algorithm (applicable only to integers). The integer value is divided by the number of nodes and the remainder (obtained through the Mod() function) is the node number to which that particular value will be directed.

The only difference for the Hash algorithm is that it is applicable to any data type. All of the characters in the value are processed via an algorithm (something like adding all the ASCII values, though a little more complex than that) to produce a large integer value called the hash value. This is divided by the number of nodes and, as with Modulus algorithm, the remainder yields the node number to which that particular value will be directed.

Node numbers, like most values in the DataStage parallel engine, start from 0.
IBM Software Services Group
Any contribution to this forum is my own opinion and does not necessarily reflect any position that IBM may hold.
riyazahmed.d
Participant
Posts: 3
Joined: Sun Nov 18, 2018 11:16 pm
Location: Pune

Post by riyazahmed.d »

Thanks Ray for the detailed explanation.

Below is the paragraph from the IBM page on Hash Partitioner:

"When hash partitioning, you should select hashing keys that create a large number of partitions. For example, hashing by the first two digits of a zip code produces a maximum of 100 partitions. This is not a large number for a parallel processing system. Instead, you could hash by five digits of the zip code to create up to 10,000 partitions. You also could combine a zip code hash with an age hash (assuming a maximum age of 190), to yield 1,500,000 possible partitions. "

As per your above explanation, if we have 4 nodes we get remainders 0,1,2,3 and accordingly data will be divided among 4 nodes but as per the IBM paragraph we can have "n" number of partitions depending upon the data in the hash key and it is saying 15 million partitions are possible.

Could you please explain using the example given on the IBM page?
- Riyaz
UCDI
Premium Member
Premium Member
Posts: 383
Joined: Mon Mar 21, 2016 2:00 pm

Post by UCDI »

If the data isn't suited to partitions "as is", run some of it through a Checksum stage and use that field as the partition key. It should give solid results.
jneasy
Participant
Posts: 32
Joined: Sun Jan 29, 2012 8:47 pm
Location: Australia

Post by jneasy »

If the data isn't suited to partitions what is the point of using a checksum for partitioning? For example if you want to partition on say a gender code which potentially only has two distinct values and you checksum on the gender code you will still only have two distinct values produced from the checksum. The only difference you will have a much larger additional field.

If all you are worried about is even data distribution across your nodes and no requirement to rejoin then why not Round Robin?
ray.wurlod
Participant
Posts: 54607
Joined: Wed Oct 23, 2002 10:52 pm
Location: Sydney, Australia
Contact:

Post by ray.wurlod »

The IBM example was written by someone who doesn't fully understand what is happening or, perhaps, has a million partitions available.

In real life, your data can only be partitioned over the number of nodes available.
IBM Software Services Group
Any contribution to this forum is my own opinion and does not necessarily reflect any position that IBM may hold.
riyazahmed.d
Participant
Posts: 3
Joined: Sun Nov 18, 2018 11:16 pm
Location: Pune

Post by riyazahmed.d »

Thanks everyone!!

It is pretty much clear now :)
- Riyaz
Post Reply