Grid Implementation

U · Post by U » Tue Mar 25, 2014 5:05 pm

As part of planning our next upgrade we are considering a grid implementation. Would those who have travelled this road please share any do's and don'ts, any traps and any best practices?

One question in particular: how is scratchdisk (which would ordinarily always be local) handled in a grid configuration? Presumably it has to be on shared disk.

Thank you for your time.

thompsonp · Post by **thompsonp** » Wed Mar 26, 2014 5:00 am

It's a long but detailed read:
https://www.redbooks.ibm.com/Redbooks.n ... .html?Open

The scratch disk is used for temporary files that relate to jobs running on a particular node. There is no need for other compute nodes to access this data. It is therefore normal to have them as local disks (less network traffic and no contention from other nodes).

It's the file system holding your other data (datasets, source and output files, for example) that needs to be shared, as all nodes need access to it.
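To illustrate the split described above (shared disk for data, local disk for scratch), here is a minimal Python sketch that renders node entries in the style of a parallel configuration file. The host names (`compute1`, `compute2`) and both paths are hypothetical placeholders, not values from this thread, and a real grid setup would normally generate this file dynamically rather than by hand.

```python
def render_node(name, fastname, shared_disk, local_scratch):
    """Render one node entry: shared resource disk, local scratchdisk."""
    return (
        f'  node "{name}"\n'
        "  {\n"
        f'    fastname "{fastname}"\n'
        '    pools ""\n'
        # Data sets live on storage every compute node can reach.
        f'    resource disk "{shared_disk}" {{pools ""}}\n'
        # Scratch stays on a disk local to this node.
        f'    resource scratchdisk "{local_scratch}" {{pools ""}}\n'
        "  }\n"
    )


def render_config(hosts, shared_disk, local_scratch):
    """Render a whole configuration with one node entry per host."""
    body = "".join(
        render_node(f"node{i}", host, shared_disk, local_scratch)
        for i, host in enumerate(hosts, start=1)
    )
    return "{\n" + body + "}\n"


if __name__ == "__main__":
    # Hypothetical hosts and paths, for illustration only.
    print(render_config(["compute1", "compute2"],
                        "/shared/datasets", "/local/scratch"))
```

Every node points at the same shared path for `resource disk`, while `resource scratchdisk` names a path that exists separately on each node's own disk.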

lstsaur · Post by **lstsaur** » Wed Mar 26, 2014 2:49 pm

I have always built grid environments on a NAS-based configuration because it's much lower cost and makes it simpler to implement multiple head nodes and HA.
Make sure the compute nodes have multiple NICs for public and private network connections, and that the host names of the nodes resolve correctly on the private network. Use multiple 10 Gb switches with MOU enabled on the private network instead of a 1 Gb switch.

Scratch space on the compute nodes in a grid environment must be a local disk, not NAS-mounted or NFS-mounted.

Good Luck.

U · Post by U » Wed Mar 26, 2014 5:55 pm

Thank you both.

lstsaur, can you please advise what MOU is in this context?

PaulVL · Post by **PaulVL** » Thu Mar 27, 2014 10:04 am

If going grid, factor in CPU usage and not just slots as a means of determining the next server. Don't send a job to a server that is above ... 85% CPU usage.
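As a rough illustration of that rule, here is a minimal Python sketch of node selection that looks at both free slots and CPU usage. It is not how any particular resource manager implements it; the `Node` fields, the `pick_node` helper, and the tie-breaking order are assumptions made for the example, and only the 85% ceiling comes from the post.

```python
from dataclasses import dataclass
from typing import List, Optional

CPU_LIMIT = 85.0  # percent; the ceiling suggested in the post


@dataclass
class Node:
    name: str
    free_slots: int
    cpu_percent: float


def pick_node(nodes: List[Node], cpu_limit: float = CPU_LIMIT) -> Optional[Node]:
    """Pick a node that has a free slot and is not above the CPU ceiling."""
    eligible = [
        n for n in nodes
        if n.free_slots > 0 and n.cpu_percent <= cpu_limit
    ]
    if not eligible:
        return None  # nothing suitable: the job should wait in the queue
    # Assumed tie-break: least busy CPU first, then most free slots.
    return min(eligible, key=lambda n: (n.cpu_percent, -n.free_slots))


if __name__ == "__main__":
    nodes = [
        Node("compute1", free_slots=4, cpu_percent=92.0),  # slots free, CPU too hot
        Node("compute2", free_slots=1, cpu_percent=40.0),
        Node("compute3", free_slots=0, cpu_percent=10.0),  # idle CPU, no slots
    ]
    chosen = pick_node(nodes)
    print(chosen.name if chosen else "no eligible node")
```

A slots-only scheduler would send the job to `compute1` here because it has the most free slots; adding the CPU check steers it to `compute2` instead.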

I prefer having priority determined at your job scheduler level, not the grid level. You will still need at least 4 priorities in your grid: HOT, HIGH, NORMAL, LOW. Every job is submitted as NORMAL. If you want a GZIP, COMPRESS, etc. to be run on your data, submit it to the grid with a HIGH priority. Have an admin heartbeat job (which tests the health of DataStage submitting grid jobs) sent as HOT priority.
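The four-level scheme can be sketched with a small priority queue in Python. This is a generic illustration, not the behaviour of any specific grid product; the `GridQueue` class and the job names are hypothetical, and only the four priority names and the NORMAL default come from the post.

```python
import heapq
import itertools
from enum import IntEnum


class Priority(IntEnum):
    # Lower value is dispatched first.
    HOT = 0
    HIGH = 1
    NORMAL = 2
    LOW = 3


class GridQueue:
    """Dispatch by priority, first in first out within a priority."""

    def __init__(self):
        self._heap = []
        self._counter = itertools.count()  # preserves submission order

    def submit(self, job_name, priority=Priority.NORMAL):
        # Every job defaults to NORMAL unless the caller says otherwise.
        heapq.heappush(self._heap, (priority, next(self._counter), job_name))

    def next_job(self):
        if not self._heap:
            return None
        _, _, job_name = heapq.heappop(self._heap)
        return job_name


if __name__ == "__main__":
    q = GridQueue()
    q.submit("load_customers")                  # NORMAL by default
    q.submit("gzip_extract", Priority.HIGH)     # compression step
    q.submit("admin_heartbeat", Priority.HOT)   # health check
    while (job := q.next_job()) is not None:
        print(job)
```

Although the heartbeat is submitted last, it is dispatched first, so a health check is not stuck behind a backlog of ordinary jobs.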

The above are not requirements by any means, just how I prefer to set things up.

You will also have to decide on how many slots per core you want to run with. That's where the book says one thing... observation says another...

I don't want to taint your viewpoint, but just don't go with whatever the Redbook says; there are other ways to address how work is distributed to the compute nodes (quantity of slots vs. cores).

Are you going with Platform LSF?