Grid Performance Testing

bobyon
Premium Member
Posts: 200
Joined: Tue Mar 02, 2004 10:25 am
Location: Salisbury, NC

Grid Performance Testing

Post by bobyon »

I've been tasked with developing some instructional guidelines for our development teams regarding how many nodes and partitions to request for various types of jobs. I know this is as much art as science, but I thought I should start with gathering some information first.

I started just trying to see how much performance was affected by changing from 2 to 4 to 8 nodes, etc. So, to start with I used a row generator to create a fairly large dataset with 10 or 15 columns with a variety of data types and lengths.

I then read this dataset into a dummy job that simply sorts the data 3 different ways, hash partitioning the data before each Sort stage, and then a TX that simply multiplies any decimal columns by 2 and writes the result to another dataset.

When running this second job (with 1 node and 2 partitions) multiple times in a row, I expected to see very similar run times that I could use as a baseline. However, that is not the case at all: I am seeing run times ranging from 6 to 11.5 minutes; startup times range from 1 to 4 seconds.

I pretty much have this little test system to myself for this testing. It has 3 compute nodes, each with 8 processors, plenty of memory, and plenty of free disk space. The resource manager is LoadLeveler.

Any ideas why I am not seeing more consistent run times for a job that runs multiple times reading unchanged data?

Any words of wisdom on how to approach this kind of testing to determine what, if any, performance improvements are gained by increasing (or decreasing) nodes/partitions?

I anxiously await your perspicuous counsel.

Thanks,
Bob
PaulVL
Premium Member
Posts: 1315
Joined: Fri Dec 17, 2010 4:36 pm

Post by PaulVL »

I took the same approach in my old environment. You'll have to average the run times: run each test at least 10 times, back to back.

Get your run time and your startup cost.
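Here's a rough sketch of scripting those back-to-back runs, assuming dsjob is on your PATH and using placeholder project/job names (GRIDTEST / SortBench); swap in however you normally launch the job:

import statistics
import subprocess
import time

PROJECT = "GRIDTEST"   # placeholder project name
JOB = "SortBench"      # placeholder job name
RUNS = 10              # at least 10 back-to-back runs

elapsed = []
for i in range(RUNS):
    start = time.monotonic()
    # -jobstatus makes dsjob wait for the job to finish before returning
    subprocess.run(["dsjob", "-run", "-jobstatus", PROJECT, JOB], check=False)
    elapsed.append(time.monotonic() - start)
    print(f"run {i + 1}: {elapsed[-1]:.1f}s")

print(f"mean {statistics.mean(elapsed):.1f}s, "
      f"min {min(elapsed):.1f}s, max {max(elapsed):.1f}s, "
      f"stdev {statistics.pstdev(elapsed):.1f}s")

The wall-clock numbers above include grid queue wait, which is why you also want the log-based startup figure described below to separate the two.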

You'll have to look in the detailed log file to calculate the real startup cost.

There's a slight bug in the log printing: the timestamp of each entry is not actually the time the event happened, it's the time the event was logged (which is different). There's another slight oddity: there is a print buffer that sometimes only gets flushed when the next line of information is sent to be printed.

I say this because it's actually useful for seeing what your true startup time was (which includes connectivity to external databases and extraction of rows).

Long story short: look in your detailed job log for the entries that say Node1 started... Node2 started... and you'll see that the last one appears delayed compared to the rest. It's not really delayed in starting up; it's just the log entry that is delayed in printing, and it only gets written when the next valid log entry arrives. So the time from when your job first started (the first entry in the log) to when the last NodeX entry was printed is your true startup cost (including grid wait time, osh startup, and database connections).

So now you have the job start time, the job WORK start time (job start time plus startup cost), the job end time, and the wait time in queue.
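A rough sketch of pulling that startup cost out of the log. The file name, timestamp format, and the "NodeX started" pattern are all assumptions here; it presumes you've exported the detail log to a text file with a timestamp at the start of each line, so match the constants to whatever your export actually looks like:

import re
from datetime import datetime

LOGFILE = "job_detail.log"      # assumed: detail log exported to a text file
TS_FMT = "%Y-%m-%d %H:%M:%S"    # assumed timestamp format at the start of each line
TS_RE = re.compile(r"^(\d{4}-\d{2}-\d{2} \d{2}:\d{2}:\d{2})")
NODE_STARTED = re.compile(r"node\s*\d+.*started", re.IGNORECASE)  # adjust to your entries

first_ts = None
last_node_ts = None
with open(LOGFILE) as f:
    for line in f:
        m = TS_RE.match(line)
        if not m:
            continue
        ts = datetime.strptime(m.group(1), TS_FMT)
        if first_ts is None:
            first_ts = ts            # first entry in the log = job start
        if NODE_STARTED.search(line):
            last_node_ts = ts        # keep the latest "NodeX started" entry

if first_ts and last_node_ts:
    cost = (last_node_ts - first_ts).total_seconds()
    print(f"startup cost: {cost:.0f} seconds")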

I think that will provide the answers you seek. And yes... job execution time does vary, that's why you want to average out the times across multiple executions.

Files could be cached on your NAS, other folks could be chewing up the network (low probability), etc.

Test with a job that runs at least 10 mins (20 would be better). Throw gigs of row_gen data at it.
lstsaur
Participant
Posts: 1139
Joined: Thu Oct 21, 2004 9:59 pm

Post by lstsaur »

Q. Any ideas why I am not seeing more consistent run times for a job that runs multiple times reading unchanged data?

I would suspect the problem is that your job's traffic is being routed over the public network rather than the desired private network. You can verify that by checking the node translation process.

You should definitely get a very consistent run time if the job is processed on the private network. Is your "little test system" a SAN-based or NAS-based grid environment?
Do the compute nodes have multiple NIC cards? Do you have multiple network switches with MOU enabled?

Q. Any words of wisdom on how to approach this kind of testing to determine what, if any, performance improvements are gained by increasing (or decreasing) nodes/partitions.

There is no "magic" way to determine how many nodes/partitions to use for a particular job, but I would still recommend the following steps based on my past experience (a rough scripting sketch for the test configurations follows the list):

1. Start with one compute node and one partition for the job, and use this as your baseline performance.

2. Increase the number of partitions until the CPUs on the node are fully used.

3. Establish whether the runtime performance is satisfactory (that is, you see improvement over the baseline). If it is not, add another compute node for the job.

4. Measure the CPU use of the compute nodes assigned to the job. I don't know whether LoadLeveler provides this kind of resource monitoring capability.

5. If the job requires additional memory, consider using additional compute nodes (with fewer partitions per node). Likewise, if there isn't enough scratch space on a single node for the sorts required, add another compute node.

6. Most PX developers don't realize that requesting resources from the resource manager can sometimes constitute a significant portion of a job's execution time. Especially for a long series of jobs, the cost of requesting resources can add up to hours. In that case it doesn't matter how you increase or decrease nodes/partitions; the job just sucks!
However, the Grid Enablement Toolkit does provide a script (sequencer.sh) that can be used to interact with the resource manager to address this kind of problem.
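To make steps 1 through 3 repeatable, you can generate a static configuration file per test point and point $APT_CONFIG_FILE at each in turn. In a grid environment the toolkit normally builds the configuration dynamically at run time, so treat this purely as a controlled-test aid; the host names and disk paths below are placeholders:

# Rough sketch: write an APT configuration file with N partitions spread
# round-robin over the compute nodes under test. Fastnames and paths are placeholders.

COMPUTE_HOSTS = ["compute1", "compute2", "compute3"]   # placeholder fastnames
DISK = "/ds/data"                                      # placeholder resource disk
SCRATCH = "/ds/scratch"                                # placeholder scratch disk

def write_apt_config(path, partitions, hosts):
    with open(path, "w") as f:
        f.write("{\n")
        for p in range(partitions):
            host = hosts[p % len(hosts)]               # round-robin partitions over hosts
            f.write(f'  node "node{p + 1}"\n')
            f.write("  {\n")
            f.write(f'    fastname "{host}"\n')
            f.write('    pools ""\n')
            f.write(f'    resource disk "{DISK}" {{pools ""}}\n')
            f.write(f'    resource scratchdisk "{SCRATCH}" {{pools ""}}\n')
            f.write("  }\n")
        f.write("}\n")

# Baseline: one partition on one node; then scale partitions on that node;
# then add nodes (mirroring steps 1-3 above).
write_apt_config("base_1x1.apt", 1, ["compute1"])
write_apt_config("scale_4x1.apt", 4, ["compute1"])
write_apt_config("scale_8x2.apt", 8, ["compute1", "compute2"])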
bobyon
Premium Member
Posts: 200
Joined: Tue Mar 02, 2004 10:25 am
Location: Salisbury, NC

Post by bobyon »

lstsaur wrote: I would suspect the problem is that your job's traffic is being routed over the public network rather than the desired private network. You can verify that by checking the node translation process.
How do I check the node translation process?

lstsaur wrote: Is your "little test system" a SAN-based or NAS-based grid environment?
NAS based.

lstsaur wrote: Do the compute nodes have multiple NIC cards?
Yes, multiple NICs.

lstsaur wrote: Do you have multiple network switches with MOU enabled?
I have no idea. How can I tell?
Bob
bobyon
Premium Member
Posts: 200
Joined: Tue Mar 02, 2004 10:25 am
Location: Salisbury, NC

Post by bobyon »

PaulVL wrote: I took the same approach in my old environment. You'll have to average the run times: run each test at least 10 times, back to back.
...
Throw gigs of row_gen data at it.

Thanks for the input. I'll throw some more data at it and see how it goes.
Bob
lstsaur
Participant
Posts: 1139
Joined: Thu Oct 21, 2004 9:59 pm

Post by lstsaur »

A node_table file under the $GRIDHOME directory should have the public-to-private network IP address translations defined for all compute nodes.
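I don't have a node_table in front of me, so the layout in this sketch is an assumption (hostname, public IP, private IP per line); adjust the parsing to whatever your file actually contains. The idea is just to see which address each compute node name currently resolves to, so you can spot traffic that would route over the public side:

import socket

NODE_TABLE = "/opt/grid/node_table"   # placeholder for $GRIDHOME/node_table
PRIVATE_PREFIX = "192.168."           # placeholder for your private subnet

with open(NODE_TABLE) as f:
    for line in f:
        fields = line.split()
        if line.startswith("#") or len(fields) < 3:
            continue
        host, public_ip, private_ip = fields[:3]      # assumed column order
        resolved = socket.gethostbyname(host)
        side = "private" if resolved.startswith(PRIVATE_PREFIX) else "PUBLIC"
        print(f"{host}: resolves to {resolved} ({side}); table lists {public_ip} / {private_ip}")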