Maximum number of instances of a DS job at a time

Post questions here relative to DataStage Enterprise/PX Edition for such areas as Parallel job design, Parallel datasets, BuildOps, Wrappers, etc.

Moderators: chulett, rschirm, roy

bman
Participant
Posts: 33
Joined: Wed Oct 10, 2007 5:42 pm

Maximum number of instances of a DS job at a time

Post by bman »

Hi

Is there a limit to the number of instances of a multi-instance job that can be run simultaneously?

I tried running one of my multi-instance jobs: first 10 instances at a time, then 30 at a time, and then 80 at a time.

No job failures were reported when I ran 10 instances at a time.
A couple of instances failed when I ran 30 instances at a time.
Around 20 instances failed when I ran 80 instances at a time.

The Director log had the error below for the failed jobs:

Error setting up internal communications (fifo RT_SCTEMP/Jobname.invocationid.fifo
LOCKED STATUS () -1); file is locked

I have checked and confirmed that all invocation IDs are unique and that the output dataset file written by each instance is unique.

Also, I have noticed that when I run ps -ef | grep <datastage user> there are far too many processes under the DataStage user. I was expecting the number of processes to equal the number of instances, but when I checked during the 80-instance test, at one point there were more than 1000 processes for the DataStage user.
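One way to make sense of that long process list is to group it by command name. The sketch below uses a fixed sample instead of live `ps -ef` output so the result is reproducible; the user name "dsadm", the sample lines, and the field position are illustrative, not taken from this thread.

```shell
# Group process listing lines by command (field 8 in this sample layout)
# and count each. With real data you would pipe `ps -ef` in instead.
sample='dsadm 101 1 0 10:00 - 0:00 osh
dsadm 102 1 0 10:00 - 0:00 osh
dsadm 103 1 0 10:00 - 0:00 osh
dsadm 104 1 0 10:00 - 0:00 dsapi_slave'

summary=$(printf '%s\n' "$sample" \
    | awk '{ n[$8]++ } END { for (c in n) print n[c], c }' \
    | sort -rn)
echo "$summary"
```

With the sample above this prints "3 osh" followed by "1 dsapi_slave", showing at a glance which commands dominate.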
ray.wurlod
Participant
Posts: 54607
Joined: Wed Oct 23, 2002 10:52 pm
Location: Sydney, Australia
Contact:

Post by ray.wurlod »

There is no DataStage limit to the number of instances of a multi-instance job that can be running at the same time, provided each has a unique invocation ID.

There obviously are resource limitations.

When you include any inter-process communication in the mix, there seems to be some kind of collision when creating the pipes through which it takes place.

This may or may not be a bug. Have you asked your support provider?

If the job aborts and you re-run with the same invocation ID too soon, then the pipe from the previous invocation may still show as being in use.
IBM Software Services Group
Any contribution to this forum is my own opinion and does not necessarily reflect any position that IBM may hold.
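Following on from the point about a previous invocation's pipe still showing as in use: one way to check for leftover fifos before re-running with the same invocation ID is to look for named pipes in the project's RT_SCTEMP directory. This is a sketch; the project path is a placeholder, and the name pattern reuses the Jobname.invocationid form from the error message above.

```shell
# Look for leftover named pipes (-type p) from a previous invocation.
# PROJECT_DIR is a placeholder for the real project directory.
PROJECT_DIR=/path/to/project
leftover=$(find "$PROJECT_DIR/RT_SCTEMP" \
    -name 'Jobname.invocationid.*fifo*' -type p 2>/dev/null)
echo "$leftover"
```

If the command prints anything, the previous invocation's pipe is still present and re-running with that invocation ID may hit the locked-fifo error.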
bkumar103
Participant
Posts: 214
Joined: Wed Jul 25, 2007 2:29 am
Location: Chennai

Post by bkumar103 »

This all depends on resource availability, such as the number of connections to the database, the file system, etc.
Birendra
kduke
Charter Member
Posts: 5227
Joined: Thu May 29, 2003 9:47 am
Location: Dallas, TX
Contact:

Post by kduke »

Usually the operating system cannot keep up with this many processes starting at the same time. You can get more without failures if you sleep in between starts.
Mamu Kim
bman
Participant
Posts: 33
Joined: Wed Oct 10, 2007 5:42 pm

Post by bman »

Hi,

That is what I am doing now: running only 10 instances at a time. Each time a new instance needs to be started, the script gets a count of the total number of job instances running and starts a new instance only if the count is less than 10; otherwise it sleeps for a while.

But I am curious about the number of processes started with each run of a job. I can see more than 5 processes per job. Any pointers on what these processes are, and whether the number is constant per job or varies depending on the stages used?
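The throttling approach described above can be sketched roughly as follows. The job name, the commented-out dsjob call, and the numbers are placeholders; a real script would count live instances of the actual job.

```shell
#!/bin/sh
# Throttle sketch: start a new instance only when fewer than MAX are
# running; otherwise sleep and re-check. Names are placeholders.
MAX=10
TOTAL=80

running_count() {
    # Count live instances; "[J]obname" avoids matching the grep itself.
    ps -ef | grep -c "[J]obname\."
}

started=0
id=1
while [ "$id" -le "$TOTAL" ]; do
    while [ "$(running_count)" -ge "$MAX" ]; do
        sleep 5    # back off until a slot frees up
    done
    # dsjob -run ProjectName "Jobname.inv$id"   # hypothetical invocation
    started=$((started + 1))
    id=$((id + 1))
done
```

The key design point is that the limit is enforced by a live count rather than a fixed schedule, so a slow instance automatically holds back the queue.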
kumar_s
Charter Member
Posts: 5245
Joined: Thu Jun 16, 2005 11:00 pm

Post by kumar_s »

There would be a process for each operator per node, so it also depends on the number of stages you have in your job.
How long does each job run for? If it's just a few minutes, try adding a sleep of, say, 5-10 seconds between one set of jobs and 10-15 seconds before the next set, so that you kick off batches with these time lags.

But the error is due to a read/write issue with the temporary pipe in the RT_SCTEMP folder during job execution. As Ray suspected, try to avoid inter-process buffering if you have it enabled in your job.
Impossible doesn't mean 'it is not possible' actually means... 'NOBODY HAS DONE IT SO FAR'
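The process-per-operator-per-node point can be illustrated with rough back-of-the-envelope arithmetic. All the numbers here are hypothetical, and the overhead model (one conductor plus one section leader per node, per instance) is an approximation of the parallel engine's startup, not an exact formula.

```shell
stages=10       # operators in the job (assuming no operator combining)
nodes=4         # nodes in the configuration file
instances=10    # concurrent invocations

players=$((stages * nodes * instances))   # one player per operator per node
overhead=$(( (1 + nodes) * instances ))   # conductor + section leaders per instance
total=$((players + overhead))
echo "$total"
```

With these hypothetical numbers the estimate is 450 processes, which shows how quickly a modest job multiplied across nodes and instances reaches the thousand-plus processes observed above.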
kduke
Charter Member
Posts: 5227
Joined: Thu May 29, 2003 9:47 am
Location: Dallas, TX
Contact:

Post by kduke »

You are overwhelming your server. The jobs are failing for lack of resources, like Ray said.
Mamu Kim
bman
Participant
Posts: 33
Joined: Wed Oct 10, 2007 5:42 pm

Post by bman »

Hi,
Continuing on this thread,

I just tried running the flow and capturing all the processes started as part of it by grepping the processes running under the user ID... I can see many osh processes like the one below:

userid 307528 749948 0 15:19:48 - 0:00 /appl/infoserver/Server/PXEngine/bin/osh -APT_PMsectionLeaderFlag servername 10006 0 30 node1 servername 1209158387.877358.12703a 0

There were more than 20 such processes on the server while the interface was running. Any idea what this process is? Can I reduce these processes without reducing the number of stages in the job?
kduke
Charter Member
Posts: 5227
Joined: Thu May 29, 2003 9:47 am
Location: Dallas, TX
Contact:

Post by kduke »

Sure, run a different config file.
Mamu Kim
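A different (smaller) configuration file means fewer nodes and therefore fewer player processes, without changing the job design. A minimal single-node configuration file pointed to by APT_CONFIG_FILE looks roughly like this; the fastname and resource paths are placeholders for your environment.

```
{
	node "node1"
	{
		fastname "servername"
		pools ""
		resource disk "/path/to/datasets" {pools ""}
		resource scratchdisk "/path/to/scratch" {pools ""}
	}
}
```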
eostic
Premium Member
Posts: 3838
Joined: Mon Oct 17, 2005 9:34 am

Post by eostic »

As noted throughout, this is normal behavior for an EE job. Unless you are specifically combining operators, you can, in the worst case, expect the number of processes for a job to equal the number of stages. If you have 10 stages and 10 instances of that job running, you've now spawned somewhere around 100 processes (give or take, depending on job semantics, stages chosen, etc.). You can easily overwhelm your machine this way. With Server jobs, also as noted above, it's a little more predictable, but even with passive stages in the middle of a stream and inter-process row buffering turned on, you can end up with a lot of processes. The machine appears to be overwhelmed.

Ernie
Ernie Ostic

blogit!
<a href="https://dsrealtime.wordpress.com/2015/0 ... ere/">Open IGC is Here!</a>
kduke
Charter Member
Posts: 5227
Joined: Thu May 29, 2003 9:47 am
Location: Dallas, TX
Contact:

Post by kduke »

Ernie

You forgot the number of nodes adds to this.
Mamu Kim
eostic
Premium Member
Posts: 3838
Joined: Mon Oct 17, 2005 9:34 am

Post by eostic »

You are absolutely right... my paragraph above assumes a single node.

Ernie
Ernie Ostic

blogit!
<a href="https://dsrealtime.wordpress.com/2015/0 ... ere/">Open IGC is Here!</a>