Maximum number instances of a DS job at a time
Moderators: chulett, rschirm, roy
Hi
Is there a limit to the number of instances of a job that can be run simultaneously for a multi-instance job?
I tried running one of my multi-instance jobs: first 10 instances at a time, then 30 at a time, and then 80 at a time.
No job failures were reported when I ran 10 instances at a time.
A couple of instances failed when I ran 30 instances.
Around 20 instances failed when I ran 80 instances at a time.
The Director log had the error below for the failed jobs:
Error setting up internal communications (fifo RT_SCTEMP/Jobname.invocationid.fifo
LOCKED STATUS () -1); file is locked
I have checked and confirmed that all invocation IDs are unique and that the dataset output file written by each instance is unique.
Also, I noticed that when I ran ps -ef | grep <datastage user> there were far too many processes under the DataStage user. I was expecting the number of processes to equal the number of instances, but when I checked during the 80-instance test, at one point the number of processes for the DataStage user was more than 1000.
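For what it's worth, a quick way to count the processes owned by the engine user is sketched below. The user name "dsadm" is just a placeholder, and a canned ps-style listing stands in for live output so the whole pipeline is visible; on a real box you would pipe in `ps -ef` instead.

```shell
#!/bin/sh
# Sketch: count processes owned by the DataStage engine user.
# "dsadm" is a hypothetical user name; sample_ps stands in for `ps -ef`.
DS_USER="dsadm"
sample_ps() {
cat <<'EOF'
dsadm     1001     1  0 10:00 ?  00:00:01 dsrpcd
dsadm     1002  1001  0 10:01 ?  00:00:00 osh -APT_PMsectionLeaderFlag ...
root       900     1  0 09:00 ?  00:00:00 sshd
EOF
}
# Filter on the owner column and count matches in one awk pass.
count=$(sample_ps | awk -v u="$DS_USER" '$1 == u {n++} END {print n+0}')
echo "$count"   # → 2
```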
There is no DataStage limit to the number of instances of a multi-instance job that can be running at the same time, provided each has a unique invocation ID.
There obviously are resource limitations.
When you include any inter-process communication into the mix, then there seems to be some kind of collision when creating the pipes through which inter-process communication takes place.
This may or may not be a bug. Have you asked your support provider?
If the job aborts and you re-run with the same invocation ID too soon, then the pipe from the previous invocation may still show as being in use.
IBM Software Services Group
Any contribution to this forum is my own opinion and does not necessarily reflect any position that IBM may hold.
Hi,
That is what I am doing now: running only 10 instances at a time. Each time a new instance needs to be started, the script gets a count of the total number of job instances running and starts a new instance only if the count is less than 10; otherwise it sleeps for a while.
But I am curious about the number of processes being started with each run of a job. I can see more than 5 processes per job. Any pointers on what these processes are, and whether the number of such processes is constant per job or varies depending on the stages used?
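A minimal sketch of that throttling loop is below. The instance-counting and launch steps are stubbed out so the control flow can be exercised on its own; in a real wrapper they would shell out to `ps` or the `dsjob` command instead.

```shell
#!/bin/sh
# Sketch of the throttle described above: start a new invocation only when
# fewer than MAX_RUNNING instances are active, else wait for one to finish.
# running_instances/start_instance/finish_one are stubs, not real dsjob calls.
MAX_RUNNING=10
TOTAL=25            # total invocations we want to launch overall
started=0
active=0
running_instances() { echo "$active"; }        # stub: report active count
start_instance()    { active=$((active + 1)); } # stub: pretend to launch one
finish_one()        { active=$((active - 1)); } # stub: pretend one completed

while [ "$started" -lt "$TOTAL" ]; do
    if [ "$(running_instances)" -lt "$MAX_RUNNING" ]; then
        start_instance
        started=$((started + 1))
    else
        finish_one       # real script: sleep 30 and re-poll instead
    fi
done
echo "$started"   # → 25
```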
There would be a process for each operator per node, so it also depends on the number of stages you have in your job.
How long does each job run for? If it's just a few minutes, try adding a sleep of, say, 5-10 seconds between one set of jobs and 10-15 seconds before the next set, so that you kick them off in batches with these time lags.
But the error is due to a read/write issue with the temp pipe in the RT_SCTEMP folder during job execution. As Ray suspected, try to avoid inter-process row buffering if you have it enabled in your job.
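If an aborted run can leave a pipe behind in RT_SCTEMP, one thing worth checking before a re-run is whether stale fifos still exist. The sketch below simulates that check in a scratch directory; the project path and job name are placeholders, not real paths from this thread.

```shell
#!/bin/sh
# Sketch: look for leftover named pipes (fifos) from a prior invocation.
# A temp directory simulates the project's RT_SCTEMP folder; "Jobname" and
# the invocation ID are placeholders.
tmp=$(mktemp -d)
mkdir "$tmp/RT_SCTEMP"
mkfifo "$tmp/RT_SCTEMP/Jobname.inv01.fifo"     # simulate a stale pipe
# -type p matches named pipes only
stale=$(find "$tmp/RT_SCTEMP" -type p -name 'Jobname.*.fifo' | wc -l)
echo "$stale"   # → 1
rm -rf "$tmp"
```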
Impossible doesn't mean 'it is not possible' actually means... 'NOBODY HAS DONE IT SO FAR'
Hi,
Continuing on this thread,
I just tried running the flow and tried to capture all the processes started as part of it by grepping the processes running under the user ID. I can see too many osh processes, like the one below:
userid 307528 749948 0 15:19:48 - 0:00 /appl/infoserver/Server/PXEngine/bin/osh -APT_PMsectionLeaderFlag servername 10006 0 30 node1 servername 1209158387.877358.12703a 0
There were more than 20 such processes on the server while the interface was running. Any idea what this process stands for? Can I reduce these processes without reducing the number of stages in the job?
As noted throughout, this is normal behavior for an EE job. Unless operators are specifically being combined, you could, in the worst case, expect the number of processes for a job to equal the number of stages. If you have 10 stages and 10 instances of that job running, you've now spawned somewhere around 100 processes (give or take, depending on job semantics, stages chosen, etc.). You can easily overwhelm your machine in this fashion. With Server jobs, also as noted above, it's a little more predictable, but even with passive stages in the middle of a stream and inter-process row buffering turned on, you can end up with a lot of processes. The machine appears to be overwhelmed.
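A rough back-of-envelope along those lines can be sketched as below. The counts are purely illustrative; the actual number depends on operator combination, the stages used, and the configuration file's node count.

```shell
#!/bin/sh
# Back-of-envelope process estimate for a parallel job, assuming (roughly)
# one player per operator per node, one section leader per node, and one
# conductor per run. All numbers here are illustrative, not measured.
OPERATORS=10   # operators left after any operator combination
NODES=2        # logical nodes in the APT config file
INSTANCES=10   # concurrent invocations of the job

per_run=$(( OPERATORS * NODES + NODES + 1 ))   # players + leaders + conductor
total=$(( per_run * INSTANCES ))
echo "$per_run $total"   # → 23 230
```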
Ernie
Ernie Ostic
blogit!
<a href="https://dsrealtime.wordpress.com/2015/0 ... ere/">Open IGC is Here!</a>
You are absolutely right....my paragraph above assumes a single node...
Ernie
Ernie Ostic
blogit!
<a href="https://dsrealtime.wordpress.com/2015/0 ... ere/">Open IGC is Here!</a>