Maximum number of instances of a DS job at a time

Post questions here relative to DataStage Enterprise/PX Edition for such areas as Parallel job design, Parallel datasets, BuildOps, Wrappers, etc.

Moderators: chulett, rschirm, roy

bman
Participant
Posts: 33
Joined: Wed Oct 10, 2007 5:42 pm

Maximum number of instances of a DS job at a time

Post by bman »

Hi

Is there a limit to the number of instances of a multi-instance job that can be run simultaneously?

I tried running one of my multi-instance jobs: first 10 instances at a time, then 30 at a time, and then 80 at a time.

No job failures were reported when I ran 10 instances at a time.
A couple of instances failed when I ran 30 instances at a time.
Around 20 instances failed when I ran 80 instances at a time.

The Director log had the error below for the failed jobs:

Error setting up internal communications (fifo RT_SCTEMP/Jobname.invocationid.fifo
LOCKED STATUS () -1); file is locked

I have checked and confirmed that all invocation IDs are unique and that the output dataset file written by each instance is unique.

Also, I have noticed that when I run ps -ef | grep <datastage user> there are far too many processes under the DataStage user. I was expecting the number of processes to equal the number of instances, but when I checked during the 80-instance test, at one point there were more than 1000 processes for the DataStage user.
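One way to make sense of that long process list is to group it by command name. The sketch below uses a fixed sample instead of live `ps -ef` output so the result is reproducible; the user name "dsadm", the sample lines, and the field position are illustrative, not taken from this thread.

```shell
# Group process listing lines by command (field 8 in this sample layout)
# and count each. With real data you would pipe `ps -ef` in instead.
sample='dsadm 101 1 0 10:00 - 0:00 osh
dsadm 102 1 0 10:00 - 0:00 osh
dsadm 103 1 0 10:00 - 0:00 osh
dsadm 104 1 0 10:00 - 0:00 dsapi_slave'

summary=$(printf '%s\n' "$sample" \
    | awk '{ n[$8]++ } END { for (c in n) print n[c], c }' \
    | sort -rn)
echo "$summary"
```

With the sample above this prints "3 osh" followed by "1 dsapi_slave", showing at a glance which commands dominate.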
ray.wurlod
Participant
Posts: 54607
Joined: Wed Oct 23, 2002 10:52 pm
Location: Sydney, Australia
Contact:

Post by ray.wurlod »

There is no DataStage limit to the number of instances of a multi-instance job that can be running at the same time, provided each has a unique invocation ID.

There obviously are resource limitations.

When you include any inter-process communication in the mix, there seems to be some kind of collision when creating the pipes through which it takes place.

This may or may not be a bug. Have you asked your support provider?

If the job aborts and you re-run with the same invocation ID too soon, then the pipe from the previous invocation may still show as being in use.
IBM Software Services Group
Any contribution to this forum is my own opinion and does not necessarily reflect any position that IBM may hold.
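Following on from the point about a previous invocation's pipe still showing as in use: one way to check for leftover fifos before re-running with the same invocation ID is to look for named pipes in the project's RT_SCTEMP directory. This is a sketch; the project path is a placeholder, and the name pattern reuses the Jobname.invocationid form from the error message above.

```shell
# Look for leftover named pipes (-type p) from a previous invocation.
# PROJECT_DIR is a placeholder for the real project directory.
PROJECT_DIR=/path/to/project
leftover=$(find "$PROJECT_DIR/RT_SCTEMP" \
    -name 'Jobname.invocationid.*fifo*' -type p 2>/dev/null)
echo "$leftover"
```

If the command prints anything, the previous invocation's pipe is still present and re-running with that invocation ID may hit the locked-fifo error.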
bkumar103
Participant
Posts: 214
Joined: Wed Jul 25, 2007 2:29 am
Location: Chennai

Post by bkumar103 »

This all depends on resource availability, such as the number of connections to the database, the file system, etc.
Birendra
kduke
Charter Member
Posts: 5227
Joined: Thu May 29, 2003 9:47 am
Location: Dallas, TX
Contact:

Post by kduke »

Usually the operating system cannot keep up with this many processes starting at the same time. You can get more without failures if you sleep in between starts.
Mamu Kim
bman
Participant
Posts: 33
Joined: Wed Oct 10, 2007 5:42 pm

Post by bman »

Hi,

That is what I am doing now: running only 10 instances at a time. Each time a new instance needs to be started, the script gets a count of the total number of job instances running and starts a new instance only if the count is less than 10; otherwise it sleeps for a while.

But I am curious about the number of processes started with each run of a job. I can see more than 5 processes per job. Any pointers on what these processes are, and whether the number is constant per job or varies depending on the stages used?
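The throttling approach described above can be sketched roughly as follows. The job name, the commented-out dsjob call, and the numbers are placeholders; a real script would count live instances of the actual job.

```shell
#!/bin/sh
# Throttle sketch: start a new instance only when fewer than MAX are
# running; otherwise sleep and re-check. Names are placeholders.
MAX=10
TOTAL=80

running_count() {
    # Count live instances; "[J]obname" avoids matching the grep itself.
    ps -ef | grep -c "[J]obname\."
}

started=0
id=1
while [ "$id" -le "$TOTAL" ]; do
    while [ "$(running_count)" -ge "$MAX" ]; do
        sleep 5    # back off until a slot frees up
    done
    # dsjob -run ProjectName "Jobname.inv$id"   # hypothetical invocation
    started=$((started + 1))
    id=$((id + 1))
done
```

The key design point is that the limit is enforced by a live count rather than a fixed schedule, so a slow instance automatically holds back the queue.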
kumar_s
Charter Member
Posts: 5245
Joined: Thu Jun 16, 2005 11:00 pm

Post by kumar_s »

There would be a process for each operator per node, so it also depends on the number of stages you have in your job.
How long does each job run for? If it's just a few minutes, try adding a sleep of, say, 5-10 seconds between one set of jobs and 10-15 seconds before the next set, so that you kick off batches with these time lags.

But the error is due to a read/write issue with the temporary pipe in the RT_SCTEMP folder during job execution. As Ray suspected, try to avoid inter-process buffering if you have it enabled in your job.
Impossible doesn't mean 'it is not possible' actually means... 'NOBODY HAS DONE IT SO FAR'
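The process-per-operator-per-node point can be illustrated with rough back-of-the-envelope arithmetic. All the numbers here are hypothetical, and the overhead model (one conductor plus one section leader per node, per instance) is an approximation of the parallel engine's startup, not an exact formula.

```shell
stages=10       # operators in the job (assuming no operator combining)
nodes=4         # nodes in the configuration file
instances=10    # concurrent invocations

players=$((stages * nodes * instances))   # one player per operator per node
overhead=$(( (1 + nodes) * instances ))   # conductor + section leaders per instance
total=$((players + overhead))
echo "$total"
```

With these hypothetical numbers the estimate is 450 processes, which shows how quickly a modest job multiplied across nodes and instances reaches the thousand-plus processes observed above.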
kduke
Charter Member
Posts: 5227
Joined: Thu May 29, 2003 9:47 am
Location: Dallas, TX
Contact:

Post by kduke »

You are overwhelming your server. The jobs are failing for lack of resources, like Ray said.
Mamu Kim
bman
Participant
Posts: 33
Joined: Wed Oct 10, 2007 5:42 pm

Post by bman »

Hi,
Continuing on this thread,

I just tried running the flow and capturing all the processes started as part of it by grepping the processes running under the user ID... I can see many osh processes like the one below:

userid 307528 749948 0 15:19:48 - 0:00 /appl/infoserver/Server/PXEngine/bin/osh -APT_PMsectionLeaderFlag servername 10006 0 30 node1 servername 1209158387.877358.12703a 0

There were more than 20 such processes on the server while the interface was running. Any idea what this process is? Can I reduce these processes without reducing the number of stages in the job?
kduke
Charter Member
Posts: 5227
Joined: Thu May 29, 2003 9:47 am
Location: Dallas, TX
Contact:

Post by kduke »

Sure, run a different config file.
Mamu Kim
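A different (smaller) configuration file means fewer nodes and therefore fewer player processes, without changing the job design. A minimal single-node configuration file pointed to by APT_CONFIG_FILE looks roughly like this; the fastname and resource paths are placeholders for your environment.

```
{
	node "node1"
	{
		fastname "servername"
		pools ""
		resource disk "/path/to/datasets" {pools ""}
		resource scratchdisk "/path/to/scratch" {pools ""}
	}
}
```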
eostic
Premium Member
Posts: 3838
Joined: Mon Oct 17, 2005 9:34 am

Post by eostic »

As noted throughout, this is normal behavior for an EE job. Unless you are specifically combining operators, you can, in the worst case, expect the number of processes for a job to equal the number of stages. If you have 10 stages and 10 instances of that job running, you've now spawned somewhere around 100 processes (give or take, depending on job semantics, stages chosen, etc.). You can easily overwhelm your machine this way. With Server jobs, also as noted above, it's a little more predictable, but even with passive stages in the middle of a stream and inter-process row buffering turned on, you can end up with a lot of processes. The machine appears to be overwhelmed.

Ernie
Ernie Ostic

blogit!
<a href="https://dsrealtime.wordpress.com/2015/0 ... ere/">Open IGC is Here!</a>
kduke
Charter Member
Posts: 5227
Joined: Thu May 29, 2003 9:47 am
Location: Dallas, TX
Contact:

Post by kduke »

Ernie

You forgot the number of nodes adds to this.
Mamu Kim
eostic
Premium Member
Posts: 3838
Joined: Mon Oct 17, 2005 9:34 am

Post by eostic »

You are absolutely right... my paragraph above assumes a single node.

Ernie
Ernie Ostic

blogit!
<a href="https://dsrealtime.wordpress.com/2015/0 ... ere/">Open IGC is Here!</a>