RTI job instances do not die

Dedicated to DataStage and DataStage TX editions featuring IBM® Service-Oriented Architectures.

Moderators: chulett, rschirm

luie
Premium Member
Posts: 16
Joined: Sun Jan 25, 2004 3:48 pm

RTI job instances do not die

Post by luie »

We have an "Always On" RTI job that runs with a minimum of 4 instances and a maximum of 10. The job design is:
RTI Input -----> Stages --------> RTI Output

Request volume is currently very low: up to 2,500 per day, arriving in small bursts every 5 minutes. The problem is that some job instances start but never die, even though their Time-To-Live is 3 hours (2 hours when idle). Director shows them as running, but there are no detailed logs. The Unix processes for those instances are still running.

Does anybody have an idea why this is happening? We keep a minimum of 4 instances to anticipate an increase in requests. When the minimum was previously set to 1, the job timed out and no new instance was ever spawned.

Would really appreciate your input.

Thanks!
Last edited by luie on Tue Jul 10, 2007 9:38 pm, edited 1 time in total.
luie
eostic
Premium Member
Posts: 3838
Joined: Mon Oct 17, 2005 9:34 am

Post by eostic »

Hi....

What is your min setting? Time to Live is the maximum time any particular instance stays running, but it is separate from the minimum number of instances. If min is 4 and Time to Live is 3 hours, then at the end of three hours THAT instance will be killed and immediately restarted to meet the min of 4. If the min is 0 and the system, once started, increases instances to 4, then when instances are taken down (by this or by the reclaim time-out) the count will go back to zero, but that's a different concept. Minimum instances are always maintained, by design.
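
To make the min / Time to Live interaction concrete, here is a rough Python sketch of the rule described above. It is only a toy model, not RTI code: `Instance`, `enforce_pool`, and the hour-based clock are made-up names for illustration, and the real agent's scheduling is certainly more involved.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class Instance:
    started_at: float  # hypothetical clock, in hours


def enforce_pool(instances: List[Instance], now: float,
                 ttl_hours: float, min_instances: int) -> List[Instance]:
    """Retire instances that have reached their TTL, then top up to the minimum."""
    # Keep only instances still younger than the Time to Live.
    alive = [i for i in instances if now - i.started_at < ttl_hours]
    # The minimum is always maintained: start replacements right away.
    while len(alive) < min_instances:
        alive.append(Instance(started_at=now))
    return alive


# min = 4, TTL = 3 hours: after 3 hours every instance is replaced, count stays 4.
pool = enforce_pool([], now=0.0, ttl_hours=3.0, min_instances=4)
pool_after_ttl = enforce_pool(pool, now=3.0, ttl_hours=3.0, min_instances=4)

# min = 0: the same four instances expire and nothing is restarted.
drained = enforce_pool(pool, now=3.0, ttl_hours=3.0, min_instances=0)
```

In this model an instance can never be older than its Time to Live, so anything that is days old has fallen outside whatever is enforcing the rule.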

Ernie
luie
Premium Member
Posts: 16
Joined: Sun Jan 25, 2004 3:48 pm

Post by luie »

The minimum is 4. Before this, we had the minimum set to 1, but at that setting the client application received time-outs and no new job instance was ever created. We felt the only way around the "failure to spawn new instances" was to set the minimum higher than 1.

The jobs are running and there are no time-outs now. However, once in a while, job instances won't die. At one point we had 10 instances running: 2 were five days old and 2 were two days old. Their Time To Live is 3 hours.

When we disable the job from the console, the old processes are not stopped; only the four active processes are. So somewhere the RTI Agent lost control of the other instances, and we don't know where. DataStage Director has no detailed logs for those "run-away" job instances.
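
For anyone who wants to spot these automatically, the check amounts to flagging any instance whose age exceeds the Time To Live. Below is a minimal Python sketch of that comparison. The PIDs, the reference time, and the `find_overdue` helper are invented for illustration; collecting the real process start times (for example from the Unix process list) is left out, since that part depends on the platform.

```python
from datetime import datetime, timedelta
from typing import Dict, List, Tuple


def find_overdue(instances: Dict[int, datetime], now: datetime,
                 ttl: timedelta) -> List[Tuple[int, timedelta]]:
    """Return (pid, age) for every instance that has outlived its TTL.

    `instances` maps a process id to the time that process started.
    """
    return sorted(
        (pid, now - started)
        for pid, started in instances.items()
        if now - started > ttl
    )


# Made-up snapshot mirroring the situation above: two instances five days old,
# two instances two days old, and two recent ones, with a 3-hour Time To Live.
now = datetime(2007, 7, 10, 12, 0)
snapshot = {
    1001: now - timedelta(days=5),
    1002: now - timedelta(days=5),
    1003: now - timedelta(days=2),
    1004: now - timedelta(days=2),
    1005: now - timedelta(hours=1),
    1006: now - timedelta(minutes=30),
}

overdue = find_overdue(snapshot, now, ttl=timedelta(hours=3))
for pid, age in overdue:
    print(f"pid {pid} has been running for {age}, past the TTL")
```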
luie
eostic
Premium Member
Posts: 3838
Joined: Mon Oct 17, 2005 9:34 am

Post by eostic »

OK... there are multiple things going on here. I can't say why min 1 would ever have come down to 0 because of time-to-live. If it did, it sounds like an old anomaly. Orphaned job instances are possible, although I've generally only seen them during development, when changes are being made, jobs are started and stopped in ways other than the console, jobs abort because the logic isn't worked out yet, etc. What exact release of RTI are you using? The final number of 7.5.2rX is especially important.

Are these Server or EE jobs? Anything interesting in them? Database lookups, QualityStage, etc.? I suspect that the extra instances never receive any new rows.

Ernie
luie
Premium Member
Posts: 16
Joined: Sun Jan 25, 2004 3:48 pm

Post by luie »

Thanks Ernie for the reply.

This is an EE job (7.5x2) with a QualityStage plugin, transformations, and 2 DB2 lookups. Even if the extra instances never receive rows, shouldn't there be log entries like "Starting job ...." and so on? There was nothing in the logs, though; they were empty for those orphaned job instances.

The jar file shows the RTI implementation is v7.5.2.0.
luie
eostic
Premium Member
Posts: 3838
Joined: Mon Oct 17, 2005 9:34 am

Post by eostic »

Ah. The infamous x2. 7.5x2 is/was a special release of EE for Windows, and it never formally supported RTI (or a lot of other things, like some of the packs). To be honest, I am surprised that it has worked this well for you up till now. Generally, jobs don't even deploy on it. No guarantees, as this issue might be 100% unrelated, but that may well be at the core of your problem. At this point both 7.5x2 and 7.5.2r0 of RTI are very old releases. Not sure how best to advise you on this one. How tolerable are the orphans, and how frequently do they occur? Do you have other Jobs as Services that work successfully? Are any of them Server jobs?

Ernie