Problem with RTI jobs

AccentureTA · Post by **AccentureTA** » Thu Dec 27, 2007 2:21 pm

We have the following setup. Our online application is bound to DataStage RTI jobs through Text over JMS bindings. Each RTI job has its own input/output destinations and corresponding MDBs deployed in WebSphere.

Recently we had the following issue -

Listener port (in WebSphere) corresponding to the input Queue for job A went down with the following error -
CNTR0020E: Non-application exception occurred while processing method "invokePartial" on bean "BeanId(RTIServer#ejb.jar#com.ascentialsoftware.rti.server.call.RTICall, 116e796a3f2)". Exception data: javax.ejb.EJBException: Exception trying to invoke operation Job A: Job 1198504265569_PRC.1198504265569 aborted.

However, when we checked DataStage Director, the instance number mentioned in the above error was associated with a different RTI job, jobB, and this job was continuously aborting (even though its original running instance was still running). Job A had no aborts. We were unable to stop these aborts even after disabling Job A & B through the RTI Console.

The RTIAgent also had the following entries for jobB during the time of the above issue -

RTIAgent.log.1:2007-12-24 11:14:47,025 [1198504265569_PRC] ERROR com.ascentialsoftware.rti.agent.handler.datastage.PipeReceiver - [1198504265569_PRC]error during call to init: dspipe_init(1127240): open(/tmp/ade.ProdRateEngine.jobB.1198504265569.rtiOutput) - A file or directory in the path name does not exist.
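For anyone triaging similar entries, here is a minimal Python sketch that pulls the instance ID and the job name out of an RTIAgent log line like the one above, so the two can be compared against the job named in the WebSphere error. The function name `parse_pipe_error` is hypothetical, and the pipe file layout `ade.<project>.<job>.<instance>.rtiOutput` is only an assumption inferred from this single log entry, not documented behavior.

```python
import re
from typing import Optional

# Thread tag such as "[1198504265569_PRC]" seen in the log entry above.
THREAD_RE = re.compile(r"\[(\d+)_PRC\]")
# Path passed to open(...) in the dspipe_init error message.
PIPE_RE = re.compile(r"open\(([^)]+)\)")

SAMPLE = (
    "RTIAgent.log.1:2007-12-24 11:14:47,025 [1198504265569_PRC] ERROR "
    "com.ascentialsoftware.rti.agent.handler.datastage.PipeReceiver - "
    "[1198504265569_PRC]error during call to init: dspipe_init(1127240): "
    "open(/tmp/ade.ProdRateEngine.jobB.1198504265569.rtiOutput) - "
    "A file or directory in the path name does not exist."
)


def parse_pipe_error(line: str) -> Optional[dict]:
    """Extract the instance ID and pipe details from one log line.

    Returns None when the line does not look like a pipe-open error.
    """
    thread = THREAD_RE.search(line)
    pipe = PIPE_RE.search(line)
    if not thread or not pipe:
        return None
    path = pipe.group(1)
    parts = path.rsplit("/", 1)[-1].split(".")
    # Assumed layout: ade.<project>.<job>.<instance>.rtiOutput
    if len(parts) != 5:
        return None
    return {
        "instance": thread.group(1),
        "pipe_path": path,
        "job": parts[2],
        "pipe_instance": parts[3],
    }


if __name__ == "__main__":
    info = parse_pipe_error(SAMPLE)
    print(info)
```

Running it over the full RTIAgent log and grouping by `job` would show which job each failing instance ID was actually tied to on the agent side.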
Any ideas/help anyone?

Any inputs would be greatly appreciated!

Thanks
Anand

eostic · Post by **eostic** » Thu Dec 27, 2007 3:20 pm

Is it a new application (mdb jar), or one that was just recently deployed? Just guessing at the moment, but some generic searches on the "non-application exception" in WAS always seem to come up with EAR or JAR configuration/installation issues.....

Does this one always abort, or is it random?

Does removal of the application and re-deployment have any impact?

Ernie

AccentureTA · Post by **AccentureTA** » Thu Dec 27, 2007 3:40 pm

Thank you for responding.

eostic wrote: Is it a new application (mdb jar), or one that was just recently deployed? Just guessing at the moment, but some generic searches on the "non-application exception" in WAS always seem to come up with EAR or JAR configuration/installation issues.....

Does this one always abort, or is it random?

Does removal of the application and re-deployment have any impact?

Ernie

No, it is neither new nor recently re-deployed. This job (job B) was continuously aborting (every minute or so), and we had to process the messages in the queue for jobA via another DataStage server (isolating this DataStage server by stopping its RTIAgent) in order for the aborts to stop.

eostic · Post by **eostic** » Thu Dec 27, 2007 5:52 pm

Ok...then some more generic questions and things to think about....something must have changed --- let's try and figure out what it is/was, and/or trigger some thoughts for you ......Is JobB doing any work when it aborts? (are there rows running thru it, or is it having trouble "starting").....

This is an EE job...any changes in config files? Do _any_ of the instances stay up and running?

If not, why does it abort? What's in the DS logs as far as the abort is concerned (they may be related or may not be.....the WAS error might just be a consequence). How many messages are in the queue when it aborts? Does it abort if/when there are no messages in the queue?

Ernie

AccentureTA · Post by **AccentureTA** » Thu Dec 27, 2007 6:07 pm

Yes, I am pretty sure the WAS error was a consequence of the DS abort. The problem here is the listener port corresponding to jobA went down and the instance # matches with jobB.

The aborting instances of jobB do not perform any work when they abort. They abort due to the following error:
main_program: File archive: Trouble creating file "/apps2994/scratch/APTcs950656148e4cbd"

The problem here is that the filesystem mentioned above does not belong to the server on which the job is running. The RTI Console has been correctly configured for the scratch space, and the original running instances inherited the right scratch space parameter. However, the aborted instances had the wrong scratch space parameter, leading to the above error.
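The scratch space mismatch described above can be checked directly on each server. The following is a minimal sketch, assuming Python is available on the DataStage host; the function name `check_scratch_dir` is hypothetical, and the path is simply the one from the error message. It tests whether the directory exists and whether a temporary file can be created in it, which is roughly what the failing instances were attempting.

```python
import os
import tempfile


def check_scratch_dir(path: str):
    """Return (ok, reason) for a candidate scratch directory."""
    if not os.path.isdir(path):
        return False, "directory does not exist on this server"
    try:
        # Create and immediately discard a temporary file in the directory.
        with tempfile.NamedTemporaryFile(dir=path, prefix="scratchcheck"):
            pass
    except OSError as exc:
        return False, f"cannot create file: {exc}"
    return True, "ok"


if __name__ == "__main__":
    # Path taken from the abort message; adjust per server.
    print(check_scratch_dir("/apps2994/scratch"))
```

Running this on each of the DataStage servers would show which hosts actually have the filesystem named in the abort message.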

Also, during all this time, the original running instance of jobB never aborted and was doing its normal work, and so was jobA.

The number of messages in the queue was varying, since it is an online app, and the aborts of job B stopped only after the messages in the queue were processed by sending them to jobA on another DataStage server (we have 4 WAS and 4 DataStage servers in this environment).

eostic · Post by **eostic** » Thu Dec 27, 2007 6:34 pm

Hmm.. Quite bizarre that the queue for one would indicate the instance ID of another..... let's look at some other things... are both Operations in the same Service/jar ? If so, it might be a good idea to try redeploying with separate Service, separate jar, each with single Operation (with their own Queue definitions), at least until the issue is determined.

Ernie

AccentureTA · Post by **AccentureTA** » Thu Dec 27, 2007 6:35 pm

Yes... it is a very weird problem. These jobs each have separate services/operations and MDBs.

eostic · Post by **eostic** » Fri Dec 28, 2007 8:43 am

Well, something changed. I'd probably try things such as "stopping" (at the WAS console) one jar or the other, and see if the pattern still occurs, or if Job B continues to fail if it is the only job that has a working MDB, and/or if it is the only job that is "enabled". Also, review the job parameter default values at the RTI console.... change them. Perhaps something was altered in the meta data for the fixed job parms.

Does the error reproduce if you entirely re-build and re-deploy the jar?

Ernie

AccentureTA · Post by **AccentureTA** » Fri Dec 28, 2007 10:17 am

eostic wrote:Well, something changed. I'd probably try things such as "stopping" (at the WAS console) one jar or the other, and see if the pattern still occurs, or if Job B continues to fail if it is the only job that has a working MDB, and/or if it is the only job that is "enabled". Also, review the job parameter default values at the RTI console.... change them. Perhaps something was altered in the meta data for the fixed job parms.

Does the error reproduce if you entirely re-build and re-deploy the jar?

Ernie

Unfortunately, this being our Production environment, we don't have the flexibility to try out some of the things suggested. However, the weird behavior stopped once the batch of messages was processed through a different DataStage server. Also, during the issue, we did confirm that the properties in the RTI Console were set per requirement and were correct, and the original running instances did inherit the correct properties from the RTI Console. We will also be opening a ticket with IBM on this. Please do continue to provide your valuable inputs! Thank you very much for your time and efforts.

eostic · Post by **eostic** » Fri Dec 28, 2007 1:59 pm

Ok...good luck. One question, based on your notes...it sounds like this particular Operation has two Jobs attached? (i.e., Job B on Server "X" and also on Server "Y")? That's a great failover and load balancing scenario, but is more complicated for debugging.

Ernie

AccentureTA · Post by **AccentureTA** » Fri Dec 28, 2007 2:09 pm

eostic wrote: Ok...good luck. One question, based on your notes...it sounds like this particular Operation has two Jobs attached? (i.e., Job B on Server "X" and also on Server "Y")? That's a great failover and load balancing scenario, but is more complicated for debugging.

Ernie

Yes, we have about 18 operations, each having 4 jobs (4 different servers) attached to them.