Now, as to the problems. The first inkling came when a QA person contacted me about an error. We have an MI job that runs 16 instances simultaneously, and one had failed with an error message that I haven't personally seen and don't recall being posted:
Code:
JobName.Invocation.Xform Unable to create new process. Will try again.

This is a Fatal log message and, needless to say, the job aborted. The other 15 were allegedly running. When I checked, they had been running for 12 hours and their monitors looked something like this:
This image is actually from later in the train wreck, but the situation is the same: the first xform is still "starting" while all the rest are "running". I killed all of these jobs and recycled the DataStage Server. The next time the 16 were cranked up, the first 8 actually started, but the second 8 never got all of their transformers running and looked like the image linked above. Eventually the first 8 completed but didn't seem to realize it:
All but the first xform finished. It *is* finished but hasn't set the status yet. I went in to start killing processes and nuked the PIDs related to the first invocation. At that point, instances 9 and 10 aborted and 11 through 16 actually started to process rows. I restarted 9 and 10, so the last 8 invocations are all running now.
I have zero confidence that they will all finish normally and fully expect them to get 'stuck' as well. I know this is a lot of rambling, but I'm wondering what people's thoughts are. While typing this up I decided to check the &PH& directory and found roughly 1000 files there. I will clear it of all extraneous files and see if that helps.
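For anyone wanting to do the same kind of cleanup, here is a rough sketch of how the old files could be found and removed by age. This is not a DataStage utility, just plain Python against a directory. The function names, the 7-day threshold and the example path are all my own assumptions, and I would only run it with `dry_run=False` while no jobs are active, since running jobs presumably still have files open in there.

```python
import os
import time
from pathlib import Path


def stale_files(directory, max_age_days=7, now=None):
    """Return regular files in `directory` last modified more than
    `max_age_days` ago, oldest first. Subdirectories are ignored."""
    now = time.time() if now is None else now
    cutoff = now - max_age_days * 86400  # threshold as an epoch timestamp
    found = [
        entry
        for entry in Path(directory).iterdir()
        if entry.is_file() and entry.stat().st_mtime < cutoff
    ]
    return sorted(found, key=lambda p: p.stat().st_mtime)


def purge(directory, max_age_days=7, dry_run=True, now=None):
    """Delete stale files, or only report them when dry_run is True.
    Returns the list of affected paths either way."""
    victims = stale_files(directory, max_age_days, now)
    if not dry_run:
        for path in victims:
            path.unlink()
    return victims


if __name__ == "__main__":
    # Hypothetical location: substitute the real project directory.
    ph_dir = "/path/to/Projects/MyProject/&PH&"
    if os.path.isdir(ph_dir):
        for path in purge(ph_dir, max_age_days=7, dry_run=True):
            print(path)
```

The dry run is the default on purpose, so the list can be checked before anything is deleted. If the path is passed on a shell command line instead, remember to quote it because of the ampersands in `&PH&`.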
Thanks.

</a>