Strange Server errors
Posted: Sun Aug 17, 2008 9:16 pm
Have something odd going on, with errors and situations I've never seen before. These are issues with old jobs on a secondary QA server, jobs that run fine in 3 other environments but haven't been run on this particular box for about a month. And no, I'm not aware of anything that may have changed but I'm gonna try and find out. Keep in mind as you read this that there is no row buffering enabled in any of the jobs or in the project defaults.
Now, as to the problems. First inkling of problems was when a QA person contacted me with an error. We have an MI job that runs 16 instances simultaneously and one had failed with an error message I haven't personally seen and don't recall being posted:
Code:
JobName.Invocation.Xform Unable to create new process. Will try again.
This is a Fatal log message and, needless to say, the job aborted. The other 15 were allegedly running. When I checked, they had been running for 12 hours and their monitors looked something like this:
This image is actually from later in the trainwreck but the situation is the same - first xform still "starting" with all the rest "running". I killed all of these jobs and recycled the DataStage Server. Next time the 16 were cranked up, the first 8 actually started and the second 8 never got all of the transformers running, looking like the image linked above. Eventually, the first 8 completed but didn't seem to realize it:
All but the first xform finished. It *is* finished but hasn't set the status yet. I went in to start killing processes and nuked the PIDs related to the first invocation. At that time, instances 9 and 10 aborted and 11 through 16 actually started to process rows. I restarted 9 and 10 so that the last 8 invocations are now running.
I have zero confidence that they will all finish like normal and fully expect them to get 'stuck' as well. I know this is a lot of rambling but I'm wondering what people's thoughts are. While typing this up I decided to check the &PH& directory and found basically 1000 files there. Will clear it of all extraneous files and see if that helps.
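For the &PH& cleanup, here is a minimal sketch of one way to do it, not an official procedure. It assumes "extraneous" means files not modified in the last several days and that no jobs are running in the project while it runs. The function names, the 7-day threshold, and the example path are all hypothetical.

Code:
```python
import time
from pathlib import Path


def find_stale_files(directory, max_age_days=7, now=None):
    """Return regular files in `directory` last modified more than
    `max_age_days` days ago, sorted by path."""
    now = time.time() if now is None else now
    cutoff = now - max_age_days * 86400  # seconds per day
    return sorted(
        p for p in Path(directory).iterdir()
        if p.is_file() and p.stat().st_mtime < cutoff
    )


def remove_files(paths, dry_run=True):
    """Delete the given files and return how many were handled.
    With dry_run=True nothing is deleted; the count is only reported."""
    handled = 0
    for p in paths:
        if not dry_run:
            p.unlink()
        handled += 1
    return handled


# Example usage (the project path below is a placeholder, not a real location):
#   stale = find_stale_files("/path/to/Projects/MyProject/&PH&", max_age_days=7)
#   print(len(stale), "stale files found")
#   remove_files(stale, dry_run=False)
```

Running it first with dry_run=True shows how many files would go before anything is actually deleted.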
Thanks.