DS 8.5 Parallel Job hangs creates Phantom job

Post questions here relative to DataStage Enterprise/PX Edition for such areas as Parallel job design, Parallel datasets, BuildOps, Wrappers, etc.

Moderators: chulett, rschirm, roy

eanolan
Premium Member
Posts: 4
Joined: Thu Aug 13, 2015 9:23 am

DS 8.5 Parallel Job hangs creates Phantom job

Post by eanolan »

I am hoping that one of the gurus out there might be able to shed some light on a problem we are seeing. Our DataStage support staff are at a loss, and we have an unhappy user group.

Below is the background, working from outside DataStage in to the job itself.

Outside of Datastage:
We have a process where users upload a file under 50 MB to an internal webpage. The file is scanned for viruses and then FTP'd across the network to land in a directory called "input" on the DS server. A Unix script checks the "input" directory and then calls the DS engine to run the sequence job, which is a controller for the file-processing jobs.
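For readers unfamiliar with the pattern, that watcher script can be sketched roughly as below. The project name (MyProject), sequence job name (Seq_Process_File), parameter names, and the dsjob path are illustrative assumptions, not our actual values; DSJOB is parameterised so the sketch can be exercised with a stub.

```shell
# Default to the engine's dsjob binary; override DSJOB to test the logic.
DSJOB=${DSJOB:-/opt/IBM/InformationServer/Server/DSEngine/bin/dsjob}

# Submit the controller sequence job for one landed file, passing the
# file name and directory in as job parameters.
submit_file() {
    dir=$(dirname "$1")
    name=$(basename "$1")
    "$DSJOB" -run \
        -param FileName="$name" \
        -param Directory="$dir" \
        -jobstatus MyProject Seq_Process_File
}

# Scan the input directory and submit anything found there.
watch_input() {
    for f in "$1"/*; do
        [ -f "$f" ] && submit_file "$f"
    done
}
```

A cron entry or polling loop would then call `watch_input /path/to/input` on whatever schedule the environment uses.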

Inside Datastage:

Sequence job looks like this:

Job Parameters (under Job Properties > Parameters): File name (passed in), Directory (passed in), and APT_CONFIG_FILE, which points to a configuration file defining 10 nodes that all use the same resource location with 50 GB of space.
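For reference, a configuration file of the shape described (every node pointing at the same 50 GB filesystem) looks roughly like this. The hostname and paths are made-up placeholders, and nodes 3 through 10 would repeat the same pattern:

```
{
	node "node1"
	{
		fastname "dsserver"
		pools ""
		resource disk "/scratch50g/datasets" {pools ""}
		resource scratchdisk "/scratch50g/scratch" {pools ""}
	}
	node "node2"
	{
		fastname "dsserver"
		pools ""
		resource disk "/scratch50g/datasets" {pools ""}
		resource scratchdisk "/scratch50g/scratch" {pools ""}
	}
}
```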

First stage of sequence job:

Job Activity: starts a parallel job called Process_File, which contains a Sequential File stage -> Transformer (which maps the inputs) -> web service transformer -> Filter (on a status field returned from the web service call) -> Sequential File (successes, to the output directory) and Sequential File (exceptions, to the exception directory).

Second Stage of Sequence Job:

Execute Command: after Process_File finishes, fires off a script that checks the exception directory and returns true if a file exists.

If true, then third stage of sequence job:
The Process_File parallel job runs again, reprocessing the file until all exceptions are resolved or three attempts have been made. If there are still exceptions after that, they are written to a database for users to review and correct.

If false, then the job finishes.
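The exception-directory check run by the Execute Command stage can be sketched as a tiny shell function; the directory path comes in as an argument, and the exit status is what the sequence's trigger branches on. This is an illustrative sketch, not our actual script:

```shell
# Return 0 ("true") if at least one regular file exists in the given
# exception directory, non-zero otherwise. The sequence job's trigger
# branches on this exit status.
has_exceptions() {
    for f in "$1"/*; do
        [ -f "$f" ] && return 0   # found an exception file
    done
    return 1                      # directory empty: nothing to reprocess
}
```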

Outside of Datastage:

After the sequence job has finished, another Unix script outside of DataStage checks the output directory and moves the file back to a location where the users can download the file they uploaded.

Here is the issue:

This job worked but was not capturing exceptions, so I added a Filter stage to the Process_File parallel job; it filters on a status returned from the web service. Everything works if the file is small (under roughly 240,000 records); if it's larger, the job runs and then stalls.

What we see in "From previous run" after we reset the sequence job is this line, repeated over and over:

From previous run
DataStage Job 1 Phantom 1468
DataStage Job 1 Phantom 1468
DataStage Job 1 Phantom 1468
DataStage Job 1 Phantom 1468
[... the same "DataStage Job 1 Phantom 1468" line repeats roughly 150 more times ...]

Also, a few files are always written to the &PH& directory and they also contain the above data.

We are banging our heads against this and want to know if anyone has seen it before. Any ideas on where to look for the issue, or logs or parameters that might help us figure out why this is stalling? Any thoughts on how to troubleshoot this would be so helpful. Thanks in advance for your comments and suggestions! - Beth
chulett
Charter Member
Posts: 43085
Joined: Tue Nov 12, 2002 4:34 pm
Location: Denver, CO

Post by chulett »

Welcome!

First I want to note that I've edited your post to replace all occurrences of "sequential job" with "sequence job" as that is the proper terminology. Secondly, which part of the process "stalls" - the parallel job? What do you do in that case, kill it? And does that also abort the Sequence job? Wondering why you are resetting it. Also curious what errors, if any, are in the Parallel job's log?
-craig

"You can never have too many knives" -- Logan Nine Fingers
eanolan
Premium Member
Posts: 4
Joined: Thu Aug 13, 2015 9:23 am

Post by eanolan »

Hi, thanks for editing that for clarity.

Secondly, which part of the process "stalls" - the parallel job? -- The job stalls while processing the file. The sequence job and parallel job sit at a status of Running for hours; once monitoring shows that neither the output file nor the exception file is growing, and no calls are being made to the back end by the web service, we stop the job.

What do you do in that case, kill it? Yes, we reset to get the "From previous run" entries and then recompile the jobs.

And does that also abort the Sequence job? We stop the sequence job first, and that aborts the parallel job.

Wondering why you are resetting it -- we reset because we can see from monitoring that nothing is actually running, even though the job status says Running.

Also curious what errors, if any, are in the Parallel job's log? No errors, and that is what is driving me crazy. It just acts as if it's running: no warnings, no errors. And unfortunately parallel jobs don't have the "From previous run" view.

Thanks!
chulett
Charter Member
Posts: 43085
Joined: Tue Nov 12, 2002 4:34 pm
Location: Denver, CO

Post by chulett »

First I'd suggest you not worry about the Sequence job and its phantoms at the moment. It is waiting for the Parallel job to complete so that's normal behaviour. Concentrate on the Parallel job... and my first suspicion would fall on the webservice call, especially since it seems to be volume related. Are you certain that the service you are calling can handle large volumes or a large number of continuous calls to it? Hopefully others can chime in on ways to determine or help test that.

Also I don't recall ever seeing a need to recompile anything in Production, resetting after an abort should be sufficient... or have you found something that requires you to recompile them? More of a curiosity than anything.
-craig

"You can never have too many knives" -- Logan Nine Fingers
eanolan
Premium Member
Posts: 4
Joined: Thu Aug 13, 2015 9:23 am

Post by eanolan »

Just wanted to put the solution in this post, as it may be helpful to others.

I need to elaborate on all the symptoms we were seeing. We had a Unix script running to kick off the job when a file was found in a directory for processing.

If the job was set to Finished and the script kicked off, we were getting the warning "Attempting to Cleanup after ABORT raised in job ..." in the logs.

This logging would repeat until the job was recompiled. Once it was compiled and the script fired off, the job would kick off as expected.
The repetitive phantom log entries were a symptom of the abort issue.

What we found and corrected was:

The issue occurred when a sequence job was set to Finished as one user and then dsjob was kicked off via a script under an admin account. Although the admin account should have power-user rights and override the user account, that was not happening; and since the job had been finished as the user, the admin account could not start the job.

Once the job was set to Finished by the admin account, the job would fire off.

Basically, what it came down to was permissions. So if you are getting an abort warning in a sequence job, try making sure that sequence job is being run with an admin account.
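For anyone scripting around the same problem, one defensive option is to check the job's state with dsjob before submitting, and reset it first when it is not runnable. This is a hedged sketch, not the exact commands we used: the project and job names are placeholders, the status strings matched are assumptions about what -jobinfo reports, and it assumes the script already runs under the admin account. DSJOB is parameterised so the logic can be tested with a stub.

```shell
# Default to the engine's dsjob binary; override DSJOB to test the logic.
DSJOB=${DSJOB:-/opt/IBM/InformationServer/Server/DSEngine/bin/dsjob}

# Reset the job if it was left in a non-runnable state (e.g. stopped
# under another user's session), then submit it normally.
run_with_reset() {
    proj=$1
    job=$2
    # -jobinfo reports the current status, e.g. "Job Status : STOPPED (2)"
    status=$("$DSJOB" -jobinfo "$proj" "$job" 2>/dev/null)
    case $status in
        *RUNNING*|*STOPPED*|*CRASHED*)
            # Put the job back into a runnable state before starting it.
            "$DSJOB" -run -mode RESET -wait "$proj" "$job"
            ;;
    esac
    "$DSJOB" -run -jobstatus "$proj" "$job"
}
```

Our actual fix was simply making sure the sequence job is always stopped/finished and launched under the admin account; the reset guard above is just one way to automate that check.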
chulett
Charter Member
Posts: 43085
Joined: Tue Nov 12, 2002 4:34 pm
Location: Denver, CO

Post by chulett »

Thanks for coming back and posting the resolution.
-craig

"You can never have too many knives" -- Logan Nine Fingers