Job hang

Post questions here relating to DataStage Server Edition, for such areas as Server job design, DS Basic, Routines, Job Sequences, etc.


Vishvas
Participant
Posts: 34
Joined: Sat Jun 21, 2003 3:52 am

Job hang

Post by Vishvas »

I have a job that reads from a sequential file and loads the rows into Teradata. The job hangs without any error. The only indication is a core dump (produced by uvsh) at the UNIX level. The only way out is to stop the job and start it again; once restarted, the job works properly. The hang occurs irregularly: after a change is made, the job will run for 2 to 3 weeks, and then it hangs again.

I have searched this forum and could not find a suitable solution to this problem. I don't have row buffering enabled. The configuration parameters are set properly (for example, MFILES is set to 200 and NFILE in UNIX is set much higher). Could someone help me solve this issue?

Regards,
Arun
ArndW
Participant
Posts: 16318
Joined: Tue Nov 16, 2004 9:08 am
Location: Germany
Contact:

Post by ArndW »

Arun,

you should talk with your UNIX admin and go through DBX using your uvsh image and the core file. This will let you do a stack and call trace and hopefully find out more about which part of your job is dying. Debugging a core file takes a bit of experience, a lot of patience and some luck but it ought to point you in the right direction.
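As a rough illustration only - the paths below are placeholders and the exact commands differ slightly between platforms - the session could look something like this:

    cd /path/to/DataStage/Projects/YourProject    # directory where the core file was written (example path)
    dbx $DSHOME/bin/uvsh core                     # load the uvsh executable together with the core file
    (dbx) where                                   # print the call stack at the moment of the abort
    (dbx) quit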

When a process aborts so that it generates a core file, it usually doesn't get around to processing its interrupt handlers; for a DataStage job this means it cannot write "I'm finished" or "I've aborted" to the status file. So the UNIX processes for a job might be gone, but the Director still shows the job as "running". This is a case where you can use the Director option "Clear Status File" to correct the status information. Another option is to go into the Designer and recompile the job; this will also reset the status file entry.
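If it is easier to script than to use the Director GUI, something along these lines should also work from the dsjob command line (MyProject and MyJob are placeholders - check the dsjob syntax on your release):

    $DSHOME/bin/dsjob -jobinfo MyProject MyJob           # show the status the engine currently has on record
    $DSHOME/bin/dsjob -run -mode RESET MyProject MyJob   # run the job in reset mode to clear the stale status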
stan_taylor
Charter Member
Charter Member
Posts: 14
Joined: Tue Mar 04, 2003 3:27 pm

Post by stan_taylor »

Arun,

You may have run into a system limit, like memory. Log in with the user id used to start the job and execute the ulimit command to see what limits have been set for that user. Then monitor the process to see if it runs into any of those limits. Even if they are unlimited, check memory utilization - a very common reason for the uvsh core dump is hitting the 2 GB memory limit, since DataStage is a 32-bit application. Hope this helps.
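As a starting point, something like this (the account name and PID are placeholders, and the option names can differ slightly between UNIX flavours):

    su - dsadm                             # log in as the user that starts the job (example account)
    ulimit -a                              # list the per-process limits in effect for that user
    ps -o pid,vsz,rss,args -p <uvsh_pid>   # watch the process size; a 32-bit uvsh dies near the 2 GB mark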
Vishvas
Participant
Posts: 34
Joined: Sat Jun 21, 2003 3:52 am

Post by Vishvas »

We are getting a uvsh core dump, but the phantom process still remains and we are able to stop the job using the dsjob API. Only one <defunct> process is associated with the phantom process of that job. How are uvsh and the phantom process related? When will uvsh create a core dump?
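For reference, this is roughly how we look at the processes (the PID is a placeholder):

    ps -ef | grep -i phantom                      # the DataStage phantom for the job
    ps -ef | awk -v p=<phantom_pid> '$3 == p'     # its children - this is where the single <defunct> entry shows up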

Thanks,
Arun
ArndW
Participant
Posts: 16318
Joined: Tue Nov 16, 2004 9:08 am
Location: Germany
Contact:

Post by ArndW »

Arun,

you have just repeated your initial query with a small addition, so the answer is going to be the same. A "core" dump is a file generated by the UNIX error handler that stores the memory image of a process, including all the stack and register information. uvsh does not create a core file; it is UNIX that does this upon a certain type of program abort.

You might try doing a "reset" of the job when this error occurs to see if it generates a log entry titled "From previous run"; but I don't think that in this case you will be getting additional information.

If you don't have a local person who knows how to use and interpret DBX output, it might be best to contact your support provider for assistance on this issue. From your explanation, your job merely calls the Teradata loader without doing any other processing, so the cause is most likely to be found within the loader stage.

Do you see any pattern in your hang/abort - does it always happen after approx. x minutes or y rows? Can your Teradata DBA see anything in the connection while the error is happening? Can you run a "truss -p {pid}" on the process to see if it is doing any system calls or other activity when the error happens?
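For that last check, something along these lines (the grep pattern and PID are only examples - adjust them to whatever your ps output shows for this job):

    ps -ef | grep "DSD.RUN"    # find the process(es) belonging to the hung job
    truss -p <pid>             # attach and watch for system-call activity
                               # (no output at all usually means the process is blocked on a read or a lock)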