How to determine if a job is hung?

abbhi · Post by **abbhi** » Fri May 20, 2016 1:45 am

Recently we have been facing issues with our DataStage jobs and database server.
A few of our jobs that used to take 1 hr to complete are now taking 1.30 and sometimes 3 hours to complete.

A few days back we had a scenario where a job had been running for more than 3 hours. It was showing a status of running but was not processing any data.

When I viewed the monitor for that particular job, it had not changed for a long time, so I concluded that the stage was not processing data. We killed the job and restarted it.

My concern is that this is not a reliable way to conclude that a job is hung. This particular job is low priority, so killing it was not a big deal, but a few jobs cannot simply be killed at any time in production.

My thinking is this, and please correct me if I am wrong: if I run the dsjob -jobinfo command to retrieve the status of a job at a particular point in time and it doesn't return a result, does that mean the job is hung? Or is there another way to conclude that a job is hung?

I don't want to check whether the database is getting updated. I am looking for a way within DataStage.

Thank you

qt_ky · Post by **qt_ky** » Fri May 20, 2016 5:49 am

Do a quick check on your DataStage server to make sure that none of the file systems are 100% full. Assuming that's OK, DataStage is working fine! Move on to the database.
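For anyone who wants to script that first check, here is a minimal sketch in Python using the standard `shutil.disk_usage` call. The function name and the 99% threshold are my own assumptions, and the used fraction it computes can differ slightly from what `df` reports, since `df` accounts for blocks reserved for root.

```python
import shutil

def full_filesystems(paths, threshold=0.99):
    """Return (path, used_fraction) for each path whose filesystem
    is at or above the given used fraction (threshold is an assumption)."""
    full = []
    for path in paths:
        usage = shutil.disk_usage(path)  # named tuple: total, used, free (bytes)
        used_fraction = usage.used / usage.total if usage.total else 0.0
        if used_fraction >= threshold:
            full.append((path, used_fraction))
    return full
```

You would pass in whichever mount points your install uses (scratch, resource disk, project and temp directories). An empty list back means none of them is at the threshold.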

DataStage jobs that hang are almost always waiting on a database lock to clear. If your database is Oracle (is it?), it is notorious for causing jobs to hang due to its poor optimizer/statistics handling. Anyway, always check the database!

If you cannot check the database or don't have knowledge or permissions, then you must involve your DBA. They can tell you specifically what your job is waiting on (i.e. why it appears to be hung). Let us know what you find out.

kduke · Post by **kduke** » Fri May 20, 2016 3:40 pm

If a job is hung, several things might help figure it out. If you are getting blocked at the database, the DBA can help figure that out. When a UNIX process hangs, its RAM size sometimes stays the same. If you have a memory leak, it grows until the job fails or runs out of rows. If CPU usage stays the same, that might be an indicator. Another indicator is that the number of rows stays the same in the job monitor.
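To make those indicators concrete, here is a rough sketch in Python. It assumes you collect the samples yourself at a fixed interval (row count from the job monitor, cumulative CPU seconds from something like `ps`). The `Sample` type, the function name and the three-sample window are all hypothetical, not anything DataStage provides.

```python
from typing import NamedTuple, Sequence

class Sample(NamedTuple):
    rows: int            # row count read from the job monitor
    cpu_seconds: float   # cumulative CPU time of the job's processes

def looks_stalled(samples: Sequence[Sample], window: int = 3,
                  cpu_epsilon: float = 0.01) -> bool:
    """True if neither rows nor CPU time moved across the last `window` samples."""
    if len(samples) < window:
        return False  # not enough evidence yet
    recent = samples[-window:]
    first = recent[0]
    return all(
        s.rows == first.rows
        and abs(s.cpu_seconds - first.cpu_seconds) <= cpu_epsilon
        for s in recent
    )
```

A True result only says that nothing moved during the window. A job blocked on a database lock looks the same from here, so it is a prompt to investigate, not proof of a hang.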

A job running longer is not necessarily a problem. If you have 3 times the number of rows, then it should run 3 times longer. If row counts are the same and there is no contention on the DB or UNIX, then you might have a problem.
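As a quick sanity check on run time versus volume, the arithmetic looks like this in Python. It assumes run time scales linearly with row count, which is only a rule of thumb, and the function name is made up for illustration.

```python
def expected_runtime_minutes(baseline_minutes: float, baseline_rows: int,
                             current_rows: int) -> float:
    """Scale a known baseline run time by the ratio of row counts (linear assumption)."""
    if baseline_rows <= 0:
        raise ValueError("baseline_rows must be positive")
    return baseline_minutes * current_rows / baseline_rows

# A 60-minute job that now receives 3x the rows would be expected to take about 180 minutes.
print(expected_runtime_minutes(60, 1_000_000, 3_000_000))  # 180.0
```

If the actual run time is far above this estimate with similar row counts, that is when contention is worth looking into.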

Contention on UNIX:
1. Are you paging? (out of RAM)
2. Are you waiting on disk? (disk bottleneck)
3. Are lots of UNIX processes running at the same time? (out of CPU?)
4. Are there network bottlenecks?

Database problems:
1. Updating on non-key fields? (should be updating on primary keys)
2. Other jobs updating the same table?
3. Database backup running? (nothing slows it down more than backups)
4. Have they run table stats recently? (in Oracle this is a big problem)
5. Record locks on my table?
6. Table locks on foreign key tables?

This should get you started solving your problem.

UCDI · Post by **UCDI** » Mon May 23, 2016 8:08 am

Run the job with a small number of records to see if it works. If it does, you have one of the already mentioned issues (out of memory/disk, stats, etc.), a general performance issue of some sort, or just a huge job.

- You can tie a peek or an extra file write stage to the job and watch it as it goes to see if it is making progress (just write something like 1 byte to a sequential file per record).

- You may be able to watch record counts live in DataStage, depending on the job type.

- You can watch the job in Director to see if it is advancing from job to job or stage to stage.
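If you go with the extra file write idea, checking that file can be scripted. This is a minimal sketch in Python, assuming the job appends to a sequential file as records pass through. The function name is hypothetical, and the path is whatever you configured in the stage.

```python
import os

def heartbeat_advanced(path: str, last_size: int):
    """Compare the file's current size with the size seen on the previous check.

    Returns (advanced, current_size) so the caller can pass current_size
    back in on the next check.
    """
    current_size = os.path.getsize(path)
    return current_size > last_size, current_size
```

Call it every few minutes and carry the returned size forward. Several checks in a row with no growth means no records are reaching that stage.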

Teej · Post by **Teej** » Mon May 23, 2016 10:33 am

The quickest way to observe the activity of the job under the hood is to use the pstack (Linux) or procstack (AIX) command along with enabling $APT_DUMP_SCORE and $APT_PM_SHOW_PIDS.

$APT_DUMP_SCORE will show you the structure of the job itself - see here:

http://www.ibm.com/support/knowledgecen ... ml?lang=en

$APT_PM_SHOW_PIDS will show you the process ids for the players (and section leaders).

The most common cause of hangs is database access. If you look at the processes associated with the database stages, you can divine what is being done. It does require a bit of an understanding of how the code is structured, which is not really documented (considered proprietary by IBM). But you can see a distinct difference between when it is accessing internal functions/classes and when it is accessing one of the database client APIs. If you can see those client functions going (use pstack/procstack a few times on the same PIDs), then chances are that the database is being slow in feeding data. So further review by your DBA and system administrator is warranted.

If it is truly hung, no matter how often you use pstack/procstack, things will remain mostly static for all processes. That definitely warrants opening a PMR ticket with IBM.
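A rough way to automate the repeated pstack/procstack comparison, sketched in Python. It assumes the tool is on the PATH and takes a PID as its only argument. Both function names are made up here, and an exact text match is stricter than "mostly static", so treat a False result loosely.

```python
import subprocess

def capture_stack(pid: int, tool: str = "pstack") -> str:
    """Run pstack (Linux) or procstack (AIX) on one PID and return its text output."""
    result = subprocess.run([tool, str(pid)], capture_output=True,
                            text=True, check=True)
    return result.stdout

def stacks_unchanged(snapshots) -> bool:
    """True if there are at least two snapshots and all are identical
    once leading and trailing whitespace is ignored."""
    cleaned = [s.strip() for s in snapshots]
    return len(cleaned) >= 2 and all(s == cleaned[0] for s in cleaned)
```

Take a few captures of the same PID (from the $APT_PM_SHOW_PIDS output) some seconds apart and pass them to `stacks_unchanged`. If every player process comes back unchanged on every pass, that matches the truly hung case described above.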

-T.J.