Tools for analysis and reporting of Ab-ends of batch jobs

Post questions here relative to DataStage Server Edition for such areas as Server job design, DS Basic, Routines, Job Sequences, etc.

G30ff
Participant
Posts: 1
Joined: Thu Jul 07, 2005 7:12 pm

Post by G30ff »

Hi All -

We are getting several ab-ends on some of our ETL jobs and some in Validation, causing delays on other jobs, queues and timeouts.

Some jobs have auto re-run, some don't.
The job logs are not very descriptive of the failures, and going through them manually to discover the probable cause would take a very long time. There used to be a lot more stats included in the runs, but for some reason these have been removed (we are putting them back in, but that won't fix things in the short term).

So, we'd like a tool to get through the logs, analyse what is happening and report on the failures.

Has anyone any experience of tools in this space and how effective they are?

Thanks

Geoff.
ray.wurlod
Participant
Posts: 54595
Joined: Wed Oct 23, 2002 10:52 pm
Location: Sydney, Australia

Post by ray.wurlod »

Welcome aboard! :D

In the menu at the top of this screen is a Search capability. You should be able to find answers among the more than 50000 posted already.

One suggestion is to open Director, disable display of Categories, and set a Filter so that only Fatals are displayed. Press Ctrl-T to open the Filter dialog in Status view.
IBM Software Services Group
Any contribution to this forum is my own opinion and does not necessarily reflect any position that IBM may hold.
vmcburney
Participant
Posts: 3593
Joined: Thu Jan 23, 2003 5:25 pm
Location: Australia, Melbourne

Post by vmcburney »

This is typical of DataStage sites that have been going for a while and have had multiple project teams cycling through. Each team has a different way of doing things and you end up with a mix of server - parallel, restart - no restart, error reporting - no error reporting.

I've seen sites go from two or three DataStage support staff fighting fires almost every day down to one part time support staff member who rarely has to troubleshoot.

First, I would start gathering ETL operational metadata into some tables for convenient reporting. ETL Stats can be downloaded from Ascential devnet; if it is run each day, it will store in a table the status of every job: whether it worked, failed or produced warnings.
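As a rough illustration of the kind of status table described above, here is a minimal Python sketch using SQLite. It is not the ETL Stats package and does not reflect its schema; the table name job_run_stats, its columns, and the status values OK/WARNING/FAILED are all assumptions made for the example.

```python
import sqlite3


def open_stats_db(path=":memory:"):
    """Open (or create) a database holding one status row per job per day.

    The table and column names are hypothetical, chosen for illustration.
    """
    conn = sqlite3.connect(path)
    conn.execute(
        """CREATE TABLE IF NOT EXISTS job_run_stats (
               run_date TEXT NOT NULL,
               job_name TEXT NOT NULL,
               status   TEXT NOT NULL
                        CHECK (status IN ('OK', 'WARNING', 'FAILED')),
               PRIMARY KEY (run_date, job_name)
           )"""
    )
    return conn


def record_status(conn, run_date, job_name, status):
    """Store the latest status of a job for a given day."""
    conn.execute(
        "INSERT OR REPLACE INTO job_run_stats VALUES (?, ?, ?)",
        (run_date, job_name, status),
    )
    conn.commit()


def failure_counts(conn):
    """Return (job_name, failures) pairs, most frequently failing first."""
    rows = conn.execute(
        """SELECT job_name, COUNT(*) FROM job_run_stats
           WHERE status = 'FAILED'
           GROUP BY job_name
           ORDER BY COUNT(*) DESC, job_name"""
    )
    return rows.fetchall()
```

Once a few days of history are in the table, a query like failure_counts shows which jobs abort most often, which is usually a better starting point than reading individual logs.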

Second, I would put in a message handler and demote all known and expected parallel job warnings to informational messages. This removes all the warning noise of parallel jobs and lets you find real warnings that need to be investigated.

Next, I would use a BASIC routine that pulls out error and warning messages and delivers them to support staff after each batch load. The DSGetLogSummary function retrieves log messages and can be filtered to get error and warning messages for a particular job execution. It will work for both parallel and server jobs. This routine can be executed after every job to retrieve the messages for that job, or it can be run once a day to loop through the complete job list.
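The filtering and reporting step can be sketched outside DataStage as well. The Python below is only an illustration: it assumes the log entries have already been retrieved and parsed into (job_name, severity, message) tuples, which is a hypothetical layout and not the actual output format of DSGetLogSummary. How the entries are obtained is not shown.

```python
from collections import defaultdict

# Severities worth sending to support staff; everything else is ignored.
SEVERITIES_OF_INTEREST = {"FATAL", "WARNING"}


def summarize_problems(records):
    """Group fatal and warning messages by job.

    records: iterable of (job_name, severity, message) tuples.
    This tuple layout is an assumption made for the example.
    """
    summary = defaultdict(lambda: defaultdict(list))
    for job_name, severity, message in records:
        sev = severity.upper()
        if sev in SEVERITIES_OF_INTEREST:
            summary[job_name][sev].append(message.strip())
    # Convert to plain dicts so the result is easy to print or compare.
    return {job: dict(by_sev) for job, by_sev in summary.items()}


def format_report(summary):
    """Render one line per message, fatals before warnings within each job."""
    lines = []
    for job in sorted(summary):
        for sev in ("FATAL", "WARNING"):
            for msg in summary[job].get(sev, []):
                lines.append(f"{job}: {sev}: {msg}")
    return "\n".join(lines)
```

The text returned by format_report could then be mailed to support staff after each batch load, or once a day for the full job list.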

Next would be individual job investigations. If you are getting unexpected and unpredictable aborts it could be due to overloading. Don't run large parallel jobs at the same time. Try to work out if aborted jobs are running at the same time as other jobs. Have a look at the job design for inefficiencies. Do some monitoring at the server level for I/O, RAM usage and deadlocks.
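To work out whether aborted jobs were running at the same time as other jobs, a simple interval-overlap check is enough once start and end times are available. The sketch below assumes each run is a (job_name, start, end) tuple with datetime values; that layout and the function name are hypothetical, and collecting the run times from DataStage is not shown.

```python
from datetime import datetime
from typing import Dict, List, Tuple

# One job execution: (job_name, start, end). Hypothetical layout.
Run = Tuple[str, datetime, datetime]


def overlapping_runs(
    aborted: List[Run], all_runs: List[Run]
) -> Dict[Tuple[str, datetime], List[str]]:
    """For each aborted run, list the other jobs that ran concurrently.

    Two runs overlap when each one starts before the other one ends.
    Runs that merely touch (one ends exactly when the other starts)
    are not counted as overlapping.
    """
    result = {}
    for a_job, a_start, a_end in aborted:
        concurrent = sorted({
            job
            for job, start, end in all_runs
            if job != a_job and start < a_end and a_start < end
        })
        # Key by job name and start time so repeated runs stay distinct.
        result[(a_job, a_start)] = concurrent
    return result
```

If the same jobs keep turning up alongside the aborts, that points towards overloading rather than a fault in the aborted job itself.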