
Fixing corrupted log files on reboot

Posted: Mon Mar 08, 2004 3:25 am
by roy
Hi All,
I've performed a search for corrupted log files, but I would like a more complete picture on the issue from the experienced experts here.
In one of the projects I'm involved in at this time, there are occasional reboots of the DS server machine due to power failures.
I know and agree that this situation is not acceptable, but it seems I'll have to live with it for a while till they fix it.
Now the problem is that when a power failure occurs while DS jobs are running, the logs get corrupted and need to be fixed before another run is made; the status of those jobs also needs to be reset.
As a temporary solution, until the power issues are resolved, I want to automatically fix all the log files and status files on startup.

AFAIK, I need to select NAME, JOBNO from DS_JOBS and use uvfixfile.exe on RT_LOG<JOBNO> for each job (except my own),
and also perform a CLEAR.FILE on RT_STATUS<JOBNO>.
The question is: do I need something else, or is this enough?
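
For concreteness, this is roughly the sequence I have in mind from the project account (TCL prompt), taking a made-up job number of 123 and assuming JOBNO is defined in the DS_JOBS dictionary; the exact uvfixfile.exe invocation seems to vary by release, so I would check the engine's bin directory before scripting anything:

    LIST DS_JOBS JOBNO          (map each job name to its JOBNO)
    CLEAR.FILE RT_STATUS123     (reset that job's status file)

and then run uvfixfile.exe against RT_LOG123 from the OS prompt for each job that was running when the power went.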

My plan is to build something that will be embedded in the system startup, after the DS services are up, and have it perform this operation before any further run of regular DS jobs is made.

Any insight on this would be appreciated :)

Thanks in advance,

Posted: Mon Mar 08, 2004 10:07 am
by roy
Hi,
I was wondering: if I'm not interested in log history, would a CLEAR.FILE on the RT_LOG## be enough?

Thanks in advance (again),

Posted: Mon Mar 08, 2004 3:32 pm
by ray.wurlod
Which hashed files in the repository you need to check depends on which hashed files were being written when the power failed. You're right that the most likely candidates will be the log files and the status files. But, if development work was being done at the time, there are also the config files and the DS_... files to check.
Clearing the files should eliminate any corruption caused by an interrupted write but you do lose information.
You can check for corruption in hashed files using uvfixfile or fixtool from the operating system command line, so that you know which ones need possible repair.
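
If you want to check a single job by hand first, one rough test (only a sketch, not a guarantee; 123 is a made-up job number) is to COUNT each runtime file from the TCL prompt and watch for group format errors in the output, then point uvfixfile or fixtool at anything that complains:

    COUNT RT_LOG123
    COUNT RT_STATUS123
    COUNT RT_CONFIG123

A healthy file simply reports its record count; a damaged one will usually produce errors while its groups are scanned.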

Posted: Mon Mar 08, 2004 5:43 pm
by kduke
You can also corrupt the DICT side of RT_LOGxx because it keeps the next id.

Posted: Mon Mar 08, 2004 8:08 pm
by ray.wurlod
No it doesn't. The next event number is kept in a control record called //SEQUENCE.NO in the data portion.
There are two other control records in a log file, //PURGE.SETTINGS and //JOB.STARTED.NO, which is why it's never a good idea to use CLEAR.FILE on DataStage logs.
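
You can see these for yourself from the TCL prompt (a sketch; 123 is a made-up job number, and the wildcard syntax assumes standard UniVerse RetrieVe):

    LIST RT_LOG123 WITH @ID LIKE "//..."

which should list //SEQUENCE.NO, //PURGE.SETTINGS and //JOB.STARTED.NO sitting in the data portion alongside the ordinary event records.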

Posted: Tue Mar 09, 2004 9:37 am
by roy
Hi,
Thanks guys :).
Actually there is no development done there; it's a production system.
The thing is that I have one control job, and it runs sequence and server jobs that are multi-instance.
When a power failure occurs, log files get corrupted,
and after the machine comes back up this disrupts the normal flow of things and messes it all up.
Would Ken's compile routine be more effective in this case?
I got some conflicting answers about multi-instance jobs and their RT_LOG & RT_STATUS files.
Since only jobs that were running when the power went down are candidates for this problem, I thought of checking their status, but then thought it might be simpler to compile the main multi-instance job, since all together there are 40 or so jobs that run in multiple instances.
Do you have any tips on handling multi-instance jobs in this case?
Thanks,

Posted: Tue Mar 09, 2004 9:52 am
by kcbland
The kind of hard crash you are describing is tricky to recover from programmatically. Who watches the watcher? If the main controlling job itself crashes, corrupting its log, status, and config files, then how does that get automatically rectified?

I think in the event of a catastrophic failure, such as a reboot during a run, you should simply sweep the system. I hope you see the wisdom behind building an ETL application that stages load-ready data and defers all loads until the transforms are done, so it can simply load the result sets. Not only is this easier to do and amenable to bulk loading, restarts, etc., but it also won't leave your target in a semi-updated state that is more difficult to recover. That being said, I think you should get your hands on a programmatic recompile tool and recompile all of your jobs. You mentioned using one supplied by me.

In either case, you should consider a system-wide log purge using CLEAR.FILE, since that doesn't do a record-by-record delete but works more like a "cat /dev/null > file" operation. Your log purge setting row is actually commingled with the log data, so if the log is corrupted this setting is unrecoverable anyway. I have a utility for mass-setting the auto-log-purge setting if you are interested. If I were you, I'd write a Batch job to clear the status and config file for every job as well. So to recap: a utility Batch job that sweeps all jobs and clears their status, log, and config files. Then get the log purge setting utility I mentioned to mass-set the lost purge settings again.
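
Something along these lines is what I mean by a sweep: a Batch job in the project that walks DS_JOBS and clears the runtime files for every job. This is only a sketch of the idea; the field position of JOBNO in DS_JOBS and the job name "SweepAllJobs" are assumptions you would need to verify against your own project before relying on it.

    * Sketch of a sweep Batch job: clear the log, status and config
    * hashed files of every job in the project after a hard crash.
    OPEN "DS_JOBS" TO F.JOBS ELSE STOP "Cannot open DS_JOBS"
    SSELECT F.JOBS TO 1
    LOOP
       READNEXT JobName FROM 1 ELSE EXIT
       IF JobName = "SweepAllJobs" THEN CONTINUE  ;* skip this utility job itself (hypothetical name)
       READ JobRec FROM F.JOBS, JobName THEN
          JobNo = JobRec<5>                       ;* JOBNO - field position is an assumption
          IF JobNo MATCHES "1N0N" THEN
             EXECUTE "CLEAR.FILE RT_LOG" : JobNo
             EXECUTE "CLEAR.FILE RT_STATUS" : JobNo
             EXECUTE "CLEAR.FILE RT_CONFIG" : JobNo
          END
       END
    REPEAT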

I hope you also see why it is paramount to track job execution history outside DataStage, as its own internal logging structures are sensitive to more than just hard system crashes and corruption: if you run the project out of disk space, the result is the same as if you had kicked out the power plug.

Posted: Tue Mar 09, 2004 10:03 am
by roy
Thanks Ken,
Actually my job should have no problem rerunning in 97% of cases, and for the remaining 3% I have a job that reprocesses everything in two hours or so.
I want to get something clear, if I may: in case an RT_LOG file is corrupted, will a CLEAR.FILE on it and on the RT_STATUS do the job?
And another thing: as far as I understood, the RT_LOG file is shared by all instances of a multi-instance job, is that so? And the RT_STATUS as well?
(I'd rather sound dumb or stupid and get it 100% right than have a 1% doubt and fail in my task ;))

Thanks again,

Posted: Tue Mar 09, 2004 3:03 pm
by ray.wurlod
CLEAR.FILE on RT_LOG and RT_STATUS is highly likely to clear any logical corruption. This is not, however, 100% guaranteed, and it definitely is not guaranteed to fix any physical corruption (for example, a bad spot on the disk).

It also means you lose the control records. DataStage will re-create the control records in the log file as needed, but you will lose any job-specific purge settings.

To automate this checking process on reboot, you should create it as a BAT file and arrange for it to execute once DataStage has restarted.
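
That BAT file can be very small. For example (purely a sketch; the install path, project name and "SweepAllJobs" job name are placeholders):

    rem fix_logs_on_boot.bat - run the clean-up Batch job once the DataStage services are up
    rem (the engine path below is an assumption - point it at your own install)
    cd /d C:\Ascential\DataStage\Engine\bin
    dsjob -run -wait MyProject SweepAllJobs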

The one RT_LOGnn file (hashed file) is shared by all instances of the job.

Posted: Tue Mar 09, 2004 3:11 pm
by roy
Thanks Ray,
For some reason my support provider said Ascential says there are separate RT_LOG files where multi-instance jobs are concerned, so I assumed the same for STATUS as well; I remember you mentioning this once or more here.
Since Ascential supposedly said so, I'll check this on a new, clean project ASAP.

Thanks again :),

Posted: Tue Mar 09, 2004 3:25 pm
by ray.wurlod
The advice you have received is, quite simply, wrong.

You get different views (one per instance) in the Director log view. This may have confused your support provider. (If your support provider can't understand the concept of a view, maybe you have another problem!)

However, there is only one RT_LOGxx file for all instances.

Posted: Tue Mar 09, 2004 6:26 pm
by kduke
Roy

The files are the same. It does keep separate records for each instance. There is a field in RT_LOG which tells you which instance a record belongs to. RT_STATUS is more complicated. If you need to know, I can look it up in my code for DsWebMon.

Posted: Thu Mar 11, 2004 2:38 pm
by roy
Thanks :),
I tested this on a clean project, and as Ray said there is only one physical set of RT_... files for a multi-instance job :).
(tested on version 6)

Posted: Thu Mar 11, 2004 6:31 pm
by kduke
That is what I was trying to say. It separates the instances in the data of each record, either in the key values or in a field.

Posted: Mon Mar 15, 2004 10:35 am
by roy
Hi,
Just wanted to fill you in: the CLEAR.FILE on the RT_LOGnn & RT_STATUSnn files did the job :)
Thanks All,