Job Sequence checkpoints - When are they lost ?

cdp · Post by **cdp** » Sun Nov 06, 2016 8:47 pm

Hi,

Let's say that:

- I stop a running but restartable Job Sequence (checkpoints enabled for each of its steps) from the Director client.
- Then I shut down the Engine server.
- Then I shut down the Unix machine.
- After a few hours, I restart the Unix server and the Engine server.
....

Questions:

1) When I then go back to the Director client, will I be able to restart the Job Sequence from where it was stopped (from its last checkpoint) ?

2) I know that the checkpoints are lost if the Job Sequence is reset. Is that the only way to clear/lose the checkpoints ?

3) Out of curiosity, where is the checkpoint information stored ? XMETA ?

Thanks for the help

Mike · Post by **Mike** » Mon Nov 07, 2016 7:28 am

1) Yes, but be aware that as soon as the job sequence stops running, it will no longer record any checkpoints. So if you have any activities that were started by the job sequence, they will still be running and will not record a checkpoint when they finish.

2) Compiling a job sequence will also remove its checkpoints.

3) I've never been concerned about where they are stored. XMETA seems highly unlikely to me. My guess would be the RT_STATUS table for the job.

Mike

FranklinE · Post by **FranklinE** » Mon Nov 07, 2016 8:07 am

I take your scenario a bit differently. What you describe is a human-controlled system incident, sort of like the old days when the janitor tripped over the power cord and pulled it out of the wall socket.

If you are initiating the actions as you describe, you are (and I write this not knowing why you are concerned about it all) conducting an undisciplined shutdown.

Checkpoints are set and cleared by the job sequence. For a Job Activity stage, a disciplined shutdown would start with the parallel job(s). Stopping them would then trigger the restartability function under the checkpoint, and the job sequence would stop without further action. Once everything is brought back up, you would need to issue a restart to the job sequence, which would then take care of any parallel job resets.
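As a rough illustration of that restart behaviour, here is a toy Python model. It is not DataStage code: `ToySequence`, its methods and the activity names are all made up for the example. It assumes that a checkpoint is recorded only for an activity that finishes successfully, and that checkpoints are cleared when the whole sequence completes or when it is reset.

```python
from dataclasses import dataclass, field
from typing import Callable, List, Set


@dataclass
class ToySequence:
    """Toy stand-in for a restartable job sequence (hypothetical, not a DataStage API)."""
    activities: List[str]
    checkpoints: Set[str] = field(default_factory=set)

    def run(self, execute: Callable[[str], bool]) -> List[str]:
        """Run activities in order, skipping any that already have a checkpoint.

        Returns the names executed in this run. Stops at the first
        activity whose execute() call returns False.
        """
        executed = []
        for name in self.activities:
            if name in self.checkpoints:
                continue  # finished in an earlier run, so it is skipped
            executed.append(name)
            if not execute(name):
                return executed  # stopped or failed: no checkpoint recorded
            self.checkpoints.add(name)
        # Assumption: a sequence that finishes cleanly clears its checkpoints.
        self.checkpoints.clear()
        return executed

    def reset(self) -> None:
        """Model a reset (or recompile): every checkpoint is discarded."""
        self.checkpoints.clear()


if __name__ == "__main__":
    seq = ToySequence(["extract", "transform", "load"])
    # First run: "transform" is stopped, so only "extract" is checkpointed.
    print(seq.run(lambda name: name != "transform"))
    # Restart: the sequence resumes at "transform".
    print(seq.run(lambda name: True))
```

In this model, the first run executes `extract` and `transform` and the restart executes `transform` and `load`. Calling `reset()` between the two runs would make the second run start again from `extract`.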

In our batch environment, we use Control-M, and job aborts are handled automatically when the Control-M job is rerun. The interface is a script, and we let the checkpoint issue the reset as needed. In the rare cases where a manual reset is needed, we use Director.

The same should hold true for other stages, like Execute Command. Stop the underlying process, and let your checkpoint setting cover the rest.

In my experience (thankfully a very rare thing), a system-level event causing a shutdown is unpredictable. Sometimes it follows a bottom-up path, and the checkpoints are valid for restart. Sometimes, as Mike mentions, it's more top-down and some processes remain running. In the latter case, you usually have no choice but to reset everything and start the job(s) from the top.

cdp · Post by **cdp** » Mon Nov 07, 2016 2:26 pm

Thank you for your answers.

Franklin, thanks for the reminder to stop the parallel job instead of the job sequence.

There's a hardware change, and I have been left with no choice but to stop DataStage if it is still running.