Multi Instance Jobs Failing and/or Hanging

Post questions here relative to DataStage Enterprise/PX Edition for such areas as Parallel job design, Parallel datasets, BuildOps, Wrappers, etc.

Moderators: chulett, rschirm, roy

fridge
Premium Member
Posts: 136
Joined: Sat Jan 10, 2004 8:51 am

Multi Instance Jobs Failing and/or Hanging

Post by fridge »

This is a new environment (a migration from 8.1 on AIX), although we have migrated several other similar applications to 9.1 on Linux without issue.

We have a very simple script that resets and fires off DS multi-instance jobs, and it is causing a few problems. I have stripped down and recreated the problem so that the multi-instance job used is very simple (a row generator and a peek stage), and reduced the script to just a very simple loop and a couple of dsjob commands.

for (( c=1; c<=25; c++ ))
do
    dsjob -server :31539 -run -jobstatus -mode RESET -wait dpr_tst1 DSTest4.$c
    dsjob -server :31539 -run dpr_tst1 DSTest4.$c
    if [ $? -ne 0 ]; then
        echo "Job run failed"
        exit 1
    fi
    echo "Job run successful"
done

Occasionally the instances run through successfully, but at other times we get one of two problems:
1. Several of the invocations run through, but then a reset fails to open the project and issues 'Status code = 39202'. The following invocation then hangs while attempting to access a named pipe/FIFO.
2. After several successful invocations, the reset appears to hang and then the connection just drops.

There appears to be no obvious pattern. It can happen regardless of how few/many invocations we try to run.

Has anyone come across such a problem?
chulett
Charter Member
Posts: 43085
Joined: Tue Nov 12, 2002 4:34 pm
Location: Denver, CO

Post by chulett »

I'd double-check that your kernel settings are in line with what the new version and platform expect. I would also suggest a small sleep between the reset and the run; in my experience it can sometimes take a moment for the reset to actually finish up post-return, and we saw some odd errors before we put a "sleep 5" between them.
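
The suggested pause can be sketched as a small helper around the original commands (a hedged sketch, not a tested fix; the server, project, and job names are taken from the script earlier in the thread):

```shell
#!/bin/bash
# Hedged sketch of the reset / pause / run sequence for one instance.
# The sleep gives the reset a moment to finish up after dsjob returns.
run_instance() {
    local inst="$1"
    dsjob -server :31539 -run -jobstatus -mode RESET -wait dpr_tst1 "DSTest4.$inst"
    sleep 5
    dsjob -server :31539 -run dpr_tst1 "DSTest4.$inst" || return 1
}
```

It would slot into the same loop, e.g. `run_instance $c || { echo "Job run failed"; exit 1; }`.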
-craig

"You can never have too many knives" -- Logan Nine Fingers
PaulVL
Premium Member
Posts: 1315
Joined: Fri Dec 17, 2010 4:36 pm

Post by PaulVL »

Look to your Auto Job Log Purge settings.

See if you are doing it based upon days or instances.

Seek to use days if you can.

Giving a quantity there often breaks multi-instance jobs if your concurrent number of jobs is greater than your auto-purge setting. :)

I'll let you think about how that would come into play.
fridge
Premium Member
Posts: 136
Joined: Sat Jan 10, 2004 8:51 am

Post by fridge »

Thanks for the suggestions but the issue does not seem to be related to these.

I attempted to recreate the problem on another environment (same OS, DS version, etc.) but there was no issue there. The only difference was in the MEMOFF settings, which were ...

DMEMOFF = 0x0
PMEMOFF = 0x0
CMEMOFF = 0x0
NMEMOFF = 0x0
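
For pulling these values out to compare two environments side by side, a small helper could look like this (a hedged sketch; it assumes the settings live in the engine's uvconfig file, as on a standard install, with the DSEngine directory passed as the argument):

```shell
#!/bin/bash
# Hedged sketch: print the *MEMOFF tunables from an engine's uvconfig
# so two environments can be diffed. Pass the DSEngine directory
# (e.g. the value of $DSHOME).
memoff_settings() {
    grep 'MEMOFF' "$1/uvconfig"
}
```

Note that on a standard install, uvconfig changes only take effect after regenerating with `bin/uvregen` and restarting the engine, and the MEMOFF values themselves are generally only to be changed under IBM Support's direction.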

I reconfigured the failing environment to match, which did the trick. I do, however, now get the following problems ...

Whilst running LIST.READU ...
Abnormal termination of DataStage.
Fault type is 11. Layer type is Command Language.
Segmentation fault

I also sometimes get a DS project-opening issue in a couple of other jobs, though with a different status code to the original issue.

Getting the parameter list of the job...
ERROR: Failed to open project

Status code = 81015

Any ideas?

Thanks & Regards.
chulett
Charter Member
Posts: 43085
Joined: Tue Nov 12, 2002 4:34 pm
Location: Denver, CO

Post by chulett »

So... did you actually confirm your kernel settings were appropriate? Hard to know for certain when you make a blanket statement like that. I'd also be curious if you compared them between the two "same OS" environments.

And pretty much every discussion of the MEMOFF settings here notes that you should only change them under the direction of IBM Support; they are not something I would consider "reconfiguring" on my own.

Have you involved support yet?
-craig

"You can never have too many knives" -- Logan Nine Fingers
fridge
Premium Member
Posts: 136
Joined: Sat Jan 10, 2004 8:51 am

Post by fridge »

Apologies - I checked them against a sister environment where this issue does not seem to occur and they matched.
fridge
Premium Member
Posts: 136
Joined: Sat Jan 10, 2004 8:51 am

Post by fridge »

One other thing to note is that we now see the deadlock daemon log file (dsdlockd.log) being constantly updated, which we had not seen before. From what I have found, this seems to happen when certain files that should be owned by root are not. That does not, however, appear to be the case here ...

ls -al | grep rws
-rws--x--x 1 root dsusrgrp 61512 Nov 9 2012 DBsetup
-rwsr-x--x 1 root dsusrgrp 1568632 Nov 9 2012 dsdlockd
-rwsr-x--x 1 root dsusrgrp 1533752 Nov 9 2012 dslictool
-rws--x--x 1 root dsusrgrp 12192 Nov 9 2012 dstskup
-rwsr-x--x 1 root dsusrgrp 1545128 Nov 9 2012 list_readu
-rwsr-x--x 1 root dsusrgrp 1532960 May 9 15:17 load_NLS_shm
-rwsr-x--x 1 root dsusrgrp 52544 Nov 9 2012 uv
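
The root-ownership check above can be made more direct by searching for any setuid files that are not owned by root (a hedged sketch; run it against the engine's bin directory, or wherever the ls above was taken):

```shell
#!/bin/bash
# Hedged sketch: list setuid files in a directory that are NOT owned
# by root -- any output here would point at the ownership problem
# described above.
nonroot_setuid() {
    find "$1" -maxdepth 1 -perm -4000 ! -user root
}
```

An empty result means all setuid files in that directory are root-owned, matching the ls output above.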