
The connection is broken (81002)

Posted: Wed Oct 24, 2007 9:39 pm
by chulett
Not your typical 81002 so please bear with me.

Scenario - new server setup just for DataStage, which installed and was running without issue. A standard 'root' install, was able to stop/restart DataStage from the command line as 'dsadm' like normal. Main 'ade' shared memory segment was owned by root. All normal stuff.

Today - they decide to add two more CPUs and, since the box will be rebooted, tweak two kernel parameters that were below minimums - MAXDSIZE and MAXTSIZE from what I recall. Box goes down, comes back up with the new configuration and I can no longer connect to DataStage from a client. From the command line, yes, but the client throws the dreaded 81002:

Code:

Failed to connect to host: XXXXX, project: XXX
(The connection is broken (81002))

This happens immediately, either when clicking OK or when trying to pull down the Project list in the connection dialogue. Then it gets better. Attempts to shutdown DataStage fail:

Code:

/opt/datastage/Ascential/DataStage/DSEngine $ ./bin/uv -admin -stop 
Unable to remove the following shared memory segment(s) during shutdown:
m          8 0xadec7512 --rw-rw-rw-      root      root  6500 16117
1 error(s) encountered during shutdown procedure.
DataStage Engine 7.5.1.2 instance "ade" may be in an inconsistent state.

Suddenly, in spite of the fact that root always owns that segment on all of my DataStage servers, this one can only be successfully shut down by root. Once down, I can then restart it using the -start option as dsadm and everything comes back up with the segment now owned by dsadm - and now I can connect from the client. So the process I'm suddenly saddled with after a host reboot is:

1. Confirm DataStage is up after the reboot.
2. Ask 'root' to stop DataStage.
3. Confirm DataStage is down and all resources released.
4. Restart DataStage as 'dsadm'.
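
For what it's worth, the sequence above can be sketched as a dry-run shell script. DSHOME and the su invocations here are assumptions based on the paths quoted earlier in this post; the commands are only echoed until you flip DRYRUN off after reviewing them:

```shell
#!/bin/sh
# Dry-run sketch of the post-reboot workaround. DSHOME is an assumed
# install path; set DRYRUN=0 only after reviewing each command.
DSHOME=/opt/datastage/Ascential/DataStage/DSEngine
DRYRUN=1
run() { if [ "$DRYRUN" = 1 ]; then echo "+ $*"; else "$@"; fi; }

run ps -ef                                        # 1. confirm dsrpcd is up
run su - root -c "$DSHOME/bin/uv -admin -stop"    # 2. only root can stop it
run ipcs -m                                       # 3. confirm 0xade* segments gone
run su - dsadm -c "$DSHOME/bin/uv -admin -start"  # 4. restart as dsadm
```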

Any ideas on what could have changed that got me to where I am today? :?

Posted: Wed Oct 24, 2007 10:44 pm
by ArndW
Craig - I don't know what could be happening on your system, and I guess you don't have the luxury of removing the CPUs and resetting the configuration parameters to see if they really were the cause of your woes.

I think that the base shared segment is not removed by an ipcrm -m {segment} call, but only when the last attached process stops using it. Therefore I think that your error message is due to some process still being attached. Could you try an "ipcs -m | grep 0xade" before attempting your next shutdown, to identify any segments that might still be around?
Did you check that after the restart your background processes are all running? Also, do you have a simple non-destructive job that you could run from the command line - just to ensure that your only issue is client connectivity and not the actual DS engine itself?
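
That check can be scripted. A minimal sketch, run here against a captured two-line sample (the sample lines are hypothetical, patterned on the listings later in this thread) so the parsing is visible; on the server you would pipe "ipcs -m" in directly:

```shell
# Hypothetical captured output from "ipcs -m" on a healthy server.
sample='m  147463 0xadec7512 --rw-rw-rw-      root      root
m 1120268 0xadebf4f7 --rw-rw-rw-    dsuser    dstage'

# All engine segments share the 0xade key prefix:
printf '%s\n' "$sample" | grep 0xade

# Owner of the base segment (key 0xadec7512) is field 5 of the listing:
owner=$(printf '%s\n' "$sample" | awk '/0xadec7512/ {print $5}')
echo "base segment owner: $owner"
```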

Posted: Thu Oct 25, 2007 12:13 am
by chulett
They did reboot under the 'old' kernel again to no effect, the same problems were noted and resolutions required. Didn't pull out the new CPUs as we didn't see how that would be relevant. So, with this 'work around' in place we booted back under the new kernel as it needs to be pressed into service tonight.

One of my 'standard' checks is "ipcs -m |grep ade" as you noted and it showed the base segment as it always does. No processes other than "dsrpcd" were running and it was "listening" as it should have been. And yes, I could run jobs and get into dssh / DS.TOOLS during this time; the only thing I couldn't seem to do was connect from any client tool. Oh, that and shut down the server as dsadm. :? :evil:

I also could not remove the segment with ipcrm -m as dsadm - got a permission denied message. Only root could remove it or successfully shut DataStage down. Odd that starting DS as 'dsadm' on all of my other servers results in the base segment being owned by root, but here it ends up being owned by dsadm - and that seems to be what allows it to 'work' for me; only then can I connect from my client. However, I could do all of this like normal as dsadm before the reboot.

Very odd.

Posted: Fri Oct 26, 2007 8:53 am
by chulett
Another interesting observation: In spite of logging onto the box and running all jobs as our normal 'runtime' userid of 'dsuser', they in fact now run as dsadm instead. :?

:!: I would really appreciate any suggestions on things I can check to determine what in the heck is suddenly different about this installation.

In case this helps... first, here is an ipcs dump from one of our long-standing servers:

Code:

pdata01: /home/dsuser $ ipcs -m |grep ade
m  147463 0xadec7512 --rw-rw-rw-      root      root
m 1120268 0xadebf4f7 --rw-rw-rw-    dsuser    dstage
m 1101837 0xadeba971 --rw-rw-rw-    dsuser    dstage
m  797710 0xadebc086 --rw-rw-rw-  dsreader    dstage
m 1925135 0xadeb9e59 --rw-rw-rw-    dsuser    dstage
m 2304016 0xadeb9e58 --rw-rw-rw-    dsuser    dstage
m 3226642 0xadeb9e9d --rw-rw-rw-    dsuser    dstage
m    8211 0xadebd518 --rw-rw-rw-    dsuser    dstage
m   12308 0xadebf198 --rw-rw-rw-    dsuser    dstage
m  198677 0xadeb9e48 --rw-rw-rw-    dsuser    dstage
m   12310 0xadeb9e9c --rw-rw-rw-    dsuser    dstage
m 3722263 0xadebe7cd --rw-rw-rw-    dsuser    dstage
m 1090584 0xadebcacd --rw-rw-rw-    dsuser    dstage
m 3855386 0xadebe468 --rw-rw-rw-  dsreader    dstage
m 3937308 0xadeb9fe0 --rw-rw-rw-    dsuser    dstage
m 3995677 0xadeb9e91 --rw-rw-rw-    dsuser    dstage
m 3476510 0xadebf04a --rw-rw-rw-    dsuser    dstage
m  689183 0xadebd4b4 --rw-rw-rw-  dsreader    dstage
m 4284448 0xadeb9dd7 --rw-rw-rw-    dsuser    dstage
m 3400737 0xadeb9e87 --rw-rw-rw-    dsuser    dstage
m 3361826 0xadeb9e85 --rw-rw-rw-    dsuser    dstage
m 3714083 0xadeb9d52 --rw-rw-rw-    dsuser    dstage
m 3308580 0xadeb9e74 --rw-rw-rw-    dsuser    dstage
m 3051557 0xadebeea7 --rw-rw-rw-    dsuser    dstage
m 2844710 0xadeb9e69 --rw-rw-rw-    dsuser    dstage
m    4136 0xadeb9e65 --rw-rw-rw-    dsuser    dstage
m 2220073 0xadeb9e60 --rw-rw-rw-    dsuser    dstage
m    4138 0xadeb9e52 --rw-rw-rw-    dsuser    dstage
m 1812523 0xadeb9e56 --rw-rw-rw-    dsuser    dstage
m    2092 0xadeb9e50 --rw-rw-rw-    dsuser    dstage
m 2158637 0xadeb9e47 --rw-rw-rw-    dsuser    dstage
m    2095 0xadeb9e42 --rw-rw-rw-    dsuser    dstage
m    2096 0xadeb9e40 --rw-rw-rw-    dsuser    dstage
m 2092081 0xadeb9e35 --rw-rw-rw-    dsuser    dstage
m    2098 0xadeb9e2e --rw-rw-rw-    dsuser    dstage
m    2099 0xadeb9e0d --rw-rw-rw-    dsuser    dstage
m 2063415 0xadeb9711 --rw-rw-rw-    dsuser    dstage

And now the same dump on our troublesome server where jobs are running as dsuser and I am logged onto Director as dsuser:

Code:

petl02: /home/dsuser $ ipcs -m |grep ade
m       1032 0xadec7512 --rw-rw-rw-     dsadm    dstage
m      12299 0xadebe148 --rw-rw-rw-     dsadm    dstage
m     138252 0xadebb489 --rw-rw-rw-     dsadm    dstage
m      30734 0xadebd886 --rw-rw-rw-     dsadm    dstage
m      55311 0xadebd883 --rw-rw-rw-     dsadm    dstage
m      37904 0xadebaafa --rw-rw-rw-     dsadm    dstage
m     341009 0xadebd863 --rw-rw-rw-     dsadm    dstage
m     220178 0xadebaaf0 --rw-rw-rw-     dsadm    dstage
m       5157 0xadebaae3 --rw-rw-rw-     dsadm    dstage
m       4134 0xadebaad9 --rw-rw-rw-     dsadm    dstage
m       4135 0xadebaacc --rw-rw-rw-     dsadm    dstage
m       3112 0xadebaac2 --rw-rw-rw-     dsadm    dstage
m       2089 0xadebaaa3 --rw-rw-rw-     dsadm    dstage
m       4138 0xadebaa97 --rw-rw-rw-     dsadm    dstage
m       5163 0xadebaa8a --rw-rw-rw-     dsadm    dstage
m       3116 0xadebaa80 --rw-rw-rw-     dsadm    dstage
m       3117 0xadebaa65 --rw-rw-rw-     dsadm    dstage
m       4142 0xadebaa5b --rw-rw-rw-     dsadm    dstage
m       3119 0xadebaa20 --rw-rw-rw-     dsadm    dstage
m       3120 0xadeba9de --rw-rw-rw-     dsadm    dstage
m       3121 0xadeba980 --rw-rw-rw-     dsadm    dstage
m       3122 0xadeba967 --rw-rw-rw-     dsadm    dstage
m       3123 0xadeba957 --rw-rw-rw-     dsadm    dstage
m       3124 0xadeba94d --rw-rw-rw-     dsadm    dstage
m       3125 0xadeba90c --rw-rw-rw-     dsadm    dstage
m       3126 0xadeba8f4 --rw-rw-rw-     dsadm    dstage
m       3127 0xadeba8e7 --rw-rw-rw-     dsadm    dstage
m       3128 0xadeba8da --rw-rw-rw-     dsadm    dstage
m       3129 0xadeba89d --rw-rw-rw-     dsadm    dstage
m       1082 0xadeba882 --rw-rw-rw-     dsadm    dstage
m       1083 0xadeba874 --rw-rw-rw-     dsadm    dstage
m       3132 0xadeba86a --rw-rw-rw-     dsadm    dstage
m       2109 0xadeba83e --rw-rw-rw-     dsadm    dstage
m       2110 0xadeba817 --rw-rw-rw-     dsadm    dstage
m       2111 0xadeba804 --rw-rw-rw-     dsadm    dstage
m       2112 0xadeba7fa --rw-rw-rw-     dsadm    dstage
m       2113 0xadeba7dc --rw-rw-rw-     dsadm    dstage
m       2114 0xadeba7bf --rw-rw-rw-     dsadm    dstage

:evil:

My only consolation is the fact that I now have 2 more CPUs, and the stray memory errors we were getting are gone now that the kernel parameters have been updated properly.

Posted: Fri Oct 26, 2007 4:49 pm
by ArndW
Craig - I don't have anything to add; the error does seem very odd. What if you were to have the admin people restart DataStage as 'root' - do any of the symptoms persist (jobs running under dsadm, connectivity)? I would tend to think that something else has changed with regard to user permissions.
Also, did the system administrators perhaps apply some OS patch and then forget they had done so? Perhaps you could compare the release level with one of the boxes where things are still running the way you expect.

Posted: Fri Oct 26, 2007 5:22 pm
by chulett
ArndW wrote:Craig - I don't have anything to add; the error does seem very odd. What if you were to have the admin people restart DataStage as 'root' - do any of the symptoms persist (jobs running under dsadm, connectivity)? I would tend to think that something else has changed with regard to user permissions.

Tried that earlier. Any 'root' startup on this box results in the same problem - unable to connect from any client. [sigh]

ArndW also wrote:Also, did the system administrators perhaps apply some OS patch and then forget they had done so? Perhaps you could compare the release level with one of the boxes where things are still running the way you expect.

Don't think so but I've asked, not sure how quickly anyone will jump on it for me.

Posted: Sat Oct 27, 2007 1:12 am
by ArndW
Craig - doing a 'uname -a' might show differences in the 3rd column, without having to make the sysadm people do a more in-depth check.
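That comparison can be scripted without sysadmin help. A sketch, using hypothetical sample strings in place of the real output (on the actual hosts you would capture "uname -a" from each box):

```shell
# Hypothetical `uname -a` output from two boxes; field 3 is the OS release.
good='HP-UX pdata01 B.11.11 U 9000/800 1234567890 unlimited-user license'
bad='HP-UX petl02 B.11.11 U 9000/800 1234567891 unlimited-user license'

rel_good=$(echo "$good" | awk '{print $3}')
rel_bad=$(echo "$bad" | awk '{print $3}')

# Flag any difference in the release field between the two hosts:
if [ "$rel_good" = "$rel_bad" ]; then
    echo "same release: $rel_good"
else
    echo "release mismatch: $rel_good vs $rel_bad"
fi
```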
How much work would it be to do a re-install over the existing version?