Fifo \pipe error
Moderators: chulett, rschirm, roy
-
- Premium Member
- Posts: 32
- Joined: Wed Aug 20, 2014 11:17 am
Fifo \pipe error
Hi,
I've been dealing with an issue recently that I can't find many causes for. I have a number of delta sequence jobs, each of which calls about 6 other jobs inside it, all with different invocation IDs.
I've been load testing these delta sequence jobs, running only around 8 of them at the same time, whereas in production there will be upwards of 30+. However, about 90% of the time one of those 8 jobs fails, because one of the jobs within the sequence fails. Which job within the sequence fails, and which sequence it happens in, is completely random, and I've also run into this error while running a single sequence. The log gives me the following line:
OPENSEQ '\\.\pipe\Application-RT_SC274-App_Splunk_Message.CUSTOMER_NATL' called: 10:39:36 02 OCT 2014
It repeats this OPENSEQ message about every second, usually for exactly 2 minutes. "Application" is our project name, and "App_Splunk_Message.CUSTOMER_NATL" is the job that failed; I'm not sure what the rest of the string means.
Afterwards it gives me the following error:
Error setting up internal communications (fifo \\.\pipe\Application-RT_SC274-App_Splunk_Message.CUSTOMER_NATL) STATUS() 2
The only real resource I found online about this issue is here
http://www-01.ibm.com/support/docview.w ... wg21445893
Our admin has confirmed it's not virus scans, and there is plenty of disk space available while these jobs are running. Any more input or ideas would be much appreciated!
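For what it's worth, the pattern in the log (one OPENSEQ attempt about every second for roughly two minutes, then a failure whose STATUS() of 2 looks like "file not found") resembles a retry-until-timeout loop. Here's a minimal Python sketch of that behaviour, purely as an illustration; the timeout, interval, and exception type are assumptions, not anything confirmed about the engine:

```python
import time

def open_with_retry(open_fn, timeout_s=120, interval_s=1.0):
    """Keep retrying open_fn until it succeeds or timeout_s elapses.

    Mirrors the logged behaviour: one open attempt per second for
    about two minutes, after which the final failure is surfaced.
    """
    deadline = time.monotonic() + timeout_s
    while True:
        try:
            return open_fn()
        except FileNotFoundError:          # analogous to STATUS() 2
            if time.monotonic() >= deadline:
                raise                      # give up after ~2 minutes
            time.sleep(interval_s)
```

If something like this is what the engine does internally, the interesting question is why the pipe never appears within the window, not the retrying itself.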
Thanks,
Taylor
-
- Participant
- Posts: 54607
- Joined: Wed Oct 23, 2002 10:52 pm
- Location: Sydney, Australia
- Contact:
Could it be that the multiple instances are all (or some of them) trying to access file \\.\pipe\Application-RT_SC274-App_Splunk_Message.CUSTOMER_NATL at the same time? The operating system only allows one writer at a time.
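To sketch that failure mode: going by the single logged name, the pipe appears to be named <project>-RT_SC<n>-<JobName>.<InvocationId> (that derivation is a guess from one log line, not documented fact). If two concurrent runs end up with the same components, they compute the same pipe name and contend for one pipe:

```python
def pipe_name(project, sc, job, invocation_id):
    # Assumed naming scheme, reverse-engineered from the log line above
    return rf"\\.\pipe\{project}-RT_SC{sc}-{job}.{invocation_id}"

# Two runs sharing a job name AND invocation id collide on one pipe name
runs = [
    ("Application", 274, "App_Splunk_Message", "CUSTOMER_NATL"),
    ("Application", 274, "App_Splunk_Message", "CUSTOMER_NATL"),
]
names = [pipe_name(*run) for run in runs]
collision = len(names) != len(set(names))   # True when names clash
```

On this theory, distinct invocation IDs should keep the names distinct, so a collision would imply something else (a reused ID, or a stale pipe left behind) rather than the naming itself.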
IBM Software Services Group
Any contribution to this forum is my own opinion and does not necessarily reflect any position that IBM may hold.
-
- Premium Member
- Posts: 32
- Joined: Wed Aug 20, 2014 11:17 am
Interesting. I was assuming, much like Ray, that this happened when multiple instances were stepping on each other. But if it can happen while the job runs in isolation that's a whole 'nuther kettle of fish.
I believe that a STATUS of 2 means "file not found". If you were on a UNIX server I'd suggest making sure your open files limit was high enough, but I have no clue what the equivalent would be on Windows. I'd involve your official support provider on this one.
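On a UNIX box, the limit mentioned above can at least be inspected from Python's standard library (the `resource` module is UNIX-only, which is rather the point; this is a diagnostic sketch, not a fix, and the `/proc/self/fd` count is Linux-specific):

```python
import os
import resource  # UNIX-only standard-library module; no Windows equivalent

def fd_headroom(needed=256):
    """True if the soft open-files limit leaves room for `needed` more
    descriptors beyond what this process already has open (Linux check)."""
    soft, _hard = resource.getrlimit(resource.RLIMIT_NOFILE)
    if soft == resource.RLIM_INFINITY:
        return True
    fd_dir = "/proc/self/fd"
    in_use = len(os.listdir(fd_dir)) if os.path.isdir(fd_dir) else 0
    return soft - in_use >= needed
```

A process may raise its own soft limit up to the hard limit with `resource.setrlimit`, but system-wide changes belong in `limits.conf` or the service's init configuration.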
-craig
"You can never have too many knives" -- Logan Nine Fingers
-
- Premium Member
- Posts: 32
- Joined: Wed Aug 20, 2014 11:17 am
Was going to link to that one as well, but it is so UNIX-centric that I decided to mostly stick with this from the wrap-up paragraph:
"If the above tests do not isolate the cause of file system i/o problem, then it may be necessary to contact Information Server support for assistance in performing a system trace (truss or strace) of the dsapi process launching the failing jobs to track down the actual OS operations which are failing."
-craig
"You can never have too many knives" -- Logan Nine Fingers
-
- Premium Member
- Posts: 32
- Joined: Wed Aug 20, 2014 11:17 am
We worked with an experienced consultant today, and he narrowed it down to a process on our server that was causing this issue: something called "sh.exe" is randomly breaking and causing this error. We have yet to determine why it's happening.
As a side note, we worked with IBM before to fix another timeout issue, and their solution was to set APT_PM_USE_STANDALONE_EXE = 1. This was supposed to avoid the shell, and it resolved the immediate issue at the time.
However, we assumed that sh.exe would no longer be getting called, yet it's still being called somehow.
My question now is: does anyone know a way to completely avoid calling this "sh.exe" process, or why jobs randomly break when it is called?
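For anyone landing here later: the variable IBM had us set is an ordinary DataStage environment variable, so it can be defined at the project level in the Administrator client or added as a job parameter. The value from that earlier support case was:

```
APT_PM_USE_STANDALONE_EXE=1
```

Per that case, this tells the parallel engine to launch player processes directly rather than via the shell; whether it suppresses every sh.exe invocation is exactly what's in question here.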
-
- Premium Member
- Posts: 32
- Joined: Wed Aug 20, 2014 11:17 am
For purposes of updating this post with the solution:
We found that an MKS Toolkit file (mkstk.dll) in System32 was showing up as unregistered by Windows. Now that we have registered this .dll, these errors seem to have vanished. IBM told us this was probably because our servers were not connected to the internet (and still aren't) when DataStage was installed, so this file never got registered.
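For completeness, registering a DLL by hand is normally done with the stock regsvr32 utility from an elevated command prompt, provided the DLL exposes the standard DllRegisterServer entry point; the exact path below is an assumption based on the System32 location mentioned above:

```
regsvr32 C:\Windows\System32\mkstk.dll
```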