Jobs aborted with "Write to dataset failed"

Post questions here relative to DataStage Enterprise/PX Edition for such areas as Parallel job design, Parallel datasets, BuildOps, Wrappers, etc.

Moderators: chulett, rschirm, roy

Flyerman_2
Premium Member
Posts: 11
Joined: Mon Aug 17, 2009 9:42 am

Jobs aborted with "Write to dataset failed"

Post by Flyerman_2 »

Hi,

DataStage 8.0.1
OS: AIX 5.3.0.0

When trying to write to a dataset, I'm getting the following errors:

########################
FATAL :
########################
APT_CombinedOperatorController(7),4: Write to dataset on [fd 17] failed (Error 0) on node node5, hostname <Server name>
APT_CombinedOperatorController(7),4: Orchestrate was unable to write to any of the following files:
APT_CombinedOperatorController(7),4: /DataStage/data/<filename>
APT_CombinedOperatorController(7),0: Write to dataset on [fd 17] failed (Error 0) on node node1, hostname <Server name>
APT_CombinedOperatorController(7),0: Orchestrate was unable to write to any of the following files:
APT_CombinedOperatorController(7),0: /DataStage/data/<filename>
APT_CombinedOperatorController(7),4: Block write failure. Partition: 4
<Filename>,4: Failure during execution of operator logic.
APT_CombinedOperatorController(7),4: Fatal Error: File data set, file "/DataStage/data/<Filename>.ds".; output of "<Filename>": DM getOutputRecord error.
APT_CombinedOperatorController(7),0: Block write failure. Partition: 0
<Filename>,0: Failure during execution of operator logic.
APT_CombinedOperatorController(7),0: Fatal Error: File data set, file "/DataStage/data/<Filename>.ds".; output of "<Filename>": DM getOutputRecord error.
node_node1: Player 67 terminated unexpectedly.
node_node5: Player 64 terminated unexpectedly.
main_program: APT_PMsectionLeader(1, node1), player 67 - Unexpected exit status 1.
<Filename 2>,0: Failure during execution of operator logic.
<Filename 2>,0: Fatal Error: Unable to allocate communication resources
node_node1: Player 42 terminated unexpectedly.
main_program: APT_PMsectionLeader(1, node1), player 42 - Unexpected exit status 1. (...)
<Filename 2>,4: Failure during execution of operator logic.
<Filename 2>,4: Fatal Error: Unable to allocate communication resources
main_program: Step execution finished with status = FAILED.
########################

Failed to execute job :<Job Name>. Return Code : 16

In the same log, we also see:
Message:: main_program: The open files limit is 2000; raising to 2147483647.
I do not know if this is normal.

Another log in /DataStage/MetaData/<project_name>/&PH&/ gives
"DataStage Job 1035 Phantom 20950
readSocket() returned 16
DataStage Phantom Finished."
Nothing in the setup has changed, and we have the correct UNIX permissions on the directories.


We now have this problem on 3 servers (2 of them Production), always with the same error message and always in an old job.

We found that replacing a Join stage with a Lookup made that job work fine, but then the issue moved to the next job. :(
All these jobs had worked for a long time, and we have too many jobs to replace every Join with a Lookup.

We also checked the ulimit parameters.

We have the same values on all 3 servers.

From the UNIX box:
time(seconds) unlimited
file(blocks) unlimited
data(kbytes) unlimited
stack(kbytes) 4194304
memory(kbytes) unlimited
coredump(blocks) 2097151
nofiles(descriptors) unlimited

but from sh -c "ulimit -a" run through the DataStage Administrator command window:
time(seconds) unlimited
file(blocks) unlimited
data(kbytes) 1572864
stack(kbytes) 4194304
memory(kbytes) unlimited
coredump(blocks) 0
nofiles(descriptors) unlimited

We can see 2 differences between the two outputs (data and coredump). I do not know why.

For information, months ago we added the following to the dsenv script:
ulimit -d unlimited
ulimit -m unlimited
# ulimit -s unlimited
ulimit -f unlimited

Nothing else in DSPARAMS.
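
In case it is useful when comparing the two outputs: a minimal sketch of how to tell whether the two differing values (data and coredump) are soft or hard limits, assuming a standard AIX ksh/sh - the -S and -H flags are ordinary shell options, not anything DataStage specific:

# Run once from a login shell and once through the DataStage Administrator
# command window (sh -c "...") to see what the engine processes inherit.
ulimit -Sd ; ulimit -Hd     # data segment: soft and hard limit (kbytes)
ulimit -Sc ; ulimit -Hc     # core dump: soft and hard limit (blocks)

On AIX the per-user hard limits normally come from /etc/security/limits, so a ulimit line in dsenv can only raise a soft limit up to the hard limit allowed for that user.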

--------

This is the <Server name>.apt configuration file:

{
  node "node1"
  {
    fastname "<Server name>"
    pools ""
    resource disk "/DataStage/data/PX1/<project name>/DS" {pools ""}
    resource scratchdisk "/DataStage/data/PX1/<project name>/SCRATCH" {pools ""}
  }
  ...
  node "node6"
  {
    fastname "<Server name>"
    pools ""
    resource disk "/DataStage/data/PX6/<project name>/DS" {pools ""}
    resource scratchdisk "/DataStage/data/PX6/<project name>/SCRATCH" {pools ""}
  }
}

We have enough disk space; we monitored the file systems while the job was running and saw no significant change.

We have 6 file systems, one per node, each with more than 30 GB free.

We also checked the tmp directory: no disk space problem there.
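
For completeness, this is roughly the per-node check meant above (plain AIX commands only; the paths simply mirror the configuration file shown earlier, with <project name> left as a placeholder):

# Illustrative only: free space on every node's resource and scratch
# file systems, plus /tmp, while the job is running.
for n in 1 2 3 4 5 6
do
    df -g /DataStage/data/PX$n/<project name>/DS
    df -g /DataStage/data/PX$n/<project name>/SCRATCH
done
df -g /tmp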


Do you have any ideas?

Thanks for your help.
chulett
Charter Member
Posts: 43085
Joined: Tue Nov 12, 2002 4:34 pm
Location: Denver, CO

Post by chulett »

Seems to me that a "Block write failure" is either because the disk is full or you have a media error / bad block / hardware issue. You monitored the space while the jobs ran and the error was generated?
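
For example (a rough sketch only, assuming AIX and standard OS tools rather than anything DataStage specific), both conditions can be checked from the command line while the job runs:

df -g /DataStage/data    # free space on the file system holding the dataset files
errpt | head -20         # recent AIX error log entries; media, bad block and
                         # other hardware errors show up here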

Also, is your O/S 32bit or 64bit?
-craig

"You can never have too many knives" -- Logan Nine Fingers
Flyerman_2
Premium Member
Posts: 11
Joined: Mon Aug 17, 2009 9:42 am

Post by Flyerman_2 »

First, thank you for your help.

O/S is 64bit.

Yes, I monitored the space while the jobs ran and the error was generated. Nothing significant. The job failed after a little more than 1 minute.

What is strange is that we have the same problem at the same moment on 3 different servers that are not in the same place.

A few days before, we updated our backup and restore script on all servers. We just added the STOP and START of the ASB Node, as described in the "InfoSphere Information Server Administration Guide".
The return code is 0, so it seems OK.
And I do not see why this update could generate this error.
chulett
Charter Member
Posts: 43085
Joined: Tue Nov 12, 2002 4:34 pm
Location: Denver, CO

Post by chulett »

Interesting that I'm thanked for my help and yet my attempt to help is rated as 'off-topic/superfluous'. Nice. :?

I ask re: the 'bitness' of your O/S as I've seen issues like this in a 32bit environment that did not occur in a 64bit one. I can't imagine any changes to your backup script would generate this error unless someone decided to run/test it while jobs were running. Speaking of which, what the heck does this mean?

"What it is strange, is we have the same problem at the same moment in 3 differents servers not in the same place."

Three different servers not in the same place? Are you saying this happened simultaneously on three different physical pieces of hardware? :shock:
-craig

"You can never have too many knives" -- Logan Nine Fingers
sjfearnside
Premium Member
Posts: 278
Joined: Wed Oct 03, 2007 8:45 am

Post by sjfearnside »

I am experiencing this problem now. Did you solve it? If so, what was the solution?
chulett
Charter Member
Posts: 43085
Joined: Tue Nov 12, 2002 4:34 pm
Location: Denver, CO

Post by chulett »

Which problem, exactly? Block write failure? At the same moment on 3 different servers not in the same place?
-craig

"You can never have too many knives" -- Logan Nine Fingers
sjfearnside
Premium Member
Posts: 278
Joined: Wed Oct 03, 2007 8:45 am

Post by sjfearnside »

Write to dataset on [fd 17] failed (Error 0) on node node5, hostname <Server name>
Nagaraj
Premium Member
Posts: 383
Joined: Thu Nov 08, 2007 12:32 am
Location: Bangalore

Post by Nagaraj »

Any other ideas to get around this block write failure?
chulett
Charter Member
Posts: 43085
Joined: Tue Nov 12, 2002 4:34 pm
Location: Denver, CO

Post by chulett »

You should start your own post if you are having a similar problem.
-craig

"You can never have too many knives" -- Logan Nine Fingers
Nagaraj
Premium Member
Posts: 383
Joined: Thu Nov 08, 2007 12:32 am
Location: Bangalore

Post by Nagaraj »

chulett wrote:You should start your own post if you are having a similar problem.
I just thought that since this thread is still open, I would continue it and then mark it as resolved or note the workaround.

:)
chulett
Charter Member
Posts: 43085
Joined: Tue Nov 12, 2002 4:34 pm
Location: Denver, CO

Post by chulett »

That's the problem - you can't. It's not your thread.
-craig

"You can never have too many knives" -- Logan Nine Fingers
fridge
Premium Member
Posts: 136
Joined: Sat Jan 10, 2004 8:51 am

Post by fridge »

It may be worth checking whether the dataset got written to at all - check the size of the dataset data files in the .../DataSets directory.

The reason I say this is that we hit a problem some years ago where our dataset segments were throwing up a similar error. After checking disk space and the file limits on the user, I checked the sizes and each segment was failing at 512 bytes short of 1 GB.

The issue was actually to do with the PX setup - I can't, I'm afraid, remember the exact details - but it was to do with the memory model. It was explained to me that each executable has x amount of addressable memory, and this can be configured as y bytes for data, z bytes for 'code' and so on (to be honest my sysadmin got halfway through this and I dozed off, but you get the idea). It was a simple UNIX command to change the configuration - the command syntax was supplied by Ascential (pre-IBM).

I know the above isn't a solution (if you haven't solved it already), but if you check the sizes as suggested and see a similar symptom (512 bytes short of 1 GB), I will try to dig out my notes.
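
If it helps, this is roughly what that size check looks like (illustrative only; the path is just node1's resource disk from the configuration file posted earlier, and 1 GB is 1073741824 bytes, so 512 bytes short of it would be 1073741312):

# List the dataset data files, largest first, and see whether each
# segment stops just short of the 1 GB mark.
find /DataStage/data/PX1/<project name>/DS -type f -exec ls -l {} \; | sort -rn -k5 | head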
ulab
Participant
Posts: 56
Joined: Mon Mar 16, 2009 4:58 am
Location: bangalore

This was resolved after changing the configuration file

Post by ulab »

This issue got resolved after changing the configuration file (config.apt).
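
For anyone finding this thread later: a parallel job picks up its configuration file from the APT_CONFIG_FILE environment variable, so "changing the configuration file" generally means either editing the file that variable points to or pointing it at a different one. A minimal sketch (the path and file name below are only examples):

# Set at project or job level in Administrator/Designer, or export it
# before starting the job from the command line with dsjob.
APT_CONFIG_FILE=/DataStage/Configurations/<Server name>_new.apt   # example path only
export APT_CONFIG_FILE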
Ulab
----------------------------------------------------
help, it helps you today or Tomorrow