
waitForWriteSignal(): Premature EOF on node

Posted: Thu Aug 03, 2017 3:22 pm
by js103755
Hello everyone,

I've been struggling with this issue for the last couple of days. I have a job that combines six datasets using a Funnel stage, and the output of the Funnel stage is fed into an Aggregator stage. The Aggregator groups the data on one id column, aggregates the other columns, and writes the output to a dataset.
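
In other words, the logic of the job is roughly this (a minimal Python sketch just to show the shape of it; the id and amount column names are made up, and the real job aggregates several columns):

Code:

from collections import defaultdict

def funnel(*datasets):
    # The Funnel stage just concatenates its input links into one stream.
    for ds in datasets:
        yield from ds

def aggregate(rows, key="id"):
    # The Aggregator groups on the id column and sums the other columns.
    totals = defaultdict(float)
    for row in rows:
        totals[row[key]] += row["amount"]
    return totals

ds1 = [{"id": 1, "amount": 10.0}]
ds2 = [{"id": 1, "amount": 5.0}, {"id": 2, "amount": 2.5}]
print(dict(aggregate(funnel(ds1, ds2))))  # {1: 15.0, 2: 2.5}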

The job had been running fine for the last couple of months, but since the last migration it's failing with the error log below:

AGG_REV,0: Failure during execution of operator logic.
AGG_REV,0: Input 0 consumed 315815 records.
AGG_REV,0: Output 0 produced 2729 records.
AGG_REV,0: Fatal Error: Unable to allocate communication resources
STLMNT_UPD,0: Failure during execution of operator logic.
STLMNT_UPD,0: Input 0 consumed 0 records.
STLMNT_UPD,0: Output 0 produced 0 records.
STLMNT_UPD,0: Fatal Error: waitForWriteSignal(): Premature EOF on node fanucci Bad file descriptor
node_node1: Player 9 terminated unexpectedly.
main_program: APT_PMsectionLeader(1, node1), player 9 - Unexpected termination by Unix signal 11(SIGSEGV).
node_node1: Player 10 terminated unexpectedly.
main_program: APT_PMsectionLeader(1, node1), player 10 - Unexpected exit status 1.
APT_PMsectionLeader(1, node1), player 8 - Unexpected exit status 1.
APT_PMsectionLeader(2, node2), player 10 - Unexpected exit status 1.
main_program: Step execution finished with status = FAILED.


AGG_REV is the Aggregator stage name and STLMNT_UPD is the target dataset name.

The job runs fine in dev, but it's failing in QA.
I've looked through the forums but didn't find a solution. I also checked the target directory: there is no descriptor file, yet the error still says 'Bad file descriptor'. Help please.

Posted: Fri Aug 04, 2017 1:05 pm
by UCDI
What happened before these errors?

Was there a warning before your very first log file line here? Any previous errors at all? It looks like this section of the log is actually after your real problem.

Posted: Mon Aug 07, 2017 9:50 am
by js103755
I didn't get any other error or warning messages before the first line, just the usual startup messages; see below.

Code:

Parallel job initiated
Parallel job default NLS map UTF-8, default locale OFF
main_program: IBM InfoSphere DataStage Enterprise Edition 11.5.0.7555 
Copyright (c) 2001, 2005-2015 IBM Corporation. All rights reserved
main_program: The timezone environment variable TZ is currently not set in your environment which can lead to significant performance degradation. It is recommended that you set TZ=:/etc/localtime in your environment.
main_program: conductor uname: -s=Linux; -r=2.6.32-573.22.1.el6.x86_64; -v=#1 SMP Thu Mar 17 03:23:39 EDT 2016; -n=fanucci; -m=x86_64
main_program: orchgeneral: loaded
orchsort: loaded
orchstats: loaded
main_program: APT_SortedGroup2Operator::describeOperator nkeys: 1
main_program: APT configuration file: /opt/Infosphere/InfoServer/Server/Configurations/default.apt
{
	node "node1"
	{
		fastname "fanucci"
		pools ""
		resource disk "/opt/tempdata/datasets1" {pools ""}
		resource disk "/opt/tempdata/datasets2" {pools ""}
		resource disk "/opt/tempdata/datasets3" {pools ""}
		resource scratchdisk "/opt/tempdata/scratch1" {pools ""}
		resource scratchdisk "/opt/tempdata/scratch2" {pools ""}
		resource scratchdisk "/opt/tempdata/scratch3" {pools ""}
	}
	node "node2"
	{
		fastname "fanucci"
		pools ""
		resource disk "/opt/tempdata/datasets4" {pools ""}
		resource disk "/opt/tempdata/datasets5" {pools ""}
		resource disk "/opt/tempdata/datasets6" {pools ""}
		resource scratchdisk "/opt/tempdata/scratch4" {pools ""}
		resource scratchdisk "/opt/tempdata/scratch5" {pools ""}
		resource scratchdisk "/opt/tempdata/scratch6" {pools ""}
	}
}

Posted: Mon Aug 07, 2017 10:07 am
by UCDI
Is this load bigger than what you had been running?
Maybe try setting the Aggregator stage to sort mode?
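
The difference between the two modes, roughly, in a Python sketch (id and amount are made-up column names, not anything from your job):

Code:

from collections import defaultdict
from itertools import groupby
from operator import itemgetter

def agg_hash(rows):
    # Hash mode: one in-memory bucket per group, so input order doesn't
    # matter, but memory grows with the number of distinct key values.
    totals = defaultdict(float)
    for r in rows:
        totals[r["id"]] += r["amount"]
    return dict(totals)

def agg_sort(rows):
    # Sort mode: input must already be sorted on the key; only one group
    # is held in memory at a time, so memory use stays flat.
    return {k: sum(r["amount"] for r in grp)
            for k, grp in groupby(rows, key=itemgetter("id"))}

rows = [{"id": 1, "amount": 2.0}, {"id": 1, "amount": 3.0},
        {"id": 2, "amount": 4.0}]  # already sorted on id
print(agg_hash(rows), agg_sort(rows))  # both: {1: 5.0, 2: 4.0}

Sort mode needs far less memory on big loads, which is why it's worth trying here.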

Posted: Mon Aug 07, 2017 11:34 am
by js103755
The load is the usual amount we've been running. Also, the Aggregator was already set to sort mode.

Posted: Mon Aug 07, 2017 6:26 pm
by Mike
You've told the aggregator to expect sorted data. So have you partitioned and sorted the data by the keys that the aggregator is using? Have you set up your funnel stage to preserve the sort order if you have indeed sorted the data upstream?
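
To see why that matters, here's a little Python analogy (made-up id and amount columns): a sorted-mode aggregator only merges adjacent rows with equal keys, so if the data isn't actually sorted on the grouping keys you get split groups:

Code:

from itertools import groupby
from operator import itemgetter

rows = [{"id": 2, "amount": 5.0},
        {"id": 1, "amount": 10.0},
        {"id": 2, "amount": 7.0}]  # id 2 appears twice, non-adjacent

# Unsorted input: a new group starts wherever the key changes.
print([(k, sum(r["amount"] for r in g))
       for k, g in groupby(rows, key=itemgetter("id"))])
# [(2, 5.0), (1, 10.0), (2, 7.0)]  <- group for id 2 is split

rows.sort(key=itemgetter("id"))  # sort on the grouping key first
print([(k, sum(r["amount"] for r in g))
       for k, g in groupby(rows, key=itemgetter("id"))])
# [(1, 10.0), (2, 12.0)]  <- correct totals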

Mike

Posted: Tue Aug 08, 2017 10:04 am
by js103755
I'm sorting the data on the input of the Aggregator stage; before that, it's left as is. On the input of the Aggregator stage I've used hash partitioning with an ascending sort on the key. On the input of the target dataset I've used Same partitioning to collect the data.

It should work this way as well, right?
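
Here's what I understand the hash partitioning plus the sort to be doing, as a rough Python sketch (two partitions, made-up id column; Python's built-in hash() stands in for the DataStage hash partitioner):

Code:

def hash_partition(rows, nparts=2, key="id"):
    # Hash partitioning: every row with the same key value lands in the
    # same partition, so no group is split across partitions.
    parts = [[] for _ in range(nparts)]
    for r in rows:
        parts[hash(r[key]) % nparts].append(r)
    # Sorting each partition on the key lets a sort-mode
    # aggregator stream through it one group at a time.
    for p in parts:
        p.sort(key=lambda r: r[key])
    return parts

rows = [{"id": i % 3, "amount": float(i)} for i in range(6)]
for i, p in enumerate(hash_partition(rows)):
    print(i, [r["id"] for r in p])
# 0 [0, 0, 2, 2]
# 1 [1, 1]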

Posted: Tue Aug 08, 2017 12:56 pm
by js103755
It's working now. I kept one column at a time in the aggregation and tested the job. After testing all the columns, I found two columns that cause the problem when 'preserve type' is set to true: one is a bigint, the other a decimal. When I set 'preserve type' to false and specify the type in the stage, the job runs without any issue. This is very strange; there is no way the type would change in the incoming data.
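
Conceptually the workaround is the difference between letting the output column inherit the input column's type and declaring the type yourself. A loose Python analogy (agg_sum is a made-up helper, not anything from DataStage):

Code:

from decimal import Decimal

def agg_sum(values, out_type=None):
    # "Preserve type" = the result keeps whatever type the input carried.
    # The workaround = cast the result to an explicitly declared type.
    total = sum(values)
    return out_type(total) if out_type else total

vals = [Decimal("1.10"), Decimal("2.20")]
print(agg_sum(vals))                  # 3.30 (keeps Decimal, like preserve type = true)
print(agg_sum(vals, out_type=float))  # 3.3 (explicit type, like preserve type = false)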

I was browsing the IBM support pages and found the exact issue:

JR53787: PARALLEL JOB FAILS WHEN AGGREGATOR STAGE USES BIGINT OR DOUBLE DATA TYPES WITH PRESERVE TYPE PROPERTY SET AT TRUE.
http://www-01.ibm.com/support/docview.w ... wg1JR53787

There is a patch install for this, but our DataStage install is the latest version, and I assume these patch fixes would have been included in the latest release. Anyway, I will analyse the input data and try to find out whether anything in it might be causing the preserve type to fail.
Thanks all for your inputs.

Posted: Tue Aug 08, 2017 1:45 pm
by chulett
js103755 wrote: There is a patch install for this, but our DataStage install is the latest version, and I assume these patch fixes would have been included in the latest release.
I would make no such assumption. :wink:

Verify.

Posted: Wed Sep 27, 2017 12:10 pm
by sjfearnside
Please let us know whether the patch was the solution for your issue, or whether it was something else.

Thanks