error running any job with lookup stage

Posted: Thu Dec 26, 2013 9:13 pm
by srividya
Hi,

I have a few jobs (in fact, any job with a Lookup or Join stage) that hang in my testing environment.

All these jobs have a common flow:

We read from an Oracle database, use a Lookup stage to look up data from the Oracle DB, and write the data to a Dataset.

What we did to verify:

We checked with the DBAs; there are no locks on the DB.
We cleared all the RT logs and &PH&.
We bounced the DataStage server twice, without any luck.
We created a copy of the job and replaced the Lookup with a Join stage: same result.
We removed the Join/Lookup and implemented the logic in the source Oracle stage: the job completes within 30 seconds.
We moved the same job to another environment and pointed it to the testing database: the job again completes within 30 seconds.

I am not sure what we have missed. I am thinking of deleting and re-creating the project tomorrow or next week, but I would like to understand whether there is anything else I can look at.

I also noticed that every time we try to run one of the jobs that hangs, a process like the one below is created.

dsadm 32416 1 0 Dec26 ? 00:00:00 /opt/app/xxxxxxxxx/InformationServer8.7/Server/PXEngine/bin/osh -f RT_SC59/OshScript.osh -monitorport 13400 -pf RT_SC59/jpfile -impexp_charset UTF-8 -string_charset UTF-8 -input_charset UTF-8 -output_charset UTF-8 -collation_sequence OFF

I have not seen this type of process before, and the information scattered across the forum has confused me even more. Maybe I will take another stab at it after some time. How can I clean up processes like this one?
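For anyone who wants to at least list such processes before deciding what to do with them, here is a minimal Python sketch. It assumes the `ps -ef` column layout shown above (UID, PID, PPID, C, STIME, TTY, TIME, CMD) and treats "an `osh` executable whose parent PID is 1" as the thing to look for. That criterion is only a guess based on the one line posted; it does not prove a process is stale, and the function name `find_orphaned_osh` is made up for this example. It only reports matches and does not kill anything.

```python
import os


def find_orphaned_osh(ps_output):
    """Return osh processes whose parent PID is 1, parsed from `ps -ef` text.

    Assumption: columns are UID PID PPID C STIME TTY TIME CMD, as in the
    line posted above. Other ps formats will need a different split.
    """
    matches = []
    for line in ps_output.splitlines():
        # Split into at most 8 fields so the full command line stays intact.
        fields = line.split(None, 7)
        if len(fields) < 8 or not fields[1].isdigit() or not fields[2].isdigit():
            continue  # header line or unexpected layout
        user, pid, ppid, cmd = fields[0], int(fields[1]), int(fields[2]), fields[7]
        executable = os.path.basename(cmd.split()[0])
        # Hypothetical criterion: osh binary re-parented to init (PPID 1).
        if executable == "osh" and ppid == 1:
            matches.append({"user": user, "pid": pid, "command": cmd})
    return matches
```

Pass it the captured text of `ps -ef` and compare the returned PIDs with the jobs that Director shows as running before touching anything.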

Appreciate your help on this.

Thank you
Sri

Posted: Fri Dec 27, 2013 6:36 am
by priyadarshikunal
I don't think this process is causing the issue; it is normal.

Are you sure that the queries are fine? Can you monitor whether the number of buffer gets is increasing? Can you see any progress in the monitor?

Posted: Fri Dec 27, 2013 7:58 am
by srividya
The queries are fine. If we remove the Lookup stage and dump both the main query and the lookup data to Peek stages, the job finishes in 30 seconds.

As soon as the process kicks off, it goes dormant and waits forever, until we log out the PIDs from Director. Attempting to release resources was unsuccessful.

Posted: Fri Dec 27, 2013 1:55 pm
by soumya5891
Are you using any kind of partitioning on the source data just before it enters the Lookup, or have you kept it as Auto?

Posted: Fri Dec 27, 2013 3:03 pm
by pavi
I believe it is a memory issue. What configuration are you using? What is the count of records flowing through the reference link? Are you doing an explicit sort before the Join stage? How are the CPU stats while you are running the job?

Posted: Fri Dec 27, 2013 8:13 pm
by srividya
The reference data is about 100 rows, and we have about 800K records from the source. The sort is carried out in the query, so we don't have any sorts applied in DataStage.

Disk space on scratch is about 20% and on the server it is 27%.
Swap memory utilization was 50% at any time while the job was in working condition. As I have said, this is an existing process that was running fine until 7 am that day. :roll:

The process gets into a "hung" state even before I can check CPU stats or memory usage.

It looks like DataStage forgets to process this job as soon as it generates the main_program information log :shock:
The only thing I can see is a PID similar to the one I initially posted.

Posted: Fri Jan 03, 2014 3:24 pm
by srividya
The issue was resolved once we restarted the server. We do not know what the problem was; we were forced to restart before we could understand the issue, because the testing phase was getting delayed :(