"Data source is empty" viewing large hashed file w

Post questions here relative to DataStage Server Edition for such areas as Server job design, DS Basic, Routines, Job Sequences, etc.



Post by dufflbag »

We have a job that creates a dynamic hashed file that is used as the input to a transform stage.
We use an after-stage ExecTCL DELETE.FILE command in the job's final transform to clean up the hashed file.
The hashed file's output link contains the following SELECT clauses in the Selection tab:

@ID <> '' BY EMPSSN BY PATDOB BY GENDER BY varEligStartDt

We sort this way because it has proven faster than the Sort stage.
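For anyone unfamiliar with the BY chain: it is just a multi-key ascending sort. A rough coreutils equivalent on a pipe-delimited extract would look like this (a sketch only; the file name, values, and column positions are made up for illustration):

```shell
# Sketch: multi-key sort equivalent of
#   BY EMPSSN BY PATDOB BY GENDER BY varEligStartDt
# assuming a pipe-delimited extract with those fields in columns 1-4.
printf '111223333|1980-02-01|M|2004-01-01\n111223333|1975-06-15|F|2003-07-01\n' > /tmp/elig_extract.txt
sort -t'|' -k1,1 -k2,2 -k3,3 -k4,4 /tmp/elig_extract.txt
```

The two rows tie on the first key, so the second key (PATDOB) decides the order, just as the chained BY clauses do.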



Here is the problem.

Under certain circumstances, the hashed file gets loaded but then passes no records to the next transform.
The job does not abort. It goes about its business processing zero records and finishes OK.
Even pressing the View Data button results in "Data source is empty," after a delay of four to five minutes.
(Note that I had to disable the after-stage DELETE.FILE command to get the hashed file to stick around.)

We never encountered this problem until last week. However, we're not sure we've ever passed quite so many
records to the hashed file before. The most recent file to fail wrote 7,292,355 records to the hashed file.
Smaller source files still consistently work as expected.


Here is a look at the hashed file's directory for the run with 7,292,355 records:

$ ls -l t_DACV2_cl1659_f23766_Elig_Hash
total 2648320
-rw-rw-r-- 1 userid secgroup 1111371776 May 14 00:11 DATA.30
-rw-rw-r-- 1 userid secgroup 244328448 May 14 00:11 OVER.30
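Those two files alone total roughly 1.3 GB, which is worth comparing against whatever scratch space the sort has to spool to. A trivial check (the byte counts below are copied from the listing above):

```shell
# Sum DATA.30 + OVER.30 from the ls -l listing above to see how much
# data the BY-clause sort has to work through.
data=1111371776
over=244328448
total=$((data + over))
echo "hashed file bytes: $total"   # roughly 1.3 GB
```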


By removing the selection criteria, I can get the data browser to display data.
Even after adding back some of the criteria, the data browser will still display data.
I have not tried every permutation of the criteria, but I have successfully
viewed the hashed file data using each of the sort fields in a shortened BY clause.
Each of the following SELECT clauses works:

@ID <> ''
@ID <> '' BY EMPSSN
@ID <> '' BY PATDOB
@ID <> '' BY GENDER
@ID <> '' BY varEligStartDt

but

@ID <> '' BY EMPSSN BY PATDOB BY GENDER BY varEligStartDt

fails again.



What are the factors that limit DataStage's ability to handle a large amount of data in this way?
Temp space? Memory? The same job processing the same input file can fail on one server and then
succeed on another.


Due to deadline pressure, we ended up processing the eight required files by moving from server to server.
We didn't plan it this way, but each time the job stopped working on one server, we'd try to fix it,
eventually give up, and then run the original job on a different server.


Ran it on new production server:
succeeded on file #1, failed on file #2

Tested it on development server:
succeeded on file #2, so continued processing files #2 - #7, failed on file #8

Ran it on other production server:
succeeded on file #8


It's the way the same job that failed on one server then worked on another server
that makes me want to know what environmental factors or tunables are affecting the job.
If there is something tunable, that would be helpful to know. If we simply need to redesign
the job to handle the larger data feeds, that would be helpful to know, too.

The job looks like this:

  source --> xform --> hashed file --> xform --> target
             (the first xform writes three links into
              the same hashed file, one per monthly record)

The source contains quarterly records, from which we create three monthly records.


We ensured that no NULLs were being sent to any of the sort fields.
We even tried "landing" the hashed file to a sequential file on the way to the transform stage.

Any insight you can provide would be appreciated.


DataStage 7.1.r1
AIX 5.2

If you'd like more information, I'll be glad to provide it.
Please let me know.


Doug Carpenter

Post by ray.wurlod »

Welcome aboard! :D

Your analysis is sound. However, there is no documented limit on the complexity of sorting. Get in touch with your support provider; the query sort engine can be tuned, but it's a delicate process (you alter parameters called QS... in uvconfig - read the dire warnings posted there). It may also be beneficial to increase scratch space, etc. in uvconfig.
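For anyone following along, those settings live in uvconfig under $DSHOME. A quick way to eyeball the query-sort and temp-space lines (a sketch: the sample file below stands in for your real uvconfig, and QSEXAMPLE is a placeholder, not a real parameter name):

```shell
# List the query-sort (QS...) and temp-space lines from a uvconfig-style file.
# /tmp/uvconfig.sample is a stand-in for $DSHOME/uvconfig.
cat > /tmp/uvconfig.sample <<'EOF'
# query sort tunables (placeholder name below)
QSEXAMPLE 4
UVTEMP /tmp
EOF
grep -E '^(QS|UVTEMP)' /tmp/uvconfig.sample
```

Heed Ray's warning: change the real values only with your support provider's guidance, and re-run uvregen afterwards per the comments in the file.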
IBM Software Services Group
Any contribution to this forum is my own opinion and does not necessarily reflect any position that IBM may hold.

Post by dufflbag »

ray.wurlod wrote:It may also be beneficial to increase scratch space, etc. in uvconfig.


It turned out that the UVTEMP parameter in uvconfig points to a directory
that is "only" 500 MB. Our support provider recommends we use a directory
with much more space available.
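A quick sanity check before rerunning: make sure the filesystem behind UVTEMP has comfortably more free space than the hashed file being sorted (a sketch; /tmp stands in for whatever path your uvconfig's UVTEMP line names):

```shell
# Show free space on the filesystem holding the UVTEMP directory.
UVTEMP_DIR=/tmp     # replace with the path from your uvconfig's UVTEMP line
df -k "$UVTEMP_DIR"
```

With a 500 MB UVTEMP and a ~1.3 GB hashed file, the sort work files could not possibly fit, which matches the symptoms above.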

Post by kduke »

If you are using UV stages, use single quotes. If you are using hashed files, use a WITH clause with double quotes around values. A WHERE clause is used with UV stages only.
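To illustrate the quoting convention (a sketch reusing field names from the job above; the table name and literal values are made up):

```
* UV/ODBC stage, user-defined SQL -- single quotes around literals:
    SELECT EMPSSN FROM ELIG WHERE GENDER = 'F'

* Hashed file stage, Selection tab -- WITH clause, double quotes around values:
    @ID <> '' WITH GENDER = "F"
```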
Mamu Kim