
"Data source is empty" viewing large hashed file w

Posted: Mon May 16, 2005 3:15 pm
by dufflbag
We have a job that creates a dynamic hashed file that is used as the input to a transform stage.
We use an after-stage ExecTCL DELETE.FILE command in the job's ultimate transform to clean up the hashed file.
The hashed file's output link contains the following SELECT clauses in the Selection tab:

@ID <> '' BY EMPSSN BY PATDOB BY GENDER BY varEligStartDt

We sort this way because it has consistently proven faster than using a Sort stage.



Here is the problem.

Under certain circumstances, the hashed file gets loaded but then passes no records to the next transform.
The job does not abort. It goes about its business processing zero records and finishes OK.
Even pressing the View Data button results in "Data source is empty," after a delay of four to five minutes.
(Note that I had to disable the after-stage DELETE.FILE command to get the hashed file to stick around.)

We never encountered this problem until last week. However, we're not sure we've ever passed quite so many
records to the hashed file before. The most recent file to fail wrote 7,292,355 records to the hashed file.
Smaller source files still consistently work as expected.


Here is a look at the hashed file's directory for the run with 7,292,355 records:

$ls -l t_DACV2_cl1659_f23766_Elig_Hash
total 2648320
-rw-rw-r-- 1 userid secgroup 1111371776 May 14 00:11 DATA.30
-rw-rw-r-- 1 userid secgroup 244328448 May 14 00:11 OVER.30
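
One factor worth ruling out at these sizes is the 2 GB per-file limit on 32-bit hashed files (the limit applies only if the file was created 32BIT, which is the default). A quick back-of-the-envelope check of the sizes above, sketched in Python:

```python
# Sizes taken from the ls -l output above, in bytes
data = 1_111_371_776   # DATA.30
over = 244_328_448     # OVER.30
total = data + over

# A 32-bit hashed file caps each of DATA.30 and OVER.30 at 2 GB
limit = 2 * 1024**3

print(total)                    # 1355700224 bytes combined (~1.26 GB)
print(round(data / limit, 2))   # 0.52 -> DATA.30 is at roughly half its cap
```

So the file itself is under the 32-bit limit here, but the sort workspace needed to order 7.3 million records is a separate question.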


If I remove the selection criteria, the data browser displays data.
Even after adding back some of the criteria, the data browser still displays data.
I have not tried many permutations of the criteria, but I have successfully
viewed the hashed file data using each of the sort fields individually in a shortened BY clause.
Each of the following SELECT clauses works:

@ID <> ''
@ID <> '' BY EMPSSN
@ID <> '' BY PATDOB
@ID <> '' BY GENDER
@ID <> '' BY varEligStartDt

but

@ID <> '' BY EMPSSN BY PATDOB BY GENDER BY varEligStartDt

fails again.



What are the factors that limit DataStage's ability to handle a large amount of data in this way?
Temp space? Memory? The same job processing the same input file can fail on one server and then
succeed on another.


Due to deadline pressure, we ended up processing the eight required files by moving from server to server.
We didn't plan it this way, but each time the job stopped working on one server, we'd try to fix it,
eventually give up, and then run the original job on a different server.


Ran it on new production server:
succeeded on file #1, failed on file #2

Tested it on development server:
succeeded on file #2, so continued processing files #2 - #7, failed on file #8

Ran it on other production server:
succeeded on file #8


The fact that the same job failed on one server and then worked on another
is what makes me want to know which environmental factors or tunables are affecting it.
If there is something tunable, that would be helpful to know. If we simply need to redesign
the job to handle the larger data feeds, that would be helpful to know, too.

The job looks like this:

Code:

  s    x    h
  o    f -> a
  u -> o    s -> xform -> target
  r    r -> h
  c    m    |
  e    | -> |

Source contains quarterly records out of which we create three monthly records.


We ensured that no NULLs were being sent to any of the sort fields.
We even tried "landing" the hashed file to a sequential file on the way to the transform stage.

Any insight you can provide would be appreciated.


DataStage 7.1.r1
AIX 5.2

If you'd like more information, I'll be glad to provide it.
Please let me know.


Doug Carpenter

Posted: Mon May 16, 2005 6:26 pm
by ray.wurlod
Welcome aboard! :D

Your analysis is sound. However, there is no documented limit on the complexity of sorting. Get in touch with your support provider: the query sort engine can be tuned, but it's a delicate process. (You alter the parameters whose names begin with QS... in uvconfig; read the dire warnings posted there first.) It may also be beneficial to increase scratch space and the like in uvconfig.
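
To illustrate where those settings live, here's a sketch using a mocked-up uvconfig fragment so the commands are self-contained; on a real engine you would run the greps against uvconfig in the engine directory ($DSHOME). The parameter values shown are placeholders, not recommendations:

```shell
# Mock up a uvconfig fragment for illustration only; substitute
# $DSHOME/uvconfig on a real DataStage/UniVerse install.
cat > /tmp/uvconfig.sample <<'EOF'
QSRUNSZ 2000
QSMXKEY 32
UVTEMP /tmp
EOF

grep '^QS' /tmp/uvconfig.sample       # query-sort tunables
grep '^UVTEMP' /tmp/uvconfig.sample   # scratch directory used by sorts
```

Remember that edits to uvconfig only take effect after regenerating the binary configuration (uvregen) and restarting the engine.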

Posted: Tue Jul 26, 2005 10:40 am
by dufflbag
ray.wurlod wrote:It may also be beneficial to increase scratch space, etc. in uvconfig.


It turned out that the UVTEMP parameter in uvconfig points to a directory
that is "only" 500 MB. Our support provider recommends we use a directory
with much more space available.
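
For anyone hitting the same symptom: a quick way to see how much room the scratch directory actually has, using /tmp here as a stand-in for whatever UVTEMP points to in your uvconfig:

```shell
# Substitute the actual UVTEMP value from uvconfig for /tmp
UVTEMP_DIR=/tmp
df -k "$UVTEMP_DIR"
```

If the available space is smaller than the data being sorted, the sort can quietly come up empty rather than aborting the job, which matches what we saw.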

Posted: Tue Jul 26, 2005 12:23 pm
by kduke
If you are using UV stages, then use single quotes. If you are using hashed files, then use a WITH clause with double quotes around values. A WHERE clause is used with UV stages only.
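
For example (my understanding of the convention; the clause text is illustrative, not taken from the job above):

```
* UV/ODBC stage (SQL syntax) - WHERE clause, single-quoted values:
    WHERE GENDER = 'M'

* Hashed File stage Selection tab (RetrieVe syntax) - WITH clause, double-quoted values:
    WITH GENDER = "M"
```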