DataStage Server Job Concepts

jerome_rajan · Post by **jerome_rajan** » Tue Apr 15, 2014 10:18 pm

Hi,
I am currently reviewing some decade-old server jobs. Having worked only with parallel jobs until now, I'm running into a few conceptual roadblocks. Please help me understand the following:

1. There's a job that uses a UniVerse table as the reference for a lookup. Why would that be, and why not a hashed file?
2. What exactly are UniVerse tables? I thought they were DataStage internal tables, but these jobs are creating tables in the uv database.
3. I see that there is no 'JOIN' stage in a server job. What, then, would be the ideal approach to join two voluminous datasets?
4. A Data Set is a parallel concept. What comes closest to it in a server job? What are the advantages of using a hashed file as an intermediate data store over a sequential file?

ray.wurlod · Post by **ray.wurlod** » Tue Apr 15, 2014 10:42 pm

1. Perhaps they wanted to use some additional filtering, via user-defined SQL?

2. UniVerse is a database product originally created by VMARK Software and on which DataStage was originally built. All UniVerse tables are hashed files, but also have "system table" entries that describe them, and the privileges that have been granted on them. Not all hashed files are UniVerse tables.

3. If they are in, or accessible to, the same database server, do it there. If they are text files, use the Merge stage. Otherwise use a Transformer stage to effect a lookup (which, by default, is a left outer join, but you can use the NOTFOUND link variable to constrain it back to an inner join).
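To make the lookup semantics concrete, here is a minimal Python sketch (not DataStage code). It only mimics the behaviour described above: every stream row is passed through with a keyed lookup, and a "not found" flag can be used to drop unmatched rows. The names `orders`, `customers` and `lookup` are hypothetical, and the in-memory dict merely stands in for a hashed-file reference.

```python
# Hypothetical stand-ins: a stream input and a keyed reference (like a hashed file).
orders = [
    {"cust_id": 1, "amount": 50},
    {"cust_id": 2, "amount": 75},
    {"cust_id": 3, "amount": 20},
]
customers = {1: "Alice", 3: "Carol"}  # key -> reference value


def lookup(stream, reference, inner=False):
    """Do a key-based lookup per stream row; optionally drop unmatched rows."""
    out = []
    for row in stream:
        name = reference.get(row["cust_id"])
        not_found = name is None  # plays the role of the NOTFOUND link variable
        if inner and not_found:
            continue  # like a constraint of Not(NOTFOUND): inner join
        out.append({**row, "name": name})
    return out


left_outer = lookup(orders, customers)              # all 3 rows, one with name None
inner_join = lookup(orders, customers, inner=True)  # only the 2 matched rows
```

Without the constraint, the unmatched row survives with an empty reference column, which is the left outer join behaviour; with it, only matched rows reach the output.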

4. There is nothing resembling a Data Set in the server job world. Actually, that's not quite true: there is something in UniVerse called a "distributed" hashed file, in which the component hashed files are described by a descriptor. But you will not find these documented in the DataStage documentation.

4. Hashed files are probably not useful as intermediate storage if you have duplicate key values, since hashed files destructively overwrite when the key is the same. Also, writing to and reading from sequential files is much faster than streaming data into/out of hashed files. Hashed files are intended for key-based access, and are VERY FAST at that.
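The destructive-overwrite point can be illustrated with a small Python sketch (again, not DataStage code). A dict is used here only as an analogy for key-based storage, and a list as an analogy for a sequential file; the sample `rows` are made up.

```python
# Made-up input with a duplicate key ("K1" appears twice).
rows = [("K1", "first"), ("K2", "only"), ("K1", "second")]

# Key-based store: writing an existing key replaces the earlier record.
hashed_file = {}
for key, value in rows:
    hashed_file[key] = value

# Sequential store: every row is appended, duplicates included.
sequential_file = []
for key, value in rows:
    sequential_file.append((key, value))
```

After this runs, the keyed store holds two records and only the last value written for `K1`, while the sequential store still holds all three rows. That is why duplicate keys make a hashed file a poor choice for intermediate storage.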