Tech Tips

DataStage Enterprise Edition (PX) Tip for Beginners - Lookup Types
By Vincent McBurney

Parallel DataStage jobs can have many sources of reference data for lookups including database tables, sequential files or native datasets. Which is the most efficient?

This question has popped up several times over on the DSXchange. In DataStage Server jobs the answer is quite simple: local hashed files are the fastest method for a key-based lookup, as long as the time taken to build the hashed file does not wipe out the benefit of using it.

In a parallel job a much wider variety of stages can be used as a lookup source than in server jobs, including most data sources and the parallel staging formats: datasets and lookup filesets. I have discounted database lookups, as the overhead of database connectivity and any network hop makes them slower than most local storage.

I ran a test comparing datasets, sequential files and lookup filesets, increasing row volumes to see how each responded. The test used three jobs, each with a sequential file input stage and a reference stage writing to a copy stage.

Small lookups
I set the input and lookup volumes to 1000 rows. All three jobs processed in 17 or 18 seconds. No lookup tables were created apart from the existing one for the lookup fileset, which indicates the lookup data fit into memory and did not overflow to a resource file.
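The principle behind the in-memory case can be sketched outside DataStage (a hypothetical Python illustration, not DataStage code, with made-up function names): once the reference rows fit in a hash table in memory, each input row costs a single probe, so it barely matters where the reference data was read from.

```python
# Hypothetical sketch of an in-memory key-based lookup:
# build a hash table from the reference rows once, then probe it per input row.

def build_lookup(reference_rows):
    """Map lookup key -> description; stands in for a small reference set held in memory."""
    return {key: desc for key, desc in reference_rows}

def lookup_join(input_rows, table, default=None):
    """For each input row, probe the hash table once (O(1) per row)."""
    return [(key, value, table.get(key, default)) for key, value in input_rows]

reference = [(1, "Active"), (2, "Closed")]
inputs = [(1, "order-A"), (3, "order-B")]
table = build_lookup(reference)
print(lookup_join(inputs, table))  # unmatched keys fall back to the default
```

Building the table is a one-off cost; after that the per-row work is constant, which is why all three small-lookup jobs ran in roughly the same time.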

1 Million Row Test
The lookup dataset took 35 seconds, the lookup fileset took 18 seconds and the lookup sequential file took 35 seconds even though it had to partition the data. I assume this is because the input also had to be partitioned and this was the bottleneck in the job.

2 Million Rows
I'm starting to see some big differences now. The lookup fileset, at 45 seconds, is less than three times the time of the 1000-row test. The dataset is up to 1:17 and the sequential file up to 1:32. The cost of partitioning the lookup data is really showing now.

3 Million Rows
The fileset is still at 45 seconds and swallowed up the extra 1 million rows with ease. The dataset is now up to 2:06 and the sequential file is up to 2:20.

As a final test I replaced the lookup stage with a join stage and tested the dataset and sequential file reference links. The dataset join finished in 1:02 and the sequential file join finished in 1:15. A large join proved faster than a large lookup but not as fast as a lookup fileset.
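Why the join scales better than a spilling lookup can be sketched with a sort-merge join, the technique a parallel join stage relies on once both links are sorted and partitioned on the join key (a hypothetical Python illustration, unique keys only, not DataStage code):

```python
# Hypothetical sketch of a sort-merge inner join: one forward pass over each
# side, so memory use stays flat no matter how large the inputs grow.

def merge_join(left, right):
    """Inner-join two lists of (key, value) pairs, each already sorted by key."""
    out, i, j = [], 0, 0
    while i < len(left) and j < len(right):
        lk, rk = left[i][0], right[j][0]
        if lk < rk:
            i += 1          # left key has no partner yet; advance left
        elif lk > rk:
            j += 1          # right key has no partner yet; advance right
        else:
            out.append((lk, left[i][1], right[j][1]))
            i += 1
            j += 1
    return out

print(merge_join([(1, "a"), (2, "b"), (4, "d")],
                 [(2, "x"), (3, "y"), (4, "z")]))
```

This sketch assumes unique keys on both sides; duplicate keys need extra handling. The trade-off matches what I saw: the join pays a sort cost up front but never needs to hold a reference table in memory or spill it to resource files.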

Conclusion
If your lookup is small enough to fit into memory then the source is irrelevant: they all load up very quickly, and even database lookups are fast. If you have very large lookup files spilling into lookup table resources then the lookup fileset outstrips the other options. A join also becomes a viable option. Joins are a bit harder to design, as you can only join one source at a time, whereas a single lookup stage can reference multiple sources.

I usually go with lookups for code-to-description or code-to-key references regardless of the size. I reserve joins for references that bring back lots of columns. I will certainly be making more use of the lookup fileset to get more performance from jobs.


Sparse database lookups, which I didn't test, are an option if you have a very large reference table and a small number of input rows.
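The idea behind a sparse lookup can be sketched with SQLite standing in for the reference database (a hypothetical illustration, not DataStage code): instead of loading the whole reference table, fire one indexed query per input row, which wins when the input stream is small and the table is huge.

```python
import sqlite3

# Hypothetical sparse-lookup sketch: query the reference table once per input
# row rather than loading it all, with SQLite standing in for the database.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE ref (code INTEGER PRIMARY KEY, descr TEXT)")
conn.executemany("INSERT INTO ref VALUES (?, ?)",
                 [(i, f"desc-{i}") for i in range(100000)])  # large reference table

input_rows = [42, 99999, 123456]  # small input stream
results = []
for code in input_rows:
    # One indexed query per row; the primary key makes each probe cheap.
    row = conn.execute("SELECT descr FROM ref WHERE code = ?", (code,)).fetchone()
    results.append((code, row[0] if row else None))

print(results)
```

Three indexed queries against a 100,000-row table beat reading 100,000 rows into memory; flip the row counts around and the per-row query overhead would dominate, which is why this only suits a small input stream.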


About this Author: Vincent has been fiddling around with data for over ten years now and has moved from the hack and slash of manual SQL commands to the smooth ride of high-end ETL and data quality tools. He has worked in Europe and the United States but has settled back into his birthplace of Melbourne, Australia with a family, a house on an acre and a horribly slow dial-up connection. When he's not answering numerous questions on the DSXchange he's probably updating his blog "An Expert's Guide to WebSphere Information Integration" at http://blogs.ittoolbox.com/bi/websphere/.


Copyright © 2006 DCS-Strategies