|
||||||||||||||
| Tech Tips |
||||||||||||||
|
As part of our on-going efforts to improve the services we provide at the DSXchange, we are adding a new "Tech Tips" column that highlights some of the wealth of technical "How To's" that are contained on our website. Our first tip of the month is taken directly from the most popular post (2600+ views!) on our site, a thread discussing the ins and outs of hashed files. Much of the interest can be contributed to the following detailed information provided by one of our top posters, kcbland (Ken Bland). If you'd like to view the original thread you can click on: http://www.dsxchange.com/viewtopic.php?t=85364
Hash Files in DataStage
What Exactly Are Hash Files?
The beauty of the dynamic hash file, as compared to the static hash file, is that it will grow on demand. At a predetermined threshold, the file will automatically double its size and redistribute the data within the file. This removes the data in the overflow file and brings it back within the data file. The dynamic hash file defaults to an initial sizing value of 1. This means that the file will constantly resize by doubling itself to meet the incoming dataset. Clearing the hash file will shrink it back to the initial sizing. To find a row in a hash file, you must know the key that was used to deposit the row somewhere within the file. The data is stored as a continuous string of characters, with a column separator (@AM system variable). Each column is positional in nature and relative to the columns preceding it. It should become obvious that a hash file is almost the same as a sequential file, meaning columns are positional and referenced by position, not name. In fact, column names are irrelevant, you only need to know the key construct and what data is located in what positional column. Because data is stored as a text string, there is no explicit data typing enforced within a hash file. The only rule in a hash file is to not use restricted ASCII characters within your data, as they are internal data separators. Furthermore, each row within a hash file is allowed to have it's own structure, because again data is stored as a text string. Therefore, hash files are non-structured free-form scratch files! How Are They Meant To Be Used by DataStageŽ? Hash files should be limited to three main uses: 1. Small reference lookups by primary key 2. Small staging files usually when a reference capability has to exist 3. Non-persistent scratchpad work files. Small reference lookups by primary key - DataStageŽ has the ability to cache lookups in memory, as long as they are hash files. Lookups are not joins, they are specific searches in a hash file for a row using the key value for that row. Since you will give the key value to the hash file row, the algorithm quickly converts the key to the location of the row within the hash file and returns the contents at that address. This type of activity using ODBC/OCI stages is unacceptable for performance reasons because of the lack of memory caching capability. In addition, as more instantiated jobs perform the same lookup, the database will become saturated with lookup queries. Small staging files usually when a reference capability has to exist - A transformation job may prepare data that will be needed for reference in another job. This is accomplished quite easily by staging the data into a hash file. In order to make this data available for reference without a hash file would require either loading the reference into a target table or some sort of ETL scratch table. Non-persistent scratchpad work files - Hash files are wonderful mechanisms for staging data if it is needed for a reference lookup. The best choice for data that is not needed for a reference lookup is a sequential file, because of the appending nature of writing to a sequential file. Hash files have the overhead of figuring out where the row should go, checking the space available within the hash file, and then either writing it to the data section or the overflow. How Can A Hash File Be Abused in a DataStageŽ Design?
Tricks, Features, and Unintended (but Cool) Results
You can see that the DATA.30 file has an initial size of 4096 and the OVER.30 file has a size of 2048. Now run your job again, this time for all of your rows to see what happens to the hash file.
The data file has now grown by doubling in size again and again to meet the incoming data size. The job ended with data in the over flow file. This hash file is not tuned as an optimal reference lookup, as a referenced row must be checked in the data file and then potential checked in the less efficient overflow file. At this point we could do some research and compute a modulus for this hash file. However, through a few rounds of experimentation we can arrive at a sufficiently sized file. In your Designer client set the minimum modulus upwards. Since modulus is computed using average bytes per row and number of rows, each hash file will be tuned according to the nature of the data that will go into it. Experiment re-running the job increasing the modulus until running the job produces all data in the data file and the over flow is back to the original 2048 bytes. Watch the file as the job runs, you should see it created at the initial size, and then after so many rows the job monitor will show the job stalling on rows/sec as the hash file dynamically grows and then pick up again on rows/sec.
Keep in mind that the minimum modulus means that this hash file will always start at this size, and grow by doubling. The only way to reclaim this disk space is to remove the hash file, clearing it will only return it to its minimum modulus. Externally pathed hash files Were DataStageŽ to always require a directory path for hash files, many people would have never encountered the frustrating issues relating to the legacy hash file technology from the prior versions of the engine. DataStageŽ by default will put hash files into the project. When it does this, it puts something similar to a database synonym into an internal project table (the dreaded VOC). If you erased the hash file in at the OS level using the "rm" command, you'll find that you will never be able to recreate a hash file under this name again. This is because that synonym is still in the VOC. Using the DELETE.FILE command in the DSEngine shell will fail because the physical file parts are missing, and therefore it doesn't remove the VOC entry. You have to manually remove the VOC entry by typing "DELETE VOC hashfilename". The way to avoid this is to use externally pathed hash files. The DSEngine will not create a VOC entry for your hash file. By directly qualifying the location of the hash file, you can put hash files anywhere you want. Having work files created, modified, and deleted in the sensitive area as the project directory is playing with fire. Not only does this separate the critical pieces of the ETL tool from the scratch work files, it helps you manage the hash files by exposing them for tuning and for easy removal. Non-structured data A hash file has no internal form other than the key structure. Since the key is the address to where a row of data resides, it has the only absolute form and structure. For example, a hash file for holding customer information has a primary key of the customer number. However, there are different types of customer records with different columns and definitions. For each record type, you could have a distinct hash file. Or, you could add record type as part of the primary key and have one hash file. You could create many different column definitions for each customer record type, but use one hash file to hold all of the data. I don't however recommend this type of data model. De-duplicating data One and only one row can exist in a hash file with a given key. When writing to a hash file, DataStageŽ will not check to see if a row is already in the hash file under that key. It will write or overwrite new data into that location without blinking an eye. It is incumbent upon a developer to check if a row exists via a reference lookup and then decide how to handle the write. There are times when sorted or ordered data has to be de-duplicated about the primary key. In this situation, you may wish to only keep the last occurrence of the primary key. Writing to a hash file rows with duplicate primary keys will leave you with the last occurrence of that primary key. Sorted going in, randomized coming out If you write sorted data to a hash file, it will not be loaded quicker into the hash file. This is because the hashing algorithm has a randomizing effect as to where the data is located within the hash file. You can see this example by writing sorted data into a hash file, then using a job to scan the hash file and write it out to a sequential text file. There is no easy way to order data coming out of an external hash file. You could use the UV/ODBC (Universe on release 6) stage to order the data, because this stage treats the hash file like a relational data source and gives you SQL command capability, at a maintenance and performance cost. Your best way to sort hashed data is to sort it using a third-party sort, or even the Unix sort command, after extracting the data out of the hash file. Handy Tips Make sure your hash file has an identified primary key column, in your case PRODUCT_CODE sounds like it should be set to YES. Make sure you SAVE YOUR HASHFILE METADATA TO THE LIBRARY. In another job that has to lookup against this hash file, LOAD THAT SAVED METADATA into your hash file reference stage. Press view data, you should see your data displayed. Whenever you do reference lookups, JUST AS IN SQL , you must make sure you have trimmed data if there's a chance of leading or trailing spaces screwing up the equivalency in the reference. If you are comparing numbers, you have to make sure you don't getting screwed up by leading or trailing zeroes (0.00 is NOT EQUAL to 0, 0123 is NOT EQUAL to 123). There is no datatyping, it is all string comparison in DataStage. Verify that you are not gettting messed up by this. An easy way is to VIEW DATA and then double-click on the displayed field contents and look for highlighted spaces leading or trailing. Ken Bland is President of Kenneth Bland Associates, a company focused on delivering quality consulting for the data integration and data warehousing markets. You can read more about KBA and the services they provide at www.kennethbland.com.
|