
Dynamic hashed files and Static hashed files - pros and cons

Posted: Sat Oct 22, 2005 5:26 am
by ArndW
I have been doing some testing on the relative performance of hashed files at my current site, but I believe this applies to many other sites, and I wanted to open a discussion on the subject of dynamic vs. static hashed files.

In the configuration here (multi-CPU AIX, 1000+ daily jobs) there are many hashed files in use - including some very large files used for changed-data detection. Many of these files are cleared and reloaded daily {the necessity of this is a different subject}. I took some typical hashed files, which are re-created daily as Dynamic, GENERAL, with a minimum modulus approximately correct for the ultimate file size, and measured job performance (runtimes, CPU, I/O per second, etc.). I then computed the optimal static file types and sizes plus 10%, reconfigured the files, and re-ran the processes.
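The sizing arithmetic behind "optimal static file sizes +10%" can be sketched. This is a minimal sketch assuming the common rule of thumb that a group holds separation x 512 bytes and that you aim for roughly 80% fill; the per-row overhead figure and the preference for a prime modulus are assumptions for illustration, not UniVerse documentation:

```python
import math

def next_prime(n):
    """Smallest prime >= n (trial division is fine at modulus scale)."""
    def is_prime(k):
        if k < 2:
            return False
        return all(k % d for d in range(2, math.isqrt(k) + 1))
    while not is_prime(n):
        n += 1
    return n

def static_modulus(row_count, avg_row_bytes, separation=4,
                   fill=0.8, headroom=1.10, row_overhead=12):
    """Estimate a modulus for a static hashed file.

    separation is in 512-byte disk blocks, so each group holds
    separation * 512 bytes; fill is the target group fill factor and
    headroom is the +10% from the post. row_overhead (per-row header
    bytes) is an assumed figure, not a documented UniVerse constant.
    """
    group_bytes = separation * 512
    data_bytes = row_count * (avg_row_bytes + row_overhead) * headroom
    raw = math.ceil(data_bytes / (group_bytes * fill))
    # A prime modulus tends to spread keys more evenly for most hash types.
    return next_prime(raw)

# e.g. a 2-million-row lookup averaging 120 bytes per row:
print(static_modulus(2_000_000, 120))
```

The real exercise also involves choosing the hashing type to match the key pattern; the arithmetic above only covers the space side.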

These jobs run over 30 minutes on average, some over an hour. Without exception the static hashed files produced lower runtimes - improvements ranging from 5% to 20% - and less system load, depending upon the file type and size, whether the file is loaded to memory or not, the read/write ratio, and read frequency.

None of this should be news to most of us who have used hashed files, but my quandary is whether it is worth using static hashed files instead of dynamic files - and where to draw the line when deciding to go one way or the other.

The downside of static hashed files is that the performance of a badly hashed static file is so miserable that, if the file is not properly maintained, it usually outweighs any benefit.

At this location the decision is clear, as there are hundreds of hashed files in use and the system I/O is currently the bottleneck. Thus changing to static hashed files and implementing an automated method of monitoring the files to ensure that they never get badly hashed will result in better performance.
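A monitoring harness of the kind described might look like this sketch. It assumes you can obtain per-group byte counts (e.g. by parsing ANALYZE.FILE or FILE.STAT output - that parsing is omitted here), and the 5% overflow threshold is an invented site policy, not a standard:

```python
def needs_resize(group_bytes_used, group_capacity, max_overflow_frac=0.05):
    """Decide whether a static hashed file should be resized.

    group_bytes_used: bytes stored per group, e.g. scraped from
    ANALYZE.FILE or FILE.STAT output (the parsing is omitted here).
    A group holding more than its capacity has spilled into overflow
    buffers; the 5% threshold is an invented site policy.
    Returns (flag, overflowed_fraction).
    """
    overflowed = sum(1 for b in group_bytes_used if b > group_capacity)
    frac = overflowed / len(group_bytes_used)
    return frac > max_overflow_frac, frac

# A well-hashed file: every group comfortably under its 2 KB capacity.
print(needs_resize([1500, 1600, 1400, 1550], group_capacity=2048))
# A badly hashed file: most groups overflowing.
print(needs_resize([3000, 2500, 1000, 2200], group_capacity=2048))
```

A flagged file would then be scheduled for a RESIZE during a maintenance window.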

I also think that the UniVerse internal file units for dynamic files are not shared across users, whilst those of non-type-30 files definitely are. Thus, by going to static hashed files, the MFILES parameter needs to be raised while T30FILES doesn't need to be set as high.

But what about sites that do not rely as heavily on hashed files? I would normally unequivocally state "stick with dynamic files", but I am very interested in opinions (for and against) as well as any other observations.

Posted: Sat Oct 22, 2005 10:38 am
by kduke
If I plan on being at a site for a while then static is fine; otherwise I do not trust the developers to know how to resize a file, or even to know when it has gone bad. Few people even monitor statistics such as rows per second or anything of that nature. Take a poll and you will see how few have statistical data on their ETL. All they look at is overall run times.

Playing with hashed file types when you do not own FAST is even worse, in my opinion. After you leave they will not know why these processes keep getting slower and slower. They will blame DataStage. That's not fair when they do not know how DataStage works well enough to figure out this issue.

I would say very few sites use hashed file caching or IPC stages - huge performance gains are available with these. Most of our lookups are joins in the source queries at this site. I haven't been here long enough to see whether that is good or bad.

Posted: Sat Oct 22, 2005 4:43 pm
by ray.wurlod
For larger volumes of data a well-tuned static hashed file will out-perform a dynamic hashed file during the load phase. A dynamic hashed file where the minimum modulus has been appropriately set will, however, get close to the load performance of the well-tuned static hashed file.

It's really just time shifting - do I take the hits of extending the structure when creating the hashed file, or while it's loading?

It would be nice if there were a version of FAST (hashed file tuning tool) available for DataStage hashed files. [Chances are that FAST for UniVerse would work OK, provided that DataStage's different magic number didn't interfere.] This tool not only tunes hashed files very accurately, but can also schedule the RESIZE operations.

Of course, in an environment where the hashed files are cleared daily, the "correct" size can only ever be theoretical until the new data are in. But it must be better to retain the shape (static hashed or minimum modulus) rather than to split down to modulus 1 then grow again!
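The cost of splitting down to modulus 1 and growing again - the "time shifting" above - can be illustrated with a toy simulation of linear hashing. The group size and split threshold here are illustrative defaults, not the actual type-30 parameters:

```python
def count_splits(n_records, avg_row_bytes, min_modulus=1,
                 group_bytes=2048, split_load=0.8):
    """Count group splits while loading a simulated dynamic hashed file.

    As in linear hashing, whenever total data exceeds split_load of the
    total group capacity, one group is split and the modulus grows by 1.
    The group size and split threshold are illustrative defaults, not
    the actual type-30 parameters.
    """
    modulus, splits, data = min_modulus, 0, 0
    for _ in range(n_records):
        data += avg_row_bytes
        while data > split_load * modulus * group_bytes:
            modulus += 1
            splits += 1
    return splits

# Loading 100,000 rows of ~100 bytes:
print(count_splits(100_000, 100, min_modulus=1))     # pays for growth record by record
print(count_splits(100_000, 100, min_modulus=6200))  # pre-sized: no splits at all
```

Clearing a pre-sized file and reloading it repeats none of those splits, which is exactly the point about retaining the shape.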

On the question of file units, the first observation is that static hashed files require just one, while dynamic hashed files require two. Have you contemplated using public shared caching? This will reduce total file units system-wide, though you do need to make some adjustments in uvconfig to allow it.

I believe that the main awareness needed is to keep hashed files small - not to load unnecessary columns and not to load unnecessary rows into them. Once that principle has been established, then comes the time to worry about static versus dynamic.

In short: a well-tuned static hashed file will always perform better during loading than the equivalent dynamic hashed file (if only because there are code paths in the hashing algorithms that can be avoided). In retrieval each should be about the same - ideally exactly 1.0 logical I/O operations per key.

On the down side, getting a static hashed file to be well-tuned is a bit of a black art and, as Arnd noted, badly-tuned static hashed files perform abysmally. For most sites, therefore, I'd agree with Kim; stick with dynamic hashed files and their "automatic table space management", perhaps implementing minimum modulus settings to pre-allocate and to preserve disk space.

Posted: Sun Oct 23, 2005 3:37 am
by ArndW
Ray,

A couple of comments and observations, coupled with some questions, on hashed files:

- I think that when the same dynamic hashed file is opened by several processes concurrently, each gets its own entry from the T30FILES pool, while static hashed files are shared from the MFILES rotating pool. Although this doesn't normally make a difference, on a very busy system it is possible that a static hashed file's unit might be closed and re-opened by the system (less likely with several users on the same file, but still possible) while the dynamic files remain open. This can reduce the effective performance of static files while the dynamic files continue to perform as expected.

- The default hashing algorithm in dynamic files is GENERAL, which needs to traverse the whole key string in order to build up its hash number. If a file's keys differ mostly in the rightmost bytes, then by using a file type of 2 one reduces both the number of bytes touched to compute the hash and the complexity of the computation. Normally the CPU cost of reading and writing files is so secondary to the I/O that it is negligible, but I have some files here containing 120 million rows where even the write is noticeably faster.
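The CPU-cost point can be illustrated conceptually. Both hash functions below are stand-ins (the real GENERAL and type-2 algorithms are UniVerse internals and are not reproduced here); the sketch only shows that a tail-only hash can still spread such keys across groups while touching far fewer bytes per key:

```python
def hash_full(key, modulus):
    """Polynomial hash over every byte of the key - a stand-in for a
    full-key algorithm like GENERAL, not the real UniVerse code."""
    h = 0
    for ch in key:                      # touches len(key) bytes
        h = (h * 31 + ord(ch)) % modulus
    return h

def hash_tail(key, modulus, nbytes=4):
    """Hash only the rightmost bytes - the idea behind choosing a file
    type suited to keys that differ mostly at the end."""
    h = 0
    for ch in key[-nbytes:]:            # touches at most nbytes bytes
        h = (h * 31 + ord(ch)) % modulus
    return h

# Keys with a long constant prefix and a varying numeric tail:
keys = [f"CUST-2005-{i:08d}" for i in range(10_000)]
modulus = 101
print(len({hash_full(k, modulus) for k in keys}),
      len({hash_tail(k, modulus) for k in keys}))
```

With keys like these, both variants occupy the available groups, but the tail hash reads 4 bytes per key instead of 18.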


Maintaining static hashed files is certainly more of an art than a science, but there are cases where the effort necessary to create a "safe" environment for these files - to make sure they don't overflow badly - is worth it.

Posted: Sun Oct 23, 2005 8:59 am
by kcbland
IMO, it's even harder to teach the "art" of hash files than the "science". Without a doubt, every single DS engagement I've been on over the last 7+ years has been impacted by the ETL developers' ability to understand the underlying concepts of data movement. The added complexity of hash file tuning has been beyond their care and concern. If I came away from an engagement with the ETL team at least setting minimum modulus on dynamic hash files, then I was successful. The desire to have the tool do too much often required a rule-of-thumb approach for most aspects of the ETL architecture.

I don't know why management at companies think this way. They spend $500K on a tool, $500K on a database, $1M on a server cluster, and then expect any off-the-street developer to be able to write top-shelf code. It's kind of like saying if we build the most intelligent Ferrari, then any driver off the street can drive it to the utmost extreme. I don't get it. So many customers have this attitude that anyone can do development, because the tool was supposed to make it easy.

So, to relate back to hash files, my goal is just to get them to use dynamic hash files, no maintenance, and during development to size the minimum modulus just a little bit higher than the average expected high-watermark. They should then be fine for every load cycle that is smaller in average row size and average row count, and when the row count is above average they have a little more wiggle room. The initial hit of creating the hash files is accommodated by the fact that I try to create all hash files during jobstream startup, thus masking that overhead/delay during other processing and making the hash files available once they're needed for transformation.

For instance, during the acquisition phase of getting data from source databases, you can have a job mass-creating the hash files. This also allows any 64BIT files to be created at that time, since the Designer job doesn't have that feature.

Sometimes the natives feel better knowing it "happens by magic", until they become familiar enough with the jobstream to start learning more about the hash files underneath. I've found that at least a year is required for average developers to even begin to put together in their heads what's going on.

Posted: Sun Oct 23, 2005 3:12 pm
by ray.wurlod
Arnd,

T30FILE does not govern the issue of file units; this is managed purely from the rotating file pool for all file types.

What T30FILE does is to specify the size of a table in shared memory (visible via smat -d) in which the current sizing parameters of each dynamic hashed file are kept, so that they are immediately visible to all users who have each such file open. There is one entry system-wide for each dynamic hashed file open, irrespective of how many processes have it open. One of the counters in the table is the number of opens, another is the number of active Selects.

Information from the T30FILE table is flushed back to the dynamic hashed file's header when the file is closed.

Access to change the table's contents is single threaded through the T30FILE semaphore.

Both the GENERAL and SEQ.NUM hashing algorithms traverse the same number of key characters; the latter gives emphasis to the characters "0" through "9". (I know that this is not the case with most static hashed file hashing algorithms, but it is true for dynamic files; I teach this from time to time in the "UniVerse Internals" class.)

Ken's post pretty much summarises the approach I take.

Posted: Sun Oct 23, 2005 3:44 pm
by ArndW
I thought that dynamic files didn't go through the normal rotating file pool because of the T30FILES parameter; obviously I was wrong. So one does need to add the number of concurrently open dynamic files when doing sizing estimates for MFILES.

It is odd how, back in the earlier days of PRIMOS and older UNIX systems, each increase in UVCONFIG could have a huge impact by reserving a couple of hundred KB that couldn't be swapped out. Now one tends to think in terms of a couple of MB or even GB, so these internal settings often do not have as much potential negative impact as they used to. I would think that the default settings for a lot of the memory-intensive attributes should be set much higher by Ascential - especially for DataStage applications.

Posted: Sun Oct 23, 2005 4:03 pm
by ray.wurlod
MFILES is best sized as large as possible, subject to the restriction that you don't exceed the UNIX limit (NFILE) and that you reserve eight file units. That is, MFILES <= NFILE - 8.

It doesn't really matter what it is you're opening.

For UniVerse, you also reserve a file unit per UV/Net connection and one for any device (e.g. tape drive) that you don't want rotated out.