HFC hashed file calculations

Post questions here relative to DataStage Server Edition for such areas as Server job design, DS Basic, Routines, Job Sequences, etc.

Moderators: chulett, rschirm, roy

clarcombe
Premium Member
Posts: 515
Joined: Wed Jun 08, 2005 9:54 am
Location: Europe

HFC hashed file calculations

Post by clarcombe »

I am performing some optimisation on some jobs and noticed that by far the best performance comes from creating the hashed file (type 2) before running the job.

Using the parameters generated by HFC.exe, in development I know how many lines will be in each hashed file. When this runs in production I won't know, so I have to "guess" the best settings for the hashed files or make the files much larger than necessary to accommodate any growth.

For example an average row size of 40 for 8.4m lines gives me
2 229693 1 32BIT

But when I run in production, this could be more or less.

Question
What is the relationship between the average row size (40), the number of rows (8.4m), and the value 229693? Is there any way I can write a routine to calculate this?

Thanks
Colin Larcombe
-------------------

Certified IBM Infosphere Datastage Developer
roy
Participant
Posts: 2598
Joined: Wed Jul 30, 2003 2:05 am
Location: Israel

Post by roy »

Hi,
Generally, the most significant performance impact comes from the number of groups built as you create the file.
The number of groups required depends on the number of records you are about to process, divided by the number of records that fit in one group (the unit the hashed file is built from). The size of each group can be 2K or 4K (group size 1 or 2).
Having done that calculation, you can set the group count properly.
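The calculation Roy describes can be sketched as a routine. This is a rough Python sketch, not HFC's actual algorithm: the 2K group size, the 12-byte per-record header overhead, and the 80% target fill factor are assumptions, so the result will only approximate the modulo HFC reports.

```python
# Rough estimate of a static hashed file's modulo (group count):
# records to store divided by the records that fit in one group,
# rounded up to the next prime. The constants below are assumptions,
# not HFC's exact values.

def next_prime(n: int) -> int:
    """Smallest prime >= n (modulo values are conventionally prime)."""
    def is_prime(k: int) -> bool:
        if k < 2:
            return False
        if k % 2 == 0:
            return k == 2
        i = 3
        while i * i <= k:
            if k % i == 0:
                return False
            i += 2
        return True
    while not is_prime(n):
        n += 1
    return n

def estimate_modulo(rows: int, avg_row_bytes: int,
                    group_bytes: int = 2048,   # assumed 2K groups
                    header_bytes: int = 12,    # assumed per-record overhead
                    fill: float = 0.8) -> int: # assumed target fill factor
    """Estimate group count so each group is ~80% full on average."""
    per_group = int((group_bytes * fill) // (avg_row_bytes + header_bytes))
    groups = -(-rows // per_group)             # ceiling division
    return next_prime(groups)

# Example: 8.4 million rows averaging 40 bytes each.
print(estimate_modulo(8_400_000, 40))
```

With different overhead and fill-factor constants the figure moves around, which is why it will not match HFC's 229693 exactly; the point is that the modulo scales linearly with rows × bytes-per-record, so a routine like this can resize the file for production volumes.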

In my experience, once you use a disk storage machine instead of local disks, there is no real benefit to using static hashed files over dynamic ones, but maybe others have a different experience.

You can get a realistic starting point from an existing working process, or make an estimate that you monitor and adjust if need be.

IHTH (I Hope This Helps),
Roy R.
Time is money but when you don't have money time is all you can afford.

Search before posting:)

Join the DataStagers team effort at:
http://www.worldcommunitygrid.org
clarcombe
Premium Member
Posts: 515
Joined: Wed Jun 08, 2005 9:54 am
Location: Europe

Post by clarcombe »

Hi Roy,

I am trying to understand what you mean by groups.

The three fields returned by the HFC are
FileType
Modulo
Separation

Is the separation what you mean by a group?

As the separation and file type will remain static, what I need to (roughly) calculate is the modulo.

How can I achieve this? What is it a function of?

As for the disk storage, we are not that advanced here (yet!); we still use local Windows disks. If you have any recommendations for alternative disk storage, I am all ears.

Using static hashed files instead of dynamic ones, I am doubling the throughput.

Thanks
Colin Larcombe
-------------------

Certified IBM Infosphere Datastage Developer
chulett
Charter Member
Posts: 43085
Joined: Tue Nov 12, 2002 4:34 pm
Location: Denver, CO

Post by chulett »

Colin - This post by our friend Ken Bland may help.
-craig

"You can never have too many knives" -- Logan Nine Fingers
ray.wurlod
Participant
Posts: 54607
Joined: Wed Oct 23, 2002 10:52 pm
Location: Sydney, Australia
Contact:

Post by ray.wurlod »

I can't recall whether the Help in HFC provides the algorithm for modulo. Replying to Colin by PM.
IBM Software Services Group
Any contribution to this forum is my own opinion and does not necessarily reflect any position that IBM may hold.
clarcombe
Premium Member
Posts: 515
Joined: Wed Jun 08, 2005 9:54 am
Location: Europe

Post by clarcombe »

It did, Craig, in as much as I saw that Ray wrote the original HFC program. So I sent him a mail asking how the calculations are arrived at. :)

I am almost there!
Colin Larcombe
-------------------

Certified IBM Infosphere Datastage Developer