Hash File: Input/Output

Post questions here relative to DataStage Server Edition for such areas as Server job design, DS Basic, Routines, Job Sequences, etc.

Moderators: chulett, rschirm, roy

timwalsh
Participant
Posts: 29
Joined: Tue Mar 04, 2003 7:48 am

Hash File: Input/Output

Post by timwalsh »

Hello All,

When using hash files, has anyone ever had the inbound (write) layout of a hash file differ from its outbound (read) layout? The same question could be asked about Sequential Files!

Lookup Table =Inbound=> HashStage =Outbound=> TransStage ....

I believe this is just an issue with DataStage's GUI; I was just wondering/hoping that there was a reason why the inbound structure and the outbound structure are saved and loaded separately in a Hash/Sequential stage.

I've been a DS user for almost five years, and it is second nature by now to ensure that hash file inbound/outbound file names and layouts are the same. However, my clients seem to perpetually have a problem with this. Why can't you just load this information once and have it automatically apply within that Hash stage, regardless of inbound or outbound?

I hope that my question/thoughts aren't too confusing or cryptic.

Cheers,

Tim
girishoak
Participant
Posts: 65
Joined: Wed Oct 29, 2003 3:54 am

Post by girishoak »

Hi Tim,

I have also faced a similar kind of problem. I had a hash file consisting of 3 columns, with the first column as the key column. I populated the hash file from a table.
When I read the same hash file, with the same data, but with a different column (say column 2 or 3) specified as the key, the data got swapped automatically. The reason I could come up with is that a key column must not contain any null values. If there is a null value in the column (as extracted from the table) that you are now specifying as the key, then the data from column 1 is swapped into column 2 just to satisfy the not-null condition on the key.

But if you have two columns (the key column specified at the time of extraction and any other column) that both contain unique, non-null values, you can swap the key column in the input as well as the output stage without trouble.

Hope this will help you. Please let me know if you have any comments.

Girish Oak
Amos.Rosmarin
Premium Member
Posts: 385
Joined: Tue Oct 07, 2003 4:55 am

Post by Amos.Rosmarin »

It will be clearer if you look at the hash stage not as an interface to a single hashed file but as a technology that lets you access one or more hashed files; therefore the input and output of the stage can each access a different hashed file.
Since hashed files are not relational database tables, you cannot switch the column positions or change the key (there are conditions where you can, but I'll leave that for now).
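
To illustrate the point with a rough analogy (this is just a Python sketch, not DataStage code, and the file and column names are made up): a hashed file behaves like a dictionary keyed on the column(s) it was written with, so declaring a different column as the key on the read side does not re-key the stored records.

# Rough analogy only: a hashed file behaves like a dict keyed on the
# column(s) it was WRITTEN with. File and column names are hypothetical.
written_with_key_cust_id = {
    "CUST001": {"CUST_ID": "CUST001", "NAME": "Acme", "REGION": "East"},
    "CUST002": {"CUST_ID": "CUST002", "NAME": "Bolt", "REGION": "West"},
}

# Reading back with the same key column works as expected.
print(written_with_key_cust_id["CUST001"]["NAME"])   # Acme

# "Changing the key" on the read side (e.g. trying to look up by NAME)
# does not reorganise the stored data -- the lookup simply finds nothing.
print(written_with_key_cust_id.get("Acme"))          # None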

HTH,
Amos
chulett
Charter Member
Posts: 43085
Joined: Tue Nov 12, 2002 4:34 pm
Location: Denver, CO

Post by chulett »

I agree with Amos - there's no guarantee that, just because you are writing to one hash file, you will automatically be reading from the same one. Even though that's what 'everyone' does most of the time. :)

Now, it would be nice if there were an option to speed things up when that is in fact what you want to do, something to populate both 'sides' at once. But then it might get tricky if there are multiple input and output links from the same Hash stage to choose from...
-craig

"You can never have too many knives" -- Logan Nine Fingers
ray.wurlod
Participant
Posts: 54607
Joined: Wed Oct 23, 2002 10:52 pm
Location: Sydney, Australia

Post by ray.wurlod »

Think about it from an object perspective.

Column definitions are a property of the link. The stage merely serves as a means (an object) to deliver the appropriate functions (methods) for communicating to a particular form of persistent storage.

For example, the "getnext" method is handled by a Sequential File stage by reading the next line (or block of bytes, depending on properties), by a Hashed File stage by reading the next selected record, by a UV or ODBC stage by issuing an SQLFetch() call to get the next row of the result set generated by the SELECT statement or stored procedure (a property), and so on.

Every job component IS an object, and every job component in the DataStage Repository has an associated OLE type indicating the class from which it can be instantiated.
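
Purely as an illustration of that object view (a Python sketch, not how DataStage is actually implemented; every class and method name here is invented), each stage type supplies its own version of the "getnext" method:

from abc import ABC, abstractmethod

class Stage(ABC):
    """Each stage type supplies its own way of fetching the next record."""
    @abstractmethod
    def get_next(self):
        ...

class SequentialFileStage(Stage):
    def __init__(self, handle):
        self.handle = handle                   # an open text file
    def get_next(self):
        return self.handle.readline()          # next line (or block of bytes)

class HashedFileStage(Stage):
    def __init__(self, records):
        self.records = iter(records.values())  # selected records
    def get_next(self):
        return next(self.records, None)        # next selected record

class OdbcStage(Stage):
    def __init__(self, cursor):
        self.cursor = cursor                   # cursor of an executed SELECT
    def get_next(self):
        return self.cursor.fetchone()          # next row of the result set

The column definitions themselves still travel on the link, as described above; the classes only provide the access methods.
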
IBM Software Services Group
Any contribution to this forum is my own opinion and does not necessarily reflect any position that IBM may hold.
timwalsh
Participant
Posts: 29
Joined: Tue Mar 04, 2003 7:48 am

Post by timwalsh »

I fully agree with all of you ... I just spent a quarter of my day catching my client's mistakes due to DataStage's GUI. It's a catch-22: a simple GUI with less flexibility, or repetitive actions with a more flexible GUI.

I'm just thinking out loud about the two-tab method that Ascential uses for the vast majority of its DataStage stages ... as well as other components of its GUI. It is flexible, which is nice, but it is also overly complex and leads to small mistakes.

Have you heard of Ascential revamping its GUI? I wish they would harness the visual effectiveness of their Transformer stage to drive the GUI design of some of their other stages, such as the Aggregator, Sort, Hash, etc. ... maybe I'm dreaming.

Thanks for all the input.

Cheers,

Tim
kduke
Charter Member
Posts: 5227
Joined: Thu May 29, 2003 9:47 am
Location: Dallas, TX

Post by kduke »

Tim

I agree with some of what you said, but there are times when metadata from the same source table does not match on purpose. It can be a very effective technique if used correctly. A partial-key lookup is a good case in point.

I too have found flaws in customer jobs by looking for metadata mismatches. I automated this in DwNav for one job. I do not think the queries would be difficult to change so that the metadata for a hash file is compared across all jobs. I could write it and send it to you if you want.
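
As a rough sketch of that kind of check (plain Python over metadata you have already pulled out by whatever means; the job names, file names, and data structure are hypothetical and have nothing to do with DwNav internals):

from collections import defaultdict

# Hypothetical input: the column layout each job declares for each hash file.
job_hash_columns = {
    "LoadCustomers":  {"LKP_CUSTOMER": ["CUST_ID", "NAME", "REGION"]},
    "BuildFactSales": {"LKP_CUSTOMER": ["CUST_ID", "NAME"]},            # mismatch
    "LoadProducts":   {"LKP_PRODUCT":  ["PROD_ID", "DESCRIPTION"]},
}

layouts = defaultdict(dict)
for job, files in job_hash_columns.items():
    for hash_file, columns in files.items():
        layouts[hash_file][job] = tuple(columns)

for hash_file, per_job in layouts.items():
    if len(set(per_job.values())) > 1:
        print("Metadata mismatch for " + hash_file + ":")
        for job, cols in per_job.items():
            print("  " + job + ": " + ", ".join(cols))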

The larger DS sites have a Quality Assurance person or team, and part of their role should be validating metadata. Most DS sites have no controls on their metadata; the attitude is "if the job works, then what is your problem?" Changing the hash file structure in one job may make another job produce bad data. It is a fairly easy query in doc_tools.mdb to find all the jobs which use a specific hash file or sequential file. I do not know MetaStage, but I am sure it would be even easier there; MetaStage has where-used reports. I wanted quick access to this in DwNav, so it is part of the main tree.

I do not think in terms of stages and links; those are concepts Ascential needed to build ETL. I think of source and target files. Algorithms plus data structures equals programming. The data structures are more important than the methods, or the ETL it took to populate them. DataStage may not exist 5 years from now, and you will always need accurate data in your data warehouse.

It does not matter how you get this information about jobs and their relationship to the files or tables they use. The important thing is that you use this information so that, when you change one job, you know it may force you to change other jobs. This is called impact analysis. If your target system has hundreds of tables, you had better have a handle on this.
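
A where-used report along the same lines can be sketched just as simply (again plain Python over hypothetical, already-extracted metadata; this is not doc_tools.mdb or MetaStage):

# Hypothetical input: which files or tables each job reads or writes.
job_files = {
    "LoadCustomers":  {"CUSTOMER_SRC", "LKP_CUSTOMER"},
    "BuildFactSales": {"LKP_CUSTOMER", "SALES_SRC", "FACT_SALES"},
    "LoadProducts":   {"PRODUCT_SRC", "LKP_PRODUCT"},
}

def where_used(file_name):
    # The jobs to re-check before changing that file's structure.
    return sorted(job for job, files in job_files.items() if file_name in files)

print(where_used("LKP_CUSTOMER"))   # ['BuildFactSales', 'LoadCustomers']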

The goal of MetaStage is to control this across products, for example between DataStage and OLAP tools like MicroStrategy. Unless your company is well organized and future-minded, this is overkill. DwNav is just designed to help you navigate DataStage metadata. I designed it to help me quickly get familiar with the metadata on an existing DataStage site. As a consultant it is important to ramp up quickly. I don't know how others do it. Normally MetaStage is not installed, or not available to all DataStage developers.

I am curious. How do other people do:

1. Impact analysis
2. Where-used reports (which job uses which table)
3. What is your longest running job?
4. What is your slowest running job by rows per second?
5. Metadata mismatch on the same file across jobs
6. What kind of documentation do you have?
7. Does your documentation have:
a. Files/Tables read or written to
b. Parameter values used
c. Routines used
d. Rows per second
e. Job run times
f. Picture of job
8. Modification history

Routines add another level of where used reporting. I have heard MetaStage does a good job of this. DwNav cannot do this at all.

I think too many of us are only concerned about making something work and not the overall impact of ETL. I guess most of you have complete control over all the above. That is why you do not need an inexpensive tool like DwNav or a much more comprehensive tool like MetaStage.

I would love to do a system audit at most DataStage sites and see how well you really do these things. In the future these concepts will not be an option.

Sorry I need the points. I haven't posted much this week. Need food stamps.

Kim.
kduke
Charter Member
Posts: 5227
Joined: Thu May 29, 2003 9:47 am
Location: Dallas, TX

Post by kduke »

Hey, nobody ever answered my question, "How do you get the above information?" I got to looking around, thought I could automate a lot of this, and did so last week. Here is an example of good DataStage documentation. It comes directly out of DwNav, which is available at:

http://www.datastagexchange.com/DwNav/

Shameless plug. You people asked for examples, so here they are at:

http://www.duke-consulting.com/DwNav/Da ... _Index.htm

Hopefully at some point we will be able to build a class around DataStage basics and teach it over the web. A lot of these jobs are built around a simple class I put together for a friend.

I still want to know what kind of documentation others have. How complete is it? Who gets it? Is it non-technical? Would it pass a system audit?

Ken and Ray, let us see the good stuff.

Kim.
Mamu Kim
mhester
Participant
Posts: 622
Joined: Tue Mar 04, 2003 5:26 am
Location: Phoenix, AZ

Post by mhester »

I am curious. How do other people do:

1. Impact analysis
2. Where-used reports (which job uses which table)
3. What is your longest running job?
4. What is your slowest running job by rows per second?
5. Metadata mismatch on the same file across jobs
6. What kind of documentation do you have?
7. Does your documentation have:
a. Files/Tables read or written to
b. Parameter values used
c. Routines used
d. Rows per second
e. Job run times
f. Picture of job
8. Modification history
#1 - MetaStage
#2 - MetaStage
#5 - MetaStage
#3, 4 - Custom code that can be run on any platform and for any version of DS.
#6 - 8 - Some custom stuff, some MetaStage, some Doc tool and of course DwNav!

Regards,

Michael Hester
kduke
Charter Member
Posts: 5227
Joined: Thu May 29, 2003 9:47 am
Location: Dallas, TX

Post by kduke »

Michael

Lol. Very funny. My point is that most people who have MetaStage do not use it. They do not seem to value good documentation or even metadata. So many good developers do not document their work; I think their work is incomplete.

Kim.
Mamu Kim
kduke
Charter Member
Posts: 5227
Joined: Thu May 29, 2003 9:47 am
Location: Dallas, TX

Post by kduke »

Michael

This is exactly what I thought: you are the only one doing any kind of documentation.
Mamu Kim