Multi-valued Column in Hash File


Someswar Barkataki
Participant
Posts: 6
Joined: Tue Jul 13, 2004 7:40 am

Multi-valued Column in Hash File

Post by Someswar Barkataki »

Hi,

Can anyone suggest how to create a multi-valued column in a hash file?
The column should display all the values for a particular key, separated by commas.

For example, an input file with three columns c1, c2 and c3:

c1  c2  c3
NE  20  1
EX  30  2
NE  40  3
EX  45  4
EX  50  5

And the hash file should show this result (c1 is my key):

c1  c2          c3
NE  20, 40      1, 3
EX  30, 45, 50  2, 4, 5

Thanks in advance,

Somesh
mhester
Participant
Posts: 622
Joined: Tue Mar 04, 2003 5:26 am
Location: Phoenix, AZ

Post by mhester »

Hashed file writes are destructive, so with that in mind I have approached this kind of issue in a variety of ways. One method I have employed is the same type of logic used for SCDs, with a twist.

You would use your hashed file as both a target and a reference. Create stage variables to hold the contents of the columns you are building. Perform a lookup on each incoming row of data and use logic within the stage variables either to append the incoming values to the already existing ones, or to set the stage variables to the incoming values when the lookup fails.

This still performs destructive writes, but each write will be building the record you want.
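
A minimal sketch of what those stage variables might look like, assuming an input link In, a reference link Lkp that looks up the same hashed file on c1, and comma-separated output (all link, column and stage-variable names are illustrative):

Code: Select all

svC2:  If IsNull(Lkp.c2) Then In.c2 Else Lkp.c2 : ", " : In.c2
svC3:  If IsNull(Lkp.c3) Then In.c3 Else Lkp.c3 : ", " : In.c3

Output columns: c1 = In.c1 (the key), c2 = svC2, c3 = svC3

Each row overwrites the hashed file record for its key, but because the stage variables start from whatever is already in the hashed file, the record keeps growing until the last row for that key has been written.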

Another way to solve this without a reference lookup would be to use a DS routine in the transformer that takes the key and whatever fields you are building and returns strings suitable for writing to the hashed file as value mark (CHAR(253)) delimited strings.
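
A minimal sketch of that kind of routine in DS BASIC, assuming the input is sorted on the key so that the last write per key carries the full list (the argument names and the COMMON block name are illustrative):

Code: Select all

      * Accumulator transform routine (illustrative sketch only).
      * Arg1 = key of the current row, Arg2 = value to accumulate.
      * Because hashed file writes are destructive, the last (fullest)
      * write for each key is the one that survives.
      COMMON /MvAccum/ LastKey, ValueList

      If UnAssigned(LastKey) Then LastKey = ""

      If Arg1 = LastKey Then
         ValueList = ValueList : @VM : Arg2   ;* same key: add another value
      End Else
         LastKey = Arg1                       ;* new key: start a fresh list
         ValueList = Arg2
      End

      Ans = ValueList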

Good luck

Regards,
ray.wurlod
Participant
Posts: 54607
Joined: Wed Oct 23, 2002 10:52 pm
Location: Sydney, Australia

Post by ray.wurlod »

Welcome aboard! :D

How you create multi-valued columns depends on how you create the hashed file.
If you're using a Hashed File or UV stage to create the hashed file, specify M in the Type column in the Columns grid.
If you're using a CREATE TABLE statement, you specify that the column is MULTIVALUED within its column definition.
If you're using a CREATE.FILE or mkdbfile command, you specify it when building the file dictionary.

Populate the hashed file normalized on the multi-valued column.
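
For the CREATE TABLE route, a hedged sketch in UniVerse SQL using the column names from the original example (the table name and data types are illustrative, and grouping c2 and c3 into a named association would need an additional association clause, omitted here):

Code: Select all

CREATE TABLE MYHASH
(
    C1  VARCHAR  NOT NULL PRIMARY KEY,
    C2  INT      MULTIVALUED,
    C3  INT      MULTIVALUED
);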
Last edited by ray.wurlod on Tue Aug 31, 2004 9:35 pm, edited 1 time in total.
IBM Software Services Group
Any contribution to this forum is my own opinion and does not necessarily reflect any position that IBM may hold.
chulett
Charter Member
Posts: 43085
Joined: Tue Nov 12, 2002 4:34 pm
Location: Denver, CO

Post by chulett »

I don't think the OP meant "multi-valued" in the UniVerse sense, did they? It seems like they were just looking for a column where the values are separated by commas, which sounds like simple concatenation to me. :?
-craig

"You can never have too many knives" -- Logan Nine Fingers
mhester
Participant
Posts: 622
Joined: Tue Mar 04, 2003 5:26 am
Location: Phoenix, AZ

Post by mhester »

My original post was slanted more toward their request for comma-separated values, hence the reply I gave.

Regards,
anupam
Participant
Posts: 172
Joined: Fri Apr 04, 2003 10:51 pm
Location: India

Post by anupam »

If you're using a Hashed File or UV stage to create the hashed file, specify M in the Type column in the Columns grid.
I want to create a Hashed File stage with multivalued columns, but I am not able to see the option for specifying "M" or MULTIVALUED.

In the Hashed File stage there is no Type column; there is only an SQL Type column in the Columns grid for specifying the type of the column.

Please suggest how to create a multi-valued hashed file using the Hashed File stage in DataStage version 7.x.
----------------
Rgds,
Anupam
----------------
The future is not something we enter. The future is something we create.
mhester
Participant
Posts: 622
Joined: Tue Mar 04, 2003 5:26 am
Location: Phoenix, AZ

Post by mhester »

Anupam,

I do not believe DS has the ability (natively) to write multi-valued fields to a UV file; that's why I gave you my suggestion. I believe that in order to do this, DS would have to read and rewrite every output record so that a destructive write does not take place.

If the ability is there I have never used it nor am I aware of it.

Regards,
kduke
Charter Member
Posts: 5227
Joined: Thu May 29, 2003 9:47 am
Location: Dallas, TX

Post by kduke »

Michael

It can be done. I just wrote a job that takes a source, writes to a hashed file, and does a lookup on the same hashed file. If the lookup finds the key, the output column is lookup.col1 : @VM : In.col1. You can aggregate to multivalues in a hashed file.

Code: Select all


        JobsByCatHash
             | 
             v
UvStage -> Trans -> JobsByCatHash

UvStage
SELECT DS_JOBS.NAME, DS_JOBS.CATEGORY FROM DS_JOBS WHERE NAME NOT LIKE '\%' ORDER BY CATEGORY, NAME;

Trans
if IsNull(LkpEtlJobCatHash.JOB_NAME) then DsJobLinks.JOB_NAME else LkpEtlJobCatHash.JOB_NAME:@VM:DsJobLinks.JOB_NAME


Try it. It works great. This would be awesome on dates.

Now you need to create new dictionary items to associate the multivalues on the hash file just created.
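
For the dictionary side, a sketch of the kind of D-type record you could add to the dictionary of JobsByCatHash so that the job names resolve as a multivalued field (the record name, column heading and format are illustrative; the M in field 6 is what marks the field as multivalued):

Code: Select all

JOB_NAME
001 D
002 1
003
004 Job Name
005 35L
006 M

You can add the record with ED DICT JobsByCatHash JOB_NAME (or however you prefer to edit dictionaries); LIST and the UV stage will then treat the field as multivalued.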
Mamu Kim
tonystark622
Premium Member
Posts: 483
Joined: Thu Jun 12, 2003 4:47 pm
Location: St. Louis, Missouri USA

Post by tonystark622 »

That's what I've been doing, too. Works great!

Tony
kduke
Charter Member
Posts: 5227
Joined: Thu May 29, 2003 9:47 am
Location: Dallas, TX

Post by kduke »

Well, Tony, they all said it did not work. You could have said something sooner.
Mamu Kim
tonystark622
Premium Member
Posts: 483
Joined: Thu Jun 12, 2003 4:47 pm
Location: St. Louis, Missouri USA

Post by tonystark622 »

Sorry Kim,

I figured that you folks are way smarter than me about this and that I must not have understood the whole situation. When you described what you were doing, building the multi-valued field in a hashed file, it just clicked with me.

Tony
anupam
Participant
Posts: 172
Joined: Fri Apr 04, 2003 10:51 pm
Location: India

Post by anupam »

Code: Select all

     JobsByCatHash 
             | 
             v 
UvStage -> Trans -> JobsByCatHash 
It can be done using the method Kim suggested, and it is an acceptable way as well.

Code: Select all

If you're using a Hashed File or UV stage to create the hashed file, specify M in the Type column in the Columns grid. 
I was not able to see the Type column in the Columns grid for marking a column as multivalued in the Hashed File stage, so I was just curious to know how it is possible to specify it there.
----------------
Rgds,
Anupam
----------------
The future is not something we enter. The future is something we create.
kduke
Charter Member
Posts: 5227
Joined: Thu May 29, 2003 9:47 am
Location: Dallas, TX

Post by kduke »

That is the trick. In the hash file you do not treat this field as multivalued. You do so only in a UV stage. This allows one job to create the multivalues and another job to use them.

Most of the time I would only use this in a lookup, and I would write a routine to do it. Look at Ken Bland's SCD Type 2 routine that locates a value by date: you no longer need the rollup routine, because you can do that rollup in a job, and the conversion of '|' to @VM goes away in the other routine since the @VMs are already there. His routine is what I would model a routine on to use this new hashed file.
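
A minimal sketch of the shell of such a lookup routine in DS BASIC (this is not Ken Bland's actual routine; the file variable, COMMON block and argument names are illustrative): it opens the hashed file once, caches the handle in named COMMON, and returns the whole @VM-delimited field for the key, which further routine logic could then search by date.

Code: Select all

      * Transform routine sketch. Arg1 = hashed file name, Arg2 = key value.
      * Returns the multivalued field for the key, still @VM-delimited.
      COMMON /MvLookup/ FileIsOpen, MvFile

      If UnAssigned(FileIsOpen) Then FileIsOpen = @FALSE

      If NOT(FileIsOpen) Then
         Open Arg1 To MvFile Then
            FileIsOpen = @TRUE
         End Else
            Call DSLogWarn("Cannot open hashed file " : Arg1, "MvLookup")
         End
      End

      Ans = ""
      If FileIsOpen Then
         Read Rec From MvFile, Arg2 Else Rec = ""
         Ans = Rec<1>
      End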
Mamu Kim
ryoung011
Participant
Posts: 10
Joined: Mon May 03, 2004 2:02 pm
Location: Atlanta, GA

Post by ryoung011 »

Thanks for all of the great info. We are just starting the construction stage of an Enterprise DW, and these posts are invaluable.

I am testing this as a solution for Type II Slowly Changing Dimension surrogate key lookups (in conjunction with Mr. Bland's routine) and have one question...

Is there a more efficient way of doing this? My test seems to be quite slow. I have the lookup hash configured with caching "Enabled, Lock For Updates".

My dimension has about 3M distinct natural key values, with about 9M records overall (average of three Type II changes per entity).

I am getting about 180 rows/sec. As you can imagine, this would take several hours to complete for an initial load.

One option I've researched is creating the multivalue strings in the database. Using PL/SQL, I created a table in the dimension table's database that accomplishes the same result and is MUCH faster (3,000 rows/sec); I subsequently dumped it to a hashed file for the lookup (using the routine Mr. Bland supplied).

I'd like to keep it all in DataStage if possible.

Any ideas or configuration suggestions to speed this up?

Thanks!
kduke
Charter Member
Posts: 5227
Joined: Thu May 29, 2003 9:47 am
Location: Dallas, TX

Post by kduke »

I think it is valid to do some of this in PL/SQL if it speeds things up. Doing it in the stage's before-SQL may even make it look like it is all done in DataStage. I think most of us have used temp tables created with SQL to get around some performance issue. Maybe even Ray has used this trick.

I do not like to use Aggregator stages. I usually do this in SQL or in one of those sorting tools like CoSort. Sometimes going around DataStage is worth it for performance, but I try to limit that sort of thing.
Mamu Kim