Using hash files instead of UV tables for multirow

Post questions here relative to DataStage Server Edition for such areas as Server job design, DS Basic, Routines, Job Sequences, etc.

Moderators: chulett, rschirm, roy

Post Reply
kcbland
Participant
Posts: 5208
Joined: Wed Jan 15, 2003 8:56 am
Location: Lutz, FL
Contact:

Post by kcbland »

Here's the subroutine for converting a reverse sorted and grouped comma separated sequential file of natural to surrogate key columns described in the previous post. You could paste this logic into a subroutine. The InputArg value is the fully qualified sequential file name with a ".???" extension. The logic removes the current extension and adds ".rlp".

This subroutine could be called in the after routine of the job that selects the rows and outputs to the sequential file. You could then load this data into a hash file for reference purposes. IMPORTANT! Dates should be in YYYY-MM-DD or YYYY-MM-DD HH24:MI:SS format!

Code: Select all

    FunctionName = "KBASkeyRollup"
      RollupFileDtl = InputArg
      RollupFile = RollupFileDtl[1,LEN(RollupFileDtl)-3]:"rlp"

      Call DSLogInfo("Opening detail file [":RollupFileDtl:"]", FunctionName)
      OPENSEQ RollupFileDtl TO S.Detail Else
         Call DSLogFatal("Unable to open detail file: ":RollupFileDtl, FunctionName)
      End
      Call DSLogInfo("Opened detail file [":RollupFileDtl:"]", FunctionName)

      Call DSLogInfo("Opening rollup file [":RollupFile:"]", FunctionName)
      ValidOpen = @FALSE
      OPENSEQ RollupFile TO S.Rollup Then
         ValidOpen = @TRUE
      End Else
         CREATE S.Rollup Then ValidOpen = @TRUE
      End
      If NOT(ValidOpen) Then
         Call DSLogFatal("Could not create the file: ":RollupFile, FunctionName)
      End
      Call DSLogInfo("Opened rollup file [":RollupFile:"]", FunctionName)

      RowCount = 0
      OutputRowCount = 0
      RollupRow = ""
      LastRollupID = "LAST"
      Done = @FALSE
      Loop
         READSEQ row FROM S.Detail Else Done = @TRUE
      Until Done Do
         RowCount += 1
         source_system_code = field(row,',',1)
         source_system_key = field(row,',',2)
         surrogate_key = field(row,',',3)
         time_variant_start_date = field(row,',',4)
         time_variant_version_nbr = field(row,',',5)

         RollupID = field(row,',',1,2)
         If LastRollupID = "LAST" Then LastRollupID = RollupID
         If RollupID # LastRollupID Then
            OutputRowCount += 1
            CONVERT @AM TO "," IN RollupRow
            WRITESEQ RollupRow ON S.Rollup Else
               Call DSLogFatal("Unable to write to file ":RollupFile, FunctionName)
            END
            RollupRow = ""
            LastRollupID = RollupID
         End
         If RollupRow = "" Then Delimiter = "" Else Delimiter = "|"
         RollupRow = source_system_code
         RollupRow = source_system_key
         RollupRow = RollupRow:Delimiter:surrogate_key
         RollupRow = RollupRow:Delimiter:time_variant_start_date
         RollupRow = RollupRow:Delimiter:time_variant_version_nbr
      Repeat
      If RollupRow # "" Then
         OutputRowCount += 1
         CONVERT @AM TO "," IN RollupRow
         WRITESEQ RollupRow ON S.Rollup Else
            Call DSLogFatal("Unable to write to file ":RollupFile, FunctionName)
         END
      End
      WEOFSEQ S.Rollup

      CLOSESEQ S.Rollup
      CLOSESEQ S.Detail

      Call DSLogInfo("Finished creating rollup file: [":RollupFile:"]", FunctionName)
      Call DSLogInfo("Input Rows Processed: [":RowCount:"]", FunctionName)
      Call DSLogInfo("Output Rows Written: [":OutputRowCount:"]", FunctionName)

      ErrorCode = 0                      ; * set this to non-zero to stop the stage/job
Kenneth Bland
Last edited by kcbland on Tue Oct 28, 2003 9:16 am, edited 1 time in total.
kcbland
Participant
Posts: 5208
Joined: Wed Jan 15, 2003 8:56 am
Location: Lutz, FL
Contact:

Post by kcbland »

Here's a function you could use to pick the appropriate surrogate key for a business date. Call the function, passing it the business date, the hash referenced row of surrogate keys, and the hash referenced row of time variant start dates for those surrogate keys. The function scans the array of dates, locating the point in time that the business date would have fallen, and returns the corresponding surrogate key effective at that time.
I call this function KBAVariantSkey:

Code: Select all

      FindDate = Arg1
      SurrKeys = Arg2
      TimeVariantStartDates = Arg3

      CONVERT "|" TO @VM IN TimeVariantStartDates

      Begin Case
         Case FindDate = "" OR ISNULL(FindDate)
            SurrKey = 0
         Case SurrKeys = "" OR ISNULL(SurrKeys)
            SurrKey = 0
         Case TimeVariantStartDates = "" OR ISNULL(TimeVariantStartDates)
            SurrKey = 0
         Case @TRUE
            Found = @FALSE
            LOCATE FindDate IN TimeVariantStartDates<1> BY "DL" SETTING Position Then
               Found = @TRUE
            End
            SurrKey = FIELD(SurrKeys, "|", Position)
            If NOT((Found OR Position # 0) AND SurrKey # "") Then
               SurrKeyCnt = DCOUNT(SurrKeys, "|")
               SurrKey = FIELD(SurrKeys, "|", SurrKeyCnt)
            End
      End Case

      Ans = SurrKey
Kenneth Bland
Last edited by kcbland on Tue Nov 22, 2005 7:41 am, edited 2 times in total.
kcbland
Participant
Posts: 5208
Joined: Wed Jan 15, 2003 8:56 am
Location: Lutz, FL
Contact:

Post by kcbland »

Here's a function to return the most currently assigned surrogate key. Simply call this function passing it the hash reference list of surrogate keys. Since it's pre-sorted reverse-chronologically, the new surrogate must be in the first position.

I called this function KBACurrentSkey:

Code: Select all

      Ans = Field(Arg1, "|", 1)

Kenneth Bland
Last edited by kcbland on Tue Oct 28, 2003 9:17 am, edited 1 time in total.
kcbland
Participant
Posts: 5208
Joined: Wed Jan 15, 2003 8:56 am
Location: Lutz, FL
Contact:

Post by kcbland »

The thing to note from what I'm trying to do here is use hash files as they are meant to be used IMO. Hash files are optimized structures for one-to-one reference lookups. When you start adding secondary indexes, you're moving away from using an ETL tool and instead are learning about database technology.

The time spent pre-sorting of the data and then rolling into a lookup can vastly be gained back once this type of optimized structure is being used for reference purposes. When the same slowly changing dimension rows are repeatedly being queried, the database is wasting time and i/o doing the same sorting and grouping and max logic over and over. This SQL is some of the worst performing SQL, and without a specially contrived multi-value surrogate key cross-reference table every row would have the following SQL executed for each foreign key substitution:

Code: Select all

SELECT MAX(skeyCNTRY.CNTRY_ID) from skeyCNTRY
WHERE skeyCNTRY.Source_System_Code = GEO.Source_System_Code
      AND skeyCNTRY.Source_System_Key = GEO.Source_System_Key
      AND skeyCNTRY.Time_Variant_Start_Date <= yourfactdate 
kduke
Charter Member
Charter Member
Posts: 5227
Joined: Thu May 29, 2003 9:47 am
Location: Dallas, TX
Contact:

Post by kduke »

Ken

Look at this topic:

viewtopic.php?t=88934

You can do the rollup in a job. You can also use @VM and not "|". Makes this a lot easier.
Mamu Kim
ryoung011
Participant
Posts: 10
Joined: Mon May 03, 2004 2:02 pm
Location: Atlanta, GA
Contact:

Post by ryoung011 »

Just a note to future users and FYI...

I had to use @FM in DS 7x (Linux RH AS 3.0) to get this to work. @VM would not work correctly in the locate function.
kduke
Charter Member
Charter Member
Posts: 5227
Joined: Thu May 29, 2003 9:47 am
Location: Dallas, TX
Contact:

Post by kduke »

Sure it does. LOCATE is smart. If you LOCATE in Rec then uses @FM. LOCATE in Rec<2> will look for @VM as separators. LOCATE in Rec<2,3> will use @SVM as a separator.

The learning curve for these little details is steep. Ken, Ray and others have used Universe and other similar databases for years. Most of us had a mentor to teach these details. It maybe much harder for you to learn this when your mentor is on the other side of the world and he cannot watch you code these problems. You got it to work which is awesome.

I think we need to post the jobs and routines for users to see a complete solution. Ken, Ray and others have tried to bridge this gap. My hope is Ascential helps Ray and Ken to release as much technical information as possible. All they want to do is help us become better developers with DataStage. Better developers should equate to happier customers. Right?
Mamu Kim
willpeng
Participant
Posts: 18
Joined: Wed Apr 07, 2004 9:24 pm
Location: Middletown, NJ

Post by willpeng »

kduke wrote:All they want to do is help us become better developers with DataStage. Better developers should equate to happier customers. Right?
Right! :shock:

Just one word... BellSouth

No matter how good we are, they were not happy with what we did.
William Peng
DW/ETL Consultant
Middletown, NJ
kduke
Charter Member
Charter Member
Posts: 5227
Joined: Thu May 29, 2003 9:47 am
Location: Dallas, TX
Contact:

Post by kduke »

The solution at Bellsouth was a political one and not an ETL one. Our team was still a good team.
Mamu Kim
ray.wurlod
Participant
Posts: 54607
Joined: Wed Oct 23, 2002 10:52 pm
Location: Sydney, Australia
Contact:

Post by ray.wurlod »

Didn't Scott Adams (author of Dilbert) once work there too? :lol:
IBM Software Services Group
Any contribution to this forum is my own opinion and does not necessarily reflect any position that IBM may hold.
him121
Premium Member
Premium Member
Posts: 55
Joined: Sat Aug 07, 2004 1:50 am

Post by him121 »

Thanks Huy for help.

But I used Universe stage ..with User difined query like
Select code from HASH where L_lmt <= input And U_lmt >= input;
and I did Lookup with UV Stage directly...
but here i needed to conver ..decimal values into Varchar...

Thanks...anyway
dial70
Participant
Posts: 1
Joined: Mon Sep 26, 2005 9:01 am

Post by dial70 »

I checked the function to get the surrkey, but
kcbland wrote:
LOCATE FindDate IN TimeVariantStartDates BY "DL" SETTING Position Then
Found = @TRUE
End

Kenneth Bland
LOCATE ever returns the second SurrKey, if FindDate is not greater than the first Date.
iE: FindDate = 2005-10-10
TimeVariantStartDates = 2005-11-15|2005-11-01|2005-10-15|2005-10-01
SurrKeys= 40|30|20|10

Result = 30 !

Does this thing really works?
What is my mistake?

DataStage: 7.5 at SunOS 5.8
kduke
Charter Member
Charter Member
Posts: 5227
Joined: Thu May 29, 2003 9:47 am
Location: Dallas, TX
Contact:

Post by kduke »

I think it should be:

Code: Select all

LOCATE FindDate IN TimeVariantStartDates<1> BY "DL" SETTING Position Then 
   Found = @TRUE 
End 
And it works great if done right.
Mamu Kim
kcbland
Participant
Posts: 5208
Joined: Wed Jan 15, 2003 8:56 am
Location: Lutz, FL
Contact:

Post by kcbland »

I corrected the code, it seems the "LOCATE FindDate IN TimeVariantStartDates" should have been "LOCATE FindDate IN TimeVariantStartDates<1>".

The choice between @VM and @FM was subtle, as the point is a delimited array of values. Somehow I mixed up the two and never posted the corrected logic.

Sorry for the delay in correcting old posts.
Kenneth Bland

Rank: Sempai
Belt: First degree black
Fight name: Captain Hook
Signature knockout: right upper cut followed by left hook
Signature submission: Crucifix combined with leg triangle
Post Reply