Using hash files instead of UV tables for multirow

kcbland · Post by **kcbland** » Sun Aug 24, 2003 7:09 am

Here's the subroutine for converting a reverse sorted and grouped comma separated sequential file of natural to surrogate key columns described in the previous post. You could paste this logic into a subroutine. The InputArg value is the fully qualified sequential file name with a ".???" extension. The logic removes the current extension and adds ".rlp".

This subroutine could be called in the after routine of the job that selects the rows and outputs to the sequential file. You could then load this data into a hash file for reference purposes. IMPORTANT! Dates should be in YYYY-MM-DD or YYYY-MM-DD HH24:MI:SS format!

Code: Select all

    FunctionName = "KBASkeyRollup"
      RollupFileDtl = InputArg
      RollupFile = RollupFileDtl[1,LEN(RollupFileDtl)-3]:"rlp"

      Call DSLogInfo("Opening detail file [":RollupFileDtl:"]", FunctionName)
      OPENSEQ RollupFileDtl TO S.Detail Else
         Call DSLogFatal("Unable to open detail file: ":RollupFileDtl, FunctionName)
      End
      Call DSLogInfo("Opened detail file [":RollupFileDtl:"]", FunctionName)

      Call DSLogInfo("Opening rollup file [":RollupFile:"]", FunctionName)
      ValidOpen = @FALSE
      OPENSEQ RollupFile TO S.Rollup Then
         ValidOpen = @TRUE
      End Else
         CREATE S.Rollup Then ValidOpen = @TRUE
      End
      If NOT(ValidOpen) Then
         Call DSLogFatal("Could not create the file: ":RollupFile, FunctionName)
      End
      Call DSLogInfo("Opened rollup file [":RollupFile:"]", FunctionName)

      RowCount = 0
      OutputRowCount = 0
      RollupRow = ""
      LastRollupID = "LAST"
      Done = @FALSE
      Loop
         READSEQ row FROM S.Detail Else Done = @TRUE
      Until Done Do
         RowCount += 1
         source_system_code = field(row,',',1)
         source_system_key = field(row,',',2)
         surrogate_key = field(row,',',3)
         time_variant_start_date = field(row,',',4)
         time_variant_version_nbr = field(row,',',5)

         RollupID = field(row,',',1,2)
         If LastRollupID = "LAST" Then LastRollupID = RollupID
         If RollupID # LastRollupID Then
            OutputRowCount += 1
            CONVERT @AM TO "," IN RollupRow
            WRITESEQ RollupRow ON S.Rollup Else
               Call DSLogFatal("Unable to write to file ":RollupFile, FunctionName)
            END
            RollupRow = ""
            LastRollupID = RollupID
         End
         If RollupRow = "" Then Delimiter = "" Else Delimiter = "|"
         RollupRow = source_system_code
         RollupRow = source_system_key
         RollupRow = RollupRow:Delimiter:surrogate_key
         RollupRow = RollupRow:Delimiter:time_variant_start_date
         RollupRow = RollupRow:Delimiter:time_variant_version_nbr
      Repeat
      If RollupRow # "" Then
         OutputRowCount += 1
         CONVERT @AM TO "," IN RollupRow
         WRITESEQ RollupRow ON S.Rollup Else
            Call DSLogFatal("Unable to write to file ":RollupFile, FunctionName)
         END
      End
      WEOFSEQ S.Rollup

      CLOSESEQ S.Rollup
      CLOSESEQ S.Detail

      Call DSLogInfo("Finished creating rollup file: [":RollupFile:"]", FunctionName)
      Call DSLogInfo("Input Rows Processed: [":RowCount:"]", FunctionName)
      Call DSLogInfo("Output Rows Written: [":OutputRowCount:"]", FunctionName)

      ErrorCode = 0                      ; * set this to non-zero to stop the stage/job

Kenneth Bland

kcbland · Post by **kcbland** » Sun Aug 24, 2003 7:13 am

Here's a function you could use to pick the appropriate surrogate key for a business date. Call the function, passing it the business date, the hash referenced row of surrogate keys, and the hash referenced row of time variant start dates for those surrogate keys. The function scans the array of dates, locating the point in time that the business date would have fallen, and returns the corresponding surrogate key effective at that time.
I call this function KBAVariantSkey:

Code: Select all

      FindDate = Arg1
      SurrKeys = Arg2
      TimeVariantStartDates = Arg3

      CONVERT "|" TO @VM IN TimeVariantStartDates

      Begin Case
         Case FindDate = "" OR ISNULL(FindDate)
            SurrKey = 0
         Case SurrKeys = "" OR ISNULL(SurrKeys)
            SurrKey = 0
         Case TimeVariantStartDates = "" OR ISNULL(TimeVariantStartDates)
            SurrKey = 0
         Case @TRUE
            Found = @FALSE
            LOCATE FindDate IN TimeVariantStartDates<1> BY "DL" SETTING Position Then
               Found = @TRUE
            End
            SurrKey = FIELD(SurrKeys, "|", Position)
            If NOT((Found OR Position # 0) AND SurrKey # "") Then
               SurrKeyCnt = DCOUNT(SurrKeys, "|")
               SurrKey = FIELD(SurrKeys, "|", SurrKeyCnt)
            End
      End Case

      Ans = SurrKey

Kenneth Bland

kcbland · Post by **kcbland** » Sun Aug 24, 2003 7:15 am

Here's a function to return the most currently assigned surrogate key. Simply call this function passing it the hash reference list of surrogate keys. Since it's pre-sorted reverse-chronologically, the new surrogate must be in the first position.

I called this function KBACurrentSkey:

Code: Select all

      Ans = Field(Arg1, "|", 1)

Kenneth Bland

kcbland · Post by **kcbland** » Sun Aug 24, 2003 7:25 am

The thing to note from what I'm trying to do here is use hash files as they are meant to be used IMO. Hash files are optimized structures for one-to-one reference lookups. When you start adding secondary indexes, you're moving away from using an ETL tool and instead are learning about database technology.

The time spent pre-sorting of the data and then rolling into a lookup can vastly be gained back once this type of optimized structure is being used for reference purposes. When the same slowly changing dimension rows are repeatedly being queried, the database is wasting time and i/o doing the same sorting and grouping and max logic over and over. This SQL is some of the worst performing SQL, and without a specially contrived multi-value surrogate key cross-reference table every row would have the following SQL executed for each foreign key substitution:

Code: Select all

SELECT MAX(skeyCNTRY.CNTRY_ID) from skeyCNTRY
WHERE skeyCNTRY.Source_System_Code = GEO.Source_System_Code
      AND skeyCNTRY.Source_System_Key = GEO.Source_System_Key
      AND skeyCNTRY.Time_Variant_Start_Date <= yourfactdate

kduke · Post by **kduke** » Fri Sep 03, 2004 10:56 am

Ken

Look at this topic:

viewtopic.php?t=88934

You can do the rollup in a job. You can also use @VM and not "|". Makes this a lot easier.

ryoung011 · Post by **ryoung011** » Thu Dec 09, 2004 3:28 pm

Just a note to future users and FYI...

I had to use @FM in DS 7x (Linux RH AS 3.0) to get this to work. @VM would not work correctly in the locate function.

kduke · Post by **kduke** » Thu Dec 09, 2004 3:46 pm

Sure it does. LOCATE is smart. If you LOCATE in Rec then uses @FM. LOCATE in Rec<2> will look for @VM as separators. LOCATE in Rec<2,3> will use @SVM as a separator.

The learning curve for these little details is steep. Ken, Ray and others have used Universe and other similar databases for years. Most of us had a mentor to teach these details. It maybe much harder for you to learn this when your mentor is on the other side of the world and he cannot watch you code these problems. You got it to work which is awesome.

I think we need to post the jobs and routines for users to see a complete solution. Ken, Ray and others have tried to bridge this gap. My hope is Ascential helps Ray and Ken to release as much technical information as possible. All they want to do is help us become better developers with DataStage. Better developers should equate to happier customers. Right?

willpeng · Post by **willpeng** » Tue Feb 15, 2005 2:41 pm

kduke wrote:All they want to do is help us become better developers with DataStage. Better developers should equate to happier customers. Right?

Right!

Just one word... BellSouth

No matter how good we are, they were not happy with what we did.

kduke · Post by **kduke** » Tue Feb 15, 2005 5:12 pm

The solution at Bellsouth was a political one and not an ETL one. Our team was still a good team.

ray.wurlod · Post by **ray.wurlod** » Tue Feb 15, 2005 7:51 pm

Didn't Scott Adams (author of Dilbert) once work there too?

him121 · Post by **him121** » Mon Aug 22, 2005 9:57 pm

Thanks Huy for help.

But I used Universe stage ..with User difined query like
Select code from HASH where L_lmt <= input And U_lmt >= input;
and I did Lookup with UV Stage directly...
but here i needed to conver ..decimal values into Varchar...

Thanks...anyway

dial70 · Post by **dial70** » Tue Nov 22, 2005 4:00 am

I checked the function to get the surrkey, but

kcbland wrote:
LOCATE FindDate IN TimeVariantStartDates BY "DL" SETTING Position Then
Found = @TRUE
End

Kenneth Bland

LOCATE ever returns the second SurrKey, if FindDate is not greater than the first Date.
iE: FindDate = 2005-10-10
TimeVariantStartDates = 2005-11-15|2005-11-01|2005-10-15|2005-10-01
SurrKeys= 40|30|20|10

Result = 30 !

Does this thing really works?
What is my mistake?

DataStage: 7.5 at SunOS 5.8

kduke · Post by **kduke** » Tue Nov 22, 2005 7:42 am

I think it should be:

Code: Select all

LOCATE FindDate IN TimeVariantStartDates<1> BY "DL" SETTING Position Then 
   Found = @TRUE 
End

And it works great if done right.

kcbland · Post by **kcbland** » Tue Nov 22, 2005 7:43 am

I corrected the code, it seems the "LOCATE FindDate IN TimeVariantStartDates" should have been "LOCATE FindDate IN TimeVariantStartDates<1>".

The choice between @VM and @FM was subtle, as the point is a delimited array of values. Somehow I mixed up the two and never posted the corrected logic.

Sorry for the delay in correcting old posts.