Processing multiple business key changes in a Type2 SCD

Post questions here relative to DataStage Enterprise/PX Edition for such areas as Parallel job design, Parallel datasets, BuildOps, Wrappers, etc.


chulett
Charter Member
Posts: 43085
Joined: Tue Nov 12, 2002 4:34 pm
Location: Denver, CO


Post by chulett »

Looking for some conceptual help on a particular Type 2 SCD issue, regardless of the tool used. Technically, I have this with Informatica right now, but it's not all that different, and I'm looking more for methodology than implementation specifics. Besides, the peoples are way more smarter here. :wink:

Specifically, I have a job/mapping that does everything just fine - insert new records, insert/update for changes, logical deletes, the whole nine yards - as long as there is only one change per business key to be processed in any given run. When I stack up several change cycles in my source, things don't quite work so well. Design (abstracted) looks like this:

Code:

source -> lookup -> router +----- new record insert ----------> target
                           +----change +-> insert new change -> target
                                       +-> update old record -> target
                           +----logical delete update --------> target
The lookup is dynamic and determines the type of change for me; the router splits the appropriate information off to further pipelines. The targets are all Oracle, doing plain old SQL inserts or updates, nothing fancy, and all four hook to the same table, the dimension target. That last bit is the rub, it seems.
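To make the routing concrete, here is a minimal sketch (Python, purely illustrative; `classify_row` and the dictionary-based lookup are my own names, not anything out of Informatica or DataStage) of the change-type decision that the dynamic lookup and router perform:

```python
def classify_row(row, dim_current):
    """Classify one source row against the current dimension rows,
    keyed by business key, the way the dynamic lookup/router does."""
    current = dim_current.get(row["bk"])           # dynamic lookup by business key
    if current is None:
        return "insert"                            # brand-new business key
    if row.get("deleted"):
        return "delete"                            # logical delete of an existing key
    if row["attrs"] != current["attrs"]:
        return "change"                            # Type 2 change: expire old, insert new
    return "unchanged"

# One current dimension row for business key K1.
dim_current = {"K1": {"attrs": "a"}}
```

Each outcome maps to one of the four router links above. Note that the lookup only sees what was in the dimension when the run started, which is exactly why stacked changes for one key misbehave.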

Much like the OE stages, those targets all create separate connections/sessions into Oracle, so separate transactions that don't have a clue what the others are doing. In Server I would use a single target stage with four links and be done; here (as in PX) that's not possible.

This all works fine for singleton changes. Stack up multiple changes for the same business key and... workie not so much. Updates of records that may or may not have been inserted yet are not reliable. With the difference in metadata between the inserts and updates, I can't really funnel things together, never mind the fact that there's no such concept with Infa. So, how are people handling this?

I know I could land the updates and only complete processing for the inserts here and then build a separate mapping to handle the updates after all of the inserts have been committed. I'm looking for a way to do this all at once, if possible. We've talked about 'folding' the updates and inserts together, if that makes sense. Meaning, only send inserts to the database but make them look like they would after the in-flight change data had been applied - two logical operations, one physical operation. Just not quite sure how to manage all that and cover 100% of the scenarios we could run into over time. :?
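The 'folding' idea could be sketched like this (Python, illustrative only; `fold_changes` is a hypothetical name). Only the final state per business key becomes a physical insert, plus one expiry update for the pre-existing row, so the database never has to update a row that hasn't been committed yet. The trade-off is that intermediate versions in the stacked changes never become history rows:

```python
def fold_changes(changes, dim_current):
    """Collapse stacked changes per business key into one physical operation
    each: an expiry update for the pre-existing row (if any) and a single
    insert that already looks the way it would after all in-flight changes."""
    last = {}                            # dicts are insertion-ordered in Python 3.7+
    for row in changes:                  # changes assumed to arrive in source order
        last[row["bk"]] = row            # later changes overwrite earlier ones
    expiries = [bk for bk in last if bk in dim_current]   # only keys that already exist
    inserts = list(last.values())        # one insert per key, final state only
    return expiries, inserts
```

Two logical operations, one physical operation, as described; covering 100% of scenarios (deletes interleaved with changes, for instance) would need more branches than this sketch has.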

Informatica has variable ports, much like stage variables, but no constraint concept. For that, I would need to filter post-transformation. Either that, or aggregate to keep the 'last' record per business key?
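That 'last record per business key' filter can be done stage-variable style, comparing each row to the previous one, assuming the input is sorted by business key and change sequence (a sketch under exactly those assumptions; the function name is mine):

```python
def last_per_key_sorted(rows):
    """Emit only the last row of each group, assuming input sorted by
    business key (and change sequence within the key): a key-change
    filter, as one would build with stage variables plus a constraint."""
    out, prev = [], None
    for row in rows:
        if prev is not None and row["bk"] != prev["bk"]:
            out.append(prev)             # key changed: previous row was its key's last
        prev = row
    if prev is not None:
        out.append(prev)                 # flush the final group
    return out
```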

For some reason, this really isn't coming together for me, long days with short nights and paint fumes seem to be wreaking havoc on my cognitive abilities at the moment... thus the reach out to the forum.

Any thoughts on this would be most appreciated.
-craig

"You can never have too many knives" -- Logan Nine Fingers
ray.wurlod
Participant
Posts: 54595
Joined: Wed Oct 23, 2002 10:52 pm
Location: Sydney, Australia
Contact:

Post by ray.wurlod »

My current site performs the "expiry" updates first, and queues up the inserts. It seems not only to perform better but is also easier to unwind (from a snapshot) if the update phase fails. Since the inserts all have new surrogate keys, they can happily be bulk loaded. Database is DB2, but that should not be relevant.
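As a sketch of that ordering (Python, illustrative; `run_update` and `bulk_load` stand in for whatever database calls the site actually uses): all expiry updates go first and can be unwound as one unit from the snapshot, then the queued inserts are bulk-loaded, which is safe because every insert carries a brand-new surrogate key:

```python
def expire_then_insert(expiries, inserts, run_update, bulk_load):
    """Two-phase apply: expiry updates first, then bulk-load the queued
    inserts. run_update and bulk_load are hypothetical database callables."""
    for bk in expiries:
        run_update(bk)        # phase 1: close off the current row for each key
    bulk_load(inserts)        # phase 2: one bulk load; new surrogate keys, no conflicts
```

(As the follow-up post points out, this by itself still assumes at most one change per business key in flight.)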
IBM Software Services Group
Any contribution to this forum is my own opinion and does not necessarily reflect any position that IBM may hold.

Post by chulett »

That sounds fine but does it really handle multiple changes to the same business key? It doesn't seem like I can save the inserts until last unless I know there will always only be one per key, otherwise those 'expiry' updates would be for inserts I haven't done yet. Yes? :?

Post by ray.wurlod »

That's true (at least at first thought). Sounds like you need to process into a working area that has been pre-loaded from the target before the final load, as you suggested. I cannot think of an approach in a parallel job that would allow you to do this all at once.

Curiously there IS a one-job solution using a server job, where a hashed file is pre-loaded from target and written to and read from the same Transformer stage (with caching disabled).
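A toy model of that server-job pattern (Python; the dict stands in for the hashed file pre-loaded from the target, and writing back immediately models caching being disabled, so reads see earlier writes in the same run; all names are mine):

```python
def process_with_loopback(rows, dim):
    """Read and write the same lookup structure in one pass, so a second
    change for a business key sees the row the first change just created."""
    ops = []
    for row in rows:
        current = dim.get(row["bk"])              # read: reflects earlier writes this run
        if current is not None:
            ops.append(("expire", row["bk"], current["attrs"]))
        ops.append(("insert", row["bk"], row["attrs"]))
        dim[row["bk"]] = {"attrs": row["attrs"]}  # write back at once (caching disabled)
    return ops
```

Two stacked changes for the same key now produce expire/insert pairs in the right order, which is exactly what the separate parallel target connections cannot guarantee.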

Post by chulett »

Yah, noted earlier that if this was a Server job, I'd have been done with it in two shakes. :wink:

Still pondering this, and whether it is really worth all the effort for a one-job solution.

Post by ray.wurlod »

Pity your initial post didn't begin "I have a requirement...". I could have answered that! ;)

Post by chulett »

Yah... I wonder what that answer would have been? :lol:

Post by ray.wurlod »

Actually I just re-read the original post. It seems to me that this is just what the SCD stage is supposed to be able to do - by caching the (dimension) table in memory.

Post by chulett »

Too bad then that I'm not actually doing this in DataStage. :(