
Processing multiple business key changes in a Type 2 SCD

Posted: Sat Sep 04, 2010 9:19 am
by chulett
Looking for some conceptual help on a particular Type 2 SCD issue, regardless of the tool used. Technically, I have this in Informatica at the moment, but it's not all that different, and I'm looking more for methodology than implementation specifics right now. Besides, the peoples are way more smarter here. :wink:

Specifically, I have a job/mapping that does everything just fine - insert new records, insert/update for changes, logical deletes, the whole nine yards - as long as there is only one change per business key to be processed in any given run. When I stack up several change cycles in my source, things don't quite work so well. Design (abstracted) looks like this:

source -> lookup -> router +----- new record insert ----------> target
                           +----change +-> insert new change -> target
                                       +-> update old record -> target
                           +----logical delete update --------> target
The lookup is dynamic and determines the type of change for me; the router splits the appropriate information off to further pipelines. The targets are all Oracle doing plain old SQL inserts or updates, nothing fancy, and all four hook to the same table, the dimension target. That last bit is the rub, it seems.
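To make the routing step concrete, here is a minimal Python sketch of the lookup-and-route logic described above. The field names (`bus_key`, `attrs`) and the `dim_lookup` dict are illustrative assumptions, not the actual mapping's metadata:

```python
def route(rows, dim_lookup):
    """Classify each source row against the active dimension rows:
    brand-new key, changed attributes, or no change at all."""
    new_rows, changes, unchanged = [], [], []
    for row in rows:
        current = dim_lookup.get(row["bus_key"])
        if current is None:
            new_rows.append(row)            # no active dimension row yet
        elif current["attrs"] != row["attrs"]:
            changes.append((current, row))  # expire old version, insert new one
        else:
            unchanged.append(row)
    return new_rows, changes, unchanged
```

Note this sketch has exactly the weakness discussed below: `dim_lookup` is a static snapshot of the target, so a second change for the same key in one run is classified against stale data unless the lookup is kept current.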

Much like the OE stages, those targets all create separate connections/sessions into Oracle, so separate transactions that don't have a clue what the others are doing. In Server I would use a single target stage with 4 links and be done; here (like PX) that's not possible.

This all works fine for singleton changes. Stack up multiple changes for the same business key and... workie not so much. Updates of records that may or may not have been inserted yet are not reliable. With the difference in metadata between the inserts and the updates, I can't really funnel things together, never mind the fact that there's no such concept in Infa. So, how are people handling this?

I know I could land the updates and only complete processing for the inserts here and then build a separate mapping to handle the updates after all of the inserts have been committed. I'm looking for a way to do this all at once, if possible. We've talked about 'folding' the updates and inserts together, if that makes sense. Meaning, only send inserts to the database but make them look like they would after the in-flight change data had been applied - two logical operations, one physical operation. Just not quite sure how to manage all that and cover 100% of the scenarios we could run into over time. :?
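The 'folding' idea above can be sketched like this: group the in-flight changes per business key, let each in-flight version carry the end timestamp the next in-flight change would have applied, and emit only one expiry update per key (for the row already in the target). The row shapes, the timestamp fields, and the `existing_keys` parameter are illustrative assumptions:

```python
from collections import defaultdict

def fold_changes(changes, existing_keys):
    """Collapse multiple change cycles per business key into one pass.

    changes: ordered list of dicts with bus_key, attrs, effective_ts.
    existing_keys: business keys that already have an active target row.
    Returns (inserts, expiry_updates), where each insert already looks
    the way it would after the later in-flight changes had been applied.
    """
    by_key = defaultdict(list)
    for chg in changes:
        by_key[chg["bus_key"]].append(chg)

    inserts, expiry_updates = [], []
    for key, versions in by_key.items():
        if key in existing_keys:
            # One physical update: expire the pre-existing active row.
            expiry_updates.append({"bus_key": key,
                                   "end_ts": versions[0]["effective_ts"]})
        for i, v in enumerate(versions):
            nxt = versions[i + 1]["effective_ts"] if i + 1 < len(versions) else None
            # Two logical operations, one physical operation: the insert
            # is born already closed out by the next in-flight change.
            inserts.append({**v, "end_ts": nxt})
    return inserts, expiry_updates
```

This covers stacked changes and never updates a row that hasn't been inserted yet, since intermediate versions only ever exist as pre-expired inserts.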

They have variable ports just like stage variables, but no constraint concept. For that, I would need to filter post-transformation. Either that or aggregate for the 'last' record per business key?

For some reason, this really isn't coming together for me, long days with short nights and paint fumes seem to be wreaking havoc on my cognitive abilities at the moment... thus the reach out to the forum.

Any thoughts on this would be most appreciated.

Posted: Sat Sep 04, 2010 5:49 pm
by ray.wurlod
My current site performs the "expiry" updates first, and queues up the inserts. It seems not only to perform better but is also easier to unwind (from a snapshot) if the update phase fails. Since the inserts all have new surrogate keys, they can happily be bulk loaded. Database is DB2, but that should not be relevant.
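A rough sketch of that expiry-first sequencing, under stated assumptions: phase 1 applies the "expiry" updates as one unit of work, phase 2 bulk loads the queued inserts under brand-new surrogate keys. The dict-based stand-in for the database, the field names, and `next_key` are all illustrative, not DB2 specifics:

```python
from itertools import count

def expiry_first_load(expiry_updates, queued_inserts, db, next_key=count(1000).__next__):
    # Phase 1: close out the superseded versions first. If this phase
    # fails, the target can be unwound from a snapshot before any
    # inserts have landed.
    for upd in expiry_updates:
        db[upd["sk"]]["end_ts"] = upd["end_ts"]
    # Phase 2: every insert gets a brand-new surrogate key, so the rows
    # cannot collide with anything phase 1 touched and can happily be
    # bulk loaded.
    for row in queued_inserts:
        sk = next_key()
        db[sk] = {**row, "sk": sk, "end_ts": None}
```

As the follow-up below points out, this ordering presumes each expiry targets a row that is already in the table, i.e. at most one change per business key per run.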

Posted: Sat Sep 04, 2010 8:22 pm
by chulett
That sounds fine but does it really handle multiple changes to the same business key? It doesn't seem like I can save the inserts until last unless I know there will always only be one per key, otherwise those 'expiry' updates would be for inserts I haven't done yet. Yes? :?

Posted: Sun Sep 05, 2010 1:15 am
by ray.wurlod
That's true (at least at first thought). Sounds like you need to process into a working area that has been pre-loaded from the target, before the final load, as you suggested. I cannot think of an approach in a parallel job that would allow you to do this all at once.

Curiously there IS a one-job solution using a server job, where a hashed file is pre-loaded from target and written to and read from the same Transformer stage (with caching disabled).
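That pattern can be sketched in Python: a plain dict stands in for the hashed file, pre-loaded from the target, and each row both reads and writes it (the "caching disabled" part), so the second change for a business key sees the first. The field names are illustrative assumptions:

```python
def process(rows, active):
    """active: bus_key -> currently active version, preloaded from target.
    Returns (inserts, expiries) for a single ordered target connection."""
    inserts, expiries = [], []
    for row in rows:
        prev = active.get(row["bus_key"])
        if prev is None:
            inserts.append(dict(row, end_ts=None))
        elif prev["attrs"] != row["attrs"]:
            # Expire whichever version is current -- including one
            # inserted earlier in this same run, which only works
            # because one ordered connection applies each insert
            # before the expiry that closes it.
            expiries.append({"bus_key": row["bus_key"],
                             "effective_ts": prev["effective_ts"],
                             "end_ts": row["effective_ts"]})
            inserts.append(dict(row, end_ts=None))
        else:
            continue  # no change; leave the active version alone
        active[row["bus_key"]] = row  # write-back: the next change sees this one
    return inserts, expiries
```

The write-back on the last line is the whole trick: the working copy, not the target, is the source of truth for "current version" during the run.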

Posted: Sun Sep 05, 2010 7:25 am
by chulett
Yah, noted earlier that if this was a Server job, I'd have been done with it in two shakes. :wink:

Still pondering this and if it is really worth all the effort for a one job solution.

Posted: Sun Sep 05, 2010 2:59 pm
by ray.wurlod
Pity your initial post didn't begin "I have a requirement...". I could have answered that! ;)

Posted: Sun Sep 05, 2010 5:23 pm
by chulett
Yah... I wonder what that answer would have been? :lol:

Posted: Sun Sep 05, 2010 6:12 pm
by ray.wurlod
Actually I just re-read the original post. It seems to me that this is just what the SCD stage is supposed to be able to do - by caching the (dimension) table in memory.

Posted: Sun Sep 05, 2010 8:08 pm
by chulett
Too bad then that I'm not actually doing this in DataStage. :(