Page 1 of 1

Hybrid SCD Type handling...

Posted: Wed Apr 25, 2012 6:41 pm
by kaps
I am wondering what is the best way to deal with Hybrid SCD types in a single table in Datastage 8.7 ? When I say hybrid SCD types, I meant, I don't need to create a new version of a record for any column in a table changes rather only if certain column changes otherwise I just need to update the record.

I am thinking of joining the current data with previous day's data using a Join stage and then check for Type2 column changes by comparing each and every Type2 columns in stage variables but it seems like a cumbersome process if there are lot of columns involved and it makes it more difficult if the columns are nullable...

I don't think we can use Change Capture Stage as it does not tell us what column is changing...Is there a better approach to this ?

Just FYI...We are new to 8.7

Thanks for yout time..

Posted: Wed Apr 25, 2012 8:22 pm
by ray.wurlod
Same answer as in 8.5. In the SCD stage some columns can be identified for Type 1 changes and other columns can be identified for Type 2 changes, even in the same data set.

Posted: Wed Apr 25, 2012 9:01 pm
by kaps
Ray,
Thanks for the reply. We have moved from 7.5 to 8.7 so need to play with SCD stage. Never used it before...I will try that...

Posted: Mon Apr 30, 2012 4:36 pm
by kaps
I am configuring SCD stage in 8.7. Job does not fail but it does not give me the expected result. For testing purpose, I have changed a Type-2 column and ran the job which only generated 1 record in the DimUpdate link which is Insert record. It did not create the record for corresponding update(to close the existing record).

Job Design is :

Sequential File, SCD Stage with lookup as DB2 connector, DimUpdate seq file and FactUpdate seq file as two outputs.In SCD stage, Lookup tab, did set up the keys for Join. Whatever the fields I am joining on I defined as Business keys, defined other columns as Type1, Type2, Blank appropriately...

I hope I am clear with my question. What am I missing here ?

Thanks for your time...

Posted: Mon Apr 30, 2012 5:06 pm
by SURA
Hope this link may help you.


viewtopic.php?t=142965&highlight=SCd

Posted: Mon Apr 30, 2012 6:28 pm
by kaps
Sura - I got it...Seems like I have interchanged the links...

Another question is now we have to UPSERT into the Dim table. Correct ?
But one of the best practice says that DO NOT DO upserts as it has to check for every record to make sure if it's Insert or Update. So how does it affect the performance ?

Can we split the file into two using curr_indic column and do insert and updates ?

Posted: Mon Apr 30, 2012 8:30 pm
by SURA
It is going to be Insert / Upsert is Business decision and i can't say YES / No.

In regards with best practice, i don't find much difference especially in this project where i am working.

Before we start using the SCD stage i did some test.

1. Using JOIN --> TFM and then 2 files for insert and update as a separate load.

2. Change Capture using UPSERT as well as insert and update as a separate load.

3. SCD UPSERT as well as insert and update as a separate load.

100,000 records for test having 10 - 12 columns, but i can't remember.

I can't find much difference in relates with performance.

Notes: No much difference I mean in one approach say it is completing in 5 min whereas in other it is taking either 20 min / 2 min.

Again it dependes PEOPLE vs PROJECT.

Decission is yours and this is my view.

Posted: Mon Apr 30, 2012 8:32 pm
by ray.wurlod
To work properly the SCD stage needs, in theory, to get the current row in to the dimension table before the next row is processed. Stay with upsert, array size 1 and auto-commit.

Posted: Tue May 01, 2012 9:30 am
by kaps
Ray - Is not what you are describing is true If I Insert directly into the Dimension table in the same job from SCD stage ? What difference does it make if I just load them to the file and do bulk insert and update as I have to update all the history record with Type1 values not just the last one.

Please correct me If I am wrong...

Posted: Tue May 01, 2012 4:13 pm
by ray.wurlod
Under your scenario how would the SCD stage be able to look up a business key that occurs a second time in the stream input?

(In practice the dimension table is, at least partially, memory-mapped.)

Posted: Wed May 02, 2012 9:57 am
by kaps
Yes. You are correct but in my case it's a batch process and we won't get two records for same business key in a load as we extract from OLTP systems at certain time in a day. Does it make sense ?