Hybrid SCD Type handling...

Post questions here relative to DataStage Enterprise/PX Edition for such areas as Parallel job design, Parallel datasets, BuildOps, Wrappers, etc.

Moderators: chulett, rschirm, roy

kaps
Participant
Posts: 452
Joined: Tue May 10, 2005 12:36 pm

Post by kaps »

I am wondering what the best way is to deal with hybrid SCD types in a single table in DataStage 8.7. By hybrid SCD types I mean that I don't need to create a new version of a record when just any column in the table changes, only when certain columns change; otherwise I just need to update the record in place.

I am thinking of joining the current data with the previous day's data using a Join stage and then checking for Type 2 changes by comparing each and every Type 2 column in stage variables, but that seems like a cumbersome process when a lot of columns are involved, and it becomes even more difficult when the columns are nullable...
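
Outside DataStage, the per-row logic those stage variables would implement can be sketched in plain Python. This is a hypothetical illustration only: the column names, the `curr_ind` flag, and its Y/N values are all assumptions, not anything from the actual job.

```python
# Hypothetical sketch of the stage-variable logic: compare only the
# Type-2 columns, treating NULL (None) as an ordinary value so that
# NULL -> value and value -> NULL both count as changes.
TYPE2_COLS = ["addr", "region"]   # assumed names: columns that trigger a new version
TYPE1_COLS = ["phone"]            # assumed names: columns overwritten in place

def type2_changed(curr, prev):
    """Null-safe comparison restricted to the Type-2 columns."""
    return any(curr.get(c) != prev.get(c) for c in TYPE2_COLS)

def process(curr, prev):
    """Return the (action, row) pairs for one incoming record."""
    if prev is None:
        return [("insert", {**curr, "curr_ind": "Y"})]   # new business key
    if type2_changed(curr, prev):
        closed = {**prev, "curr_ind": "N"}               # expire old version
        opened = {**curr, "curr_ind": "Y"}               # open new version
        return [("update", closed), ("insert", opened)]
    # only Type-1 (or no) changes: update the existing row in place
    merged = {**prev, **{c: curr[c] for c in TYPE1_COLS}}
    return [("update", merged)]
```

Because `None != None` is false in Python, unchanged NULLs compare equal without any special casing, which is exactly the part that gets awkward in stage-variable expressions.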

I don't think we can use the Change Capture stage, as it does not tell us which column changed... Is there a better approach to this?

Just FYI... we are new to 8.7.

Thanks for your time...
ray.wurlod
Participant
Posts: 54607
Joined: Wed Oct 23, 2002 10:52 pm
Location: Sydney, Australia
Contact:

Post by ray.wurlod »

Same answer as in 8.5. In the SCD stage some columns can be identified for Type 1 changes and other columns can be identified for Type 2 changes, even in the same data set.
IBM Software Services Group
Any contribution to this forum is my own opinion and does not necessarily reflect any position that IBM may hold.
kaps
Participant
Posts: 452
Joined: Tue May 10, 2005 12:36 pm

Post by kaps »

Ray,
Thanks for the reply. We have moved from 7.5 to 8.7, so I need to play with the SCD stage. Never used it before... I will try that...
kaps
Participant
Posts: 452
Joined: Tue May 10, 2005 12:36 pm

Post by kaps »

I am configuring the SCD stage in 8.7. The job does not fail, but it does not give me the expected result. For testing purposes, I changed a Type-2 column and ran the job, which generated only one record on the DimUpdate link, the insert record. It did not create the corresponding update record (to close the existing record).

Job Design is :

Sequential File source, SCD stage with a DB2 Connector as the lookup, and DimUpdate and FactUpdate sequential files as the two outputs. In the SCD stage's Lookup tab I set up the keys for the join. The fields I am joining on I defined as business keys, and I defined the other columns as Type 1, Type 2, or blank as appropriate...

I hope my question is clear. What am I missing here?

Thanks for your time...
SURA
Premium Member
Posts: 1229
Joined: Sat Jul 14, 2007 5:16 am
Location: Sydney

Post by SURA »

Hope this link helps you:

viewtopic.php?t=142965&highlight=SCd
Thanks
Ram
----------------------------------
Revealing your ignorance is fine, because you get a chance to learn.
kaps
Participant
Posts: 452
Joined: Tue May 10, 2005 12:36 pm

Post by kaps »

Sura - I got it... It seems I had the links interchanged...

Another question: now we have to UPSERT into the dimension table, correct?
But one of the best practices says NOT to do upserts, as the database has to check every record to determine whether it is an insert or an update. So how does that affect performance?

Can we split the file into two using the curr_indic column and do the inserts and updates separately?
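
The split being proposed can be sketched as follows. This is only an illustration of the idea, not the job itself: the `curr_indic` column name comes from the post, but its Y/N values and the meaning attached to them here are assumptions.

```python
# Hypothetical sketch of splitting the SCD output on a current-indicator
# column so inserts and updates can be bulk loaded separately, instead
# of upserting row by row.
def split_by_indicator(rows):
    inserts, updates = [], []
    for row in rows:
        # Assumed convention: new current versions ('Y') are inserts;
        # expired versions ('N') are updates that close existing records.
        (inserts if row["curr_indic"] == "Y" else updates).append(row)
    return inserts, updates
```

Each output list could then feed a bulk insert and a bulk update respectively.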
SURA
Premium Member
Posts: 1229
Joined: Sat Jul 14, 2007 5:16 am
Location: Sydney

Post by SURA »

Whether it is going to be insert or upsert is a business decision, so I can't say yes or no.

Regarding best practice, I don't find much difference, especially in the project where I am working.

Before we started using the SCD stage I did some tests:

1. Join --> Transformer, then two files for insert and update as separate loads.

2. Change Capture with upsert, as well as insert and update as separate loads.

3. SCD with upsert, as well as insert and update as separate loads.

The test used 100,000 records with 10-12 columns, as far as I can remember.

I couldn't find much difference in performance.

Note: by "not much difference" I mean it wasn't a case of one approach completing in 5 minutes while another took either 20 minutes or 2 minutes.

Again, it depends on PEOPLE vs PROJECT.

The decision is yours; this is just my view.
ray.wurlod
Participant
Posts: 54607
Joined: Wed Oct 23, 2002 10:52 pm
Location: Sydney, Australia
Contact:

Post by ray.wurlod »

To work properly the SCD stage needs, in theory, to get the current row into the dimension table before the next row is processed. Stay with upsert, an array size of 1, and auto-commit.
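
The effect of array size 1 with per-row commit can be sketched like this, with Python's sqlite3 standing in for DB2 purely for illustration; the table and column names are assumptions.

```python
# Hypothetical sketch of row-at-a-time upsert with a commit per row:
# each row is applied and committed before the next one is processed,
# so a repeated business key sees the earlier row's effect.
import sqlite3

def upsert_rows(conn, rows):
    for row in rows:                          # array size 1: one row per statement
        cur = conn.execute("UPDATE dim SET val = ? WHERE bkey = ?",
                           (row["val"], row["bkey"]))
        if cur.rowcount == 0:                 # no existing row -> insert
            conn.execute("INSERT INTO dim (bkey, val) VALUES (?, ?)",
                         (row["bkey"], row["val"]))
        conn.commit()                         # auto-commit per row
```

With a larger array size or deferred commit, a second occurrence of the same business key in the stream could be processed before the first one is visible in the table, which is the failure mode being warned about.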
kaps
Participant
Posts: 452
Joined: Tue May 10, 2005 12:36 pm

Post by kaps »

Ray - Isn't what you are describing true only if I insert directly into the dimension table in the same job from the SCD stage? What difference does it make if I just load the rows to a file and do a bulk insert and update, given that I have to update all the history records with the Type 1 values, not just the last one?

Please correct me if I am wrong...
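
The Type-1 point in the question, that the overwrite must hit every historical version of a business key, can be sketched as a simple in-memory pass. The field names and the `curr_ind` flag are assumptions for illustration.

```python
# Hypothetical sketch: a Type-1 change overwrites the attribute on
# every version of the business key, current and expired alike, which
# is why it maps naturally onto a single bulk update.
def apply_type1(history, bkey, type1_values):
    """Overwrite Type-1 attributes on all versions of one business key."""
    for row in history:
        if row["bkey"] == bkey:        # every version, not just curr_ind == 'Y'
            row.update(type1_values)
    return history
```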
ray.wurlod
Participant
Posts: 54607
Joined: Wed Oct 23, 2002 10:52 pm
Location: Sydney, Australia
Contact:

Post by ray.wurlod »

Under your scenario how would the SCD stage be able to look up a business key that occurs a second time in the stream input?

(In practice the dimension table is, at least partially, memory-mapped.)
kaps
Participant
Posts: 452
Joined: Tue May 10, 2005 12:36 pm

Post by kaps »

Yes, you are correct, but in my case it's a batch process and we won't get two records for the same business key in a single load, as we extract from the OLTP systems at a certain time each day. Does that make sense?