Best approach for determining changes

altis
Premium Member
Posts: 11
Joined: Thu Mar 29, 2007 1:38 am

Best approach for determining changes

Post by altis »

Hi everyone,

I am loading an Oracle table with an incremental set of data on a daily basis. Due to the nature of the incremental extract, some records will be new, some will be updates, but many will be unchanged. I am planning to use the Change Data Capture stage to determine which of the incremental set of records are new, updates or unchanged.

The incoming incremental set of data will be much smaller than the target table that I need to look up to compare the records against: the incremental record-set is relatively small (10,000-20,000 records per day), while the table I will be comparing to is much larger, i.e. over a million rows and growing each day.

Is the Change Data Capture stage the best approach when there is a marked difference in the number of records between the 'before' and 'after' datasets?
Altis Consulting Pty Ltd
www.altis.com.au
keshav0307
Premium Member
Posts: 783
Joined: Mon Jan 16, 2006 10:17 pm
Location: Sydney, Australia

Post by keshav0307 »

I don't think you need a Change Data Capture stage.

What do you want to do after determining whether the records are changed, unchanged or new?
Do you want to maintain history for the updated records, or simply update the record?
vmcburney
Participant
Posts: 3593
Joined: Thu Jan 23, 2003 5:25 pm
Location: Australia, Melbourne
Contact:

Post by vmcburney »

You can either use a sparse lookup to send your 10,000-20,000 SQL compare statements to the target database, or use a normal lookup that will copy the data to your DataStage server. If you just join to the target table and put a link into the Change Capture stage, it will, at job startup time, copy the data to the DataStage server into temporary sorted tables. So you will be copying all your data before your job can start.
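
For illustration, with the sparse lookup option each incoming row fires a keyed query against the target, roughly of this shape (table and column names below are placeholders, and the exact key-binding syntax depends on the connector you use):

    SELECT cust_key, col1, col2
    FROM   target_table
    WHERE  cust_key = :incoming_cust_key  -- key value bound from the incoming row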

You can manage the shadow table on the DataStage server rather than automatically regenerating it each time your job runs. You can schedule a job to create the shadow table as a parallel dataset - this gives you control over the partitioning and sorting of that dataset. It also lets you build the dataset at a time that suits you - for example, at a low-usage time. You could also try to keep the shadow table up to date using the Change Apply stage, where you apply changes to both the target table and the shadow table. This is somewhat risky, as the shadow table could fall out of sync, so you should regularly refresh it from the target table (say once a week).
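
If you go with the scheduled refresh, that job is essentially a full extract of just the key and comparison columns from the target, written out to the dataset; something along these lines (again, names are only placeholders):

    SELECT cust_key,   -- business key the change comparison is keyed on
           col1,       -- only the columns you actually compare
           col2
    FROM   target_table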

So there are lots of ways to do it. One million rows isn't a huge volume, so try benchmarking the default design and see if the performance and anticipated growth fit your requirements.
John Smith
Charter Member
Posts: 193
Joined: Tue Sep 05, 2006 8:01 pm
Location: Australia

Post by John Smith »

CDC is fine. One million records is not much. We use CDC on much higher volumes, e.g. up to 1 billion rows, and it still performs. But make sure you configure your DS server properly and that you partition your data all the way through, meaning no bottlenecks. If your jobs are developed properly for the architecture of DS, they will scale; the caveat is that you might need to add more hardware (e.g. more CPUs).
altis
Premium Member
Posts: 11
Joined: Thu Mar 29, 2007 1:38 am

Post by altis »

Cheers for the feedback.

James
Altis Consulting Pty Ltd
www.altis.com.au
prasannak
Premium Member
Posts: 56
Joined: Thu Mar 20, 2008 9:45 pm
Contact:

CDC

Post by prasannak »

Hi,

We are facing similar CDC-related performance issues in some of our heavy-volume jobs.

The scenario for us: the target dimension has about 20 million records, and the incoming source data (Oracle) also has a total of 20 million records.
Since the source is not a fact, there seems to be no way to filter the incoming dataset before bringing it into DataStage.

One option being explored is to do this data-subset operation in Oracle, by applying MINUS logic between source and target (source to target) to get the changed and new data from the source.
This would substantially reduce the number of rows coming from the source, but put the burden on the Oracle database instead.

Since CDC brings in all the data from the target and loads it into memory (20 million rows and growing), do the Oracle MINUS operation from target to source as well, to reduce the data comparison.

So the net effect is that we get a reduced volume from the source and a reduced volume from the target, and then do the CDC operation.

Obviously this would put more burden on the Oracle database, but with the reduced volume DataStage can do its part much quicker.
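
To make the idea concrete, the two MINUS queries would look roughly like this (table and column names are just placeholders):

    -- source rows that are new or changed relative to the target
    SELECT cust_key, col1, col2 FROM source_table
    MINUS
    SELECT cust_key, col1, col2 FROM target_table;

    -- target rows that differ from the source (old versions of changed rows, or rows missing from the source)
    SELECT cust_key, col1, col2 FROM target_table
    MINUS
    SELECT cust_key, col1, col2 FROM source_table;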

This is all just theory so far; we are going to POC it.
Any problems with this, or suggestions for alternate approaches, are welcome!
gabrielac
Participant
Posts: 29
Joined: Mon Sep 26, 2005 3:39 pm

Post by gabrielac »

It depends on the hardware of the Oracle DB and the DataStage server, as well as the workload on both at the moment the job is run. In our case, with similar volumes, it was quicker letting DataStage do all the work.