Hi everyone,
I am loading an Oracle table with an incremental set of data on a daily basis. Due to the nature of the incremental extract, some records will be new, some will be updates but many will be unchanged. I am planning to use the Change Data Capture stage to determine which of the incremental set of records are new, updates or unchanged.
The incoming incremental set is much smaller than the target table I need to look up to compare against: 10,000-20,000 records per day, versus over a million rows in the target table, which grows each day.
Is it the best approach to use the Change Data Capture stage when there is a marked difference in number of records between the 'before' and 'after' datasets?
Best approach for determining changes
Altis Consulting Pty Ltd
www.altis.com.au
You can either use a sparse lookup to send your 10,000-20,000 SQL compare statements to the target database or use a normal lookup that will copy the data to your DataStage server. If you just join to the target table and put a link into the Change Capture stage it will, at job startup time, copy the data to the DataStage server into temporary sorted tables. So you will be copying all your data before your job can start.
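The trade-off can be sketched outside DataStage: a sparse lookup fires one compare query per incoming row against the target, rather than copying the whole target across before the job starts. A minimal sketch using SQLite as a stand-in for Oracle; the table and column names are hypothetical:

```python
import sqlite3

# Stand-in for the large Oracle target; table and column names are hypothetical.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE target (cust_id INTEGER PRIMARY KEY, name TEXT)")
conn.executemany("INSERT INTO target VALUES (?, ?)", [(1, "Alice"), (2, "Bob")])

# The daily incremental set (10,000-20,000 rows in practice).
incoming = [(1, "Alice"), (2, "Robert"), (3, "Carol")]

# Sparse-lookup style: one compare query per incoming row, so the large
# target table never has to be copied to the DataStage server.
results = {}
for cust_id, name in incoming:
    row = conn.execute("SELECT name FROM target WHERE cust_id = ?",
                       (cust_id,)).fetchone()
    if row is None:
        results[cust_id] = "insert"   # new record
    elif row[0] != name:
        results[cust_id] = "update"   # changed record
    else:
        results[cust_id] = "copy"     # unchanged record
print(results)
```

With 10,000-20,000 incoming rows this means that many round trips to the database, which is the cost you weigh against copying the million-plus-row target across the network.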
You can manage the shadow table on the DataStage server rather than automatically regenerate it each time your job runs. You can schedule a job to create the shadow table as a parallel dataset - this gives you control over the partitioning and sorting of that dataset. It lets you build the dataset at a time that suits you - for example you can build the dataset at a low usage time. You could also try to keep the shadow table up to date using the Change Apply Stage where you apply changes to both the target table and the shadow table. This is somewhat risky as the shadow table could fall out of synch so you should regularly refresh the shadow table from the target table (say once a week).
So there are lots of ways to do it. One million rows isn't a huge volume so try benchmarking the default design and see if the performance and anticipated growth fits your requirements.
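The shadow-maintenance idea is essentially what the Change Apply stage does: take the change records emitted by the compare and apply them to the copy. A minimal sketch, with an in-memory dict standing in for the shadow dataset and illustrative change codes:

```python
# Shadow copy of the target, keyed on the business key (illustrative data).
shadow = {1: {"name": "Alice"}, 2: {"name": "Bob"}}

# Change records as a change-capture compare would emit them.
changes = [
    {"code": "insert", "key": 3, "row": {"name": "Carol"}},
    {"code": "update", "key": 2, "row": {"name": "Robert"}},
]

# Apply each change to the shadow copy (the same changes would also be
# applied to the real target table, keeping the two in step).
for ch in changes:
    if ch["code"] in ("insert", "update"):
        shadow[ch["key"]] = ch["row"]
    elif ch["code"] == "delete":
        shadow.pop(ch["key"], None)

print(shadow)
```

Because any missed or failed batch leaves the shadow out of sync with the target, a periodic full rebuild from the target (say weekly) keeps the drift bounded.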
Certus Solutions
Blog: Tooling Around in the InfoSphere
Twitter: @vmcburney
LinkedIn:Vincent McBurney LinkedIn
CDC is fine. One million records is not much. We use CDC on much higher volumes, e.g. up to 1 billion rows, and it still performs. But make sure you configure your DataStage server properly and that you partition your data all the way through, meaning no bottlenecks. If your jobs are developed properly for the DataStage architecture, they will scale; the caveat is that you might need to add more hardware (e.g. more CPUs).
CDC
Hi,
We are facing similar CDC-related performance issues in some of our heavy-volume jobs.
The scenario for us: the target dimension has about 20 million records, and the incoming source data (Oracle) also totals about 20 million records.
Since the source is not a fact table, there seems to be no way to filter the incoming dataset before bringing it into DataStage.
One option being explored is to do this data-subset operation in Oracle by applying MINUS logic between source and target (source to target) to get the changed and new data from the source. This would substantially reduce the number of rows from the source, but put the burden on the Oracle database instead.
Since CDC brings in all the data from the target and loads it into memory (20 million rows and growing), we would also do the Oracle MINUS operation from target to source and reduce the data comparison on that side.
The net effect is that we get a reduced volume from the source and a reduced volume from the target, and then do the CDC operation on those.
Obviously this puts more load on the Oracle database, but DataStage can do its part much quicker with the reduced volume.
This is all just theory so far; we are going to POC it.
Any problems with this, or alternate approach suggestions, are welcome!
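The two MINUS subsets described above can be sketched outside Oracle. This uses SQLite as a stand-in (SQLite spells MINUS as EXCEPT); table and column names are hypothetical:

```python
import sqlite3

# Stand-in tables for the 20M-row source and target; names are hypothetical.
conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE source (cust_id INTEGER, name TEXT);
    CREATE TABLE target (cust_id INTEGER, name TEXT);
    INSERT INTO source VALUES (1, 'Alice'), (2, 'Robert'), (3, 'Carol');
    INSERT INTO target VALUES (1, 'Alice'), (2, 'Bob');
""")

# Source-to-target MINUS: the new or changed rows from the source.
changed_or_new = conn.execute("""
    SELECT cust_id, name FROM source
    EXCEPT
    SELECT cust_id, name FROM target
""").fetchall()

# Target-to-source MINUS: the current target versions of those rows,
# so the CDC compare only has to load the reduced set into memory.
old_versions = conn.execute("""
    SELECT cust_id, name FROM target
    EXCEPT
    SELECT cust_id, name FROM source
""").fetchall()

print(changed_or_new)  # rows to feed into the CDC compare
print(old_versions)    # the reduced target-side set
```

Note that EXCEPT/MINUS compares whole rows, so both sides must project the same columns in the same order, and the unchanged rows drop out on both sides, which is exactly the reduction described above.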