Insert/Update Job Design: Comments?

Post questions here relative to DataStage Enterprise/PX Edition for such areas as Parallel job design, Parallel datasets, BuildOps, Wrappers, etc.

Moderators: chulett, rschirm, roy

Post Reply
lshort
Premium Member
Premium Member
Posts: 139
Joined: Tue Oct 29, 2002 11:40 am
Location: Toronto

Insert/Update Job Design: Comments?

Post by lshort »

I am experimenting with the Compare Stage (trying to determine its usefullness). I am exploring the following scenario.
Requirement:
A. Insert of all New records to DataWarehouse
B. Update DW when changes are made to specific columns ONLY.
Any thoughts or comments? :)

Step 1. Obtain seq file containing TODAY transactions.

Step 2.

Job1: DBMS lookup to determine INSERTS (key not found)

Source Link= Today.seq
Ref Link = DW
<<LookupStage>>
Output Link1 [NOT.FOUND]= Inserts.ds
Output Link2 [FOUND]= CouldBeUpdate.ds
Output Link3 [FOUND]= FoundInDw.ds

Job2: Compare-Determine which rows have a change to the required columns

Source Link1= CouldBeUpdate.ds
Source Link2= FoundInDw.ds
<<CompareStage>>
Output Link =CompareResults.ds

Job3: Seperate updates from no change rows

Source Link1= CompareResults.ds
<<SwitchStage>>
Output Link1 [Result Code<>0] = Updates.ds
Output Link2 [rej] = NoChange.ds

Step 3.
Do Inserts and Updates to the DW
Inserts.ds
Updates.ds
Lance Short
"infinite diversity in infinite combinations"
***
"The absence of evidence is not evidence of absence."
richdhan
Premium Member
Premium Member
Posts: 364
Joined: Thu Feb 12, 2004 12:24 am

Post by richdhan »

Hi lshort,

Look into the CDC stage. The documentation in parjdev.pdf gives additional information on CDC stage. It is used to identify new inserts, updates as well as deletes.

You can finish it in one job. The CDC stage requires 2 inputs. Use the flat file as one input and the DBMS data as the other input. Define the keys and values in CDC stage. The CDC stage provides an additional column known as change_code which is used to identify an insert, update, delete or an exact copy.

HTH
Rich
lshort
Premium Member
Premium Member
Posts: 139
Joined: Tue Oct 29, 2002 11:40 am
Location: Toronto

Post by lshort »

I'll look into it. Thanks for the info. 8)
Lance Short
"infinite diversity in infinite combinations"
***
"The absence of evidence is not evidence of absence."
gh_amitava
Participant
Posts: 75
Joined: Tue May 13, 2003 4:14 am
Location: California
Contact:

Post by gh_amitava »

Hi,

You can use the "Upsert" feature of DataStage.

Regards
Amitava
lshort
Premium Member
Premium Member
Posts: 139
Joined: Tue Oct 29, 2002 11:40 am
Location: Toronto

Post by lshort »

True. I could use upsert. However, if I should need to be able to track the number of inserts versus updates upsert will not allow for it. 8)
Lance Short
"infinite diversity in infinite combinations"
***
"The absence of evidence is not evidence of absence."
bcarlson
Premium Member
Premium Member
Posts: 772
Joined: Fri Oct 01, 2004 3:06 pm
Location: Minnesota

Post by bcarlson »

richdhan wrote:Hi lshort,

Look into the CDC stage. The documentation in parjdev.pdf gives additional information on CDC stage. It is used to identify new inserts, updates as well as deletes.

Rich
Pardon a potentially dumb question. What is the CDC Stage? I couldn't find references to it in the parjdev manual. Is this the change capture stage?

We are looking into a similar situation (insert new/update only if certain fields change) and I found this posting in a search.
richdhan
Premium Member
Premium Member
Posts: 364
Joined: Thu Feb 12, 2004 12:24 am

Post by richdhan »

Hi Carlson

Yes, I was referring to the Change Capture stage.

The Change Capture stage will work perfectly for your situation. You have to define the fields in the change values section of the CDC stage and change_code additional column will detect the update.

HTH
Rich
varshanswamy
Participant
Posts: 48
Joined: Thu Mar 11, 2004 10:32 pm

Post by varshanswamy »

With respect to the performance does the change capture stage provide better performance when compared to the lookup stage.
ray.wurlod
Participant
Posts: 54607
Joined: Wed Oct 23, 2002 10:52 pm
Location: Sydney, Australia
Contact:

Post by ray.wurlod »

It provides different functionality, so you'd be trying to compare apples with oranges even if you did have a definition of "performance".
IBM Software Services Group
Any contribution to this forum is my own opinion and does not necessarily reflect any position that IBM may hold.
vmcburney
Participant
Posts: 3593
Joined: Thu Jan 23, 2003 5:25 pm
Location: Australia, Melbourne
Contact:

Post by vmcburney »

I always felt you were better off doing surrogate key lookup first and then doing change data capture on the rows that already exist and sending the new rows directly down an insert path. A surrogate key lookup only retrieves a couple columns from the database while CDC returns every column. So lookup followed by CDC should be faster then CDC followed by lookup.

You can't get away with a CDC alone unless you have built dimensions without surrogate keys. A CDC stage can bring in rows from the dimension for comparison but it will not let you take or use a dimension field. You can only output rows from the "After" link, not the "Before" link.

We had shocking performance of upserts on DB2 and had to make sure we used seperate insert and update paths. The initial load of a table is the worst performer, DB2 has to perform full table scans for every row as it cannot use an index until stats have been run on the table.
balakot
Participant
Posts: 2
Joined: Thu Oct 07, 2004 6:15 pm

a different type of question on this

Post by balakot »

i was working on some thing like the same job that you guys were discussing and in the end we want to write to the data base only there records that have a same data type and the ones that dont have the data types should be dropped out and these records should be written out to the reject file.
Is there any function out there like get stage info or some thin that can give us some details about the record be dropped since there is a different data type out there?

Thanks
-babloo
Post Reply