Slowly changing dimension stage and commit interval

Post questions here relative to DataStage Enterprise/PX Edition for such areas as Parallel job design, Parallel datasets, BuildOps, Wrappers, etc.

Moderators: chulett, rschirm, roy

abc123
Premium Member
Posts: 605
Joined: Fri Aug 25, 2006 8:24 am

Slowly changing dimension stage and commit interval

Post by abc123 »

If I have 3 records for the same business key in the incoming source and I have APT_ORAUPSERT_COMMIT_ROW_INTERVAL of 5000, how would the SCD stage handle such a situation? Example:

BkCol  ValueCol1  ValueCol2
1      A          Ab
2      C          Ba
2      C          Ca
2      C          Cb

The first record should be an insert and the rest should be updates. Will this be handled automatically? (I don't think so, but I would like to hear your opinions.) Two questions:

1) Setting APT_ORAUPSERT_COMMIT_ROW_INTERVAL to 1 would make inserts/updates extremely slow.
How would you handle it?

2) How would you handle the multi-node situation?
kduke
Charter Member
Posts: 5227
Joined: Thu May 29, 2003 9:47 am
Location: Dallas, TX
Contact:

Post by kduke »

All rows with the same key have to be on one node. So you are forced to sort by the key and keep the rows in change-date order; otherwise your final result will not be accurate. You still have to commit one row at a time.

The only other solution is to remove dupes and bring in only one row per key at a time. This means you have to make multiple passes, but you can commit 5000 rows at a time.
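To illustrate the "remove dupes, multiple passes" idea above, here is a minimal Python sketch (not DataStage itself; the function name and row layout are made up for illustration). Each pass carries at most one row per business key, so an upsert driven from a single pass can safely commit in large batches.

```python
# Hypothetical sketch of splitting a feed into passes so that no pass
# contains the same business key twice. With duplicates removed per pass,
# the upsert can commit in large batches without key-ordering hazards.
from collections import defaultdict

def split_into_passes(rows, key_col):
    """Greedily assign rows to passes so no pass repeats a business key."""
    passes = defaultdict(list)
    seen_count = defaultdict(int)          # occurrences of each key so far
    for row in rows:
        k = row[key_col]
        passes[seen_count[k]].append(row)  # nth duplicate goes to pass n
        seen_count[k] += 1
    return [passes[i] for i in sorted(passes)]

# The example feed from the original question
rows = [
    {"BkCol": 1, "ValueCol1": "A", "ValueCol2": "Ab"},
    {"BkCol": 2, "ValueCol1": "C", "ValueCol2": "Ba"},
    {"BkCol": 2, "ValueCol1": "C", "ValueCol2": "Ca"},
    {"BkCol": 2, "ValueCol1": "C", "ValueCol2": "Cb"},
]
for i, batch in enumerate(split_into_passes(rows, "BkCol"), 1):
    print(f"pass {i}: {[r['ValueCol2'] for r in batch]}")
# pass 1: ['Ab', 'Ba']
# pass 2: ['Ca']
# pass 3: ['Cb']
```

Note that the rows must already be in change-date order within each key, as the post above requires, so that later passes apply later versions.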
Mamu Kim
ray.wurlod
Participant
Posts: 54607
Joined: Wed Oct 23, 2002 10:52 pm
Location: Sydney, Australia
Contact:

Re: Slowly changing dimension stage and commit interval

Post by ray.wurlod »

abc123 wrote:Setting APT_ORAUPSERT_COMMIT_ROW_INTERVAL to 1 would make insertion/updation extremely slow.
Curiously this is not always the case.
IBM Software Services Group
Any contribution to this forum is my own opinion and does not necessarily reflect any position that IBM may hold.
abc123
Premium Member
Posts: 605
Joined: Fri Aug 25, 2006 8:24 am

Post by abc123 »

Kim, I am doing the ORDER BY in the source SQL. I had hash partitioning on the Input tab of the next stage but I removed it for performance reasons. Would doing the ORDER BY put all the same keys on the same node without specifying hash partitioning? I wouldn't think so.

Ray, that's an interesting piece of information. What do you think the performance difference might be between a commit interval of 5000 and a commit interval of 1? Just curious.
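As a back-of-envelope way to think about that difference, here is a toy Python model. All the latency figures below are assumptions for illustration only; real numbers depend entirely on the database, network, and workload, and as the replies in this thread note, a smaller interval is not always slower in practice.

```python
# Toy model of commit-interval overhead. per_row_ms is the assumed time to
# execute one upsert; commit_ms is the assumed time for one commit. Both
# figures are invented for illustration, not measured.
def observed_rate(rows, per_row_ms, commit_ms, interval):
    """Average rows/sec including the stall added by each commit."""
    commits = -(-rows // interval)              # ceiling division
    total_ms = rows * per_row_ms + commits * commit_ms
    return rows / (total_ms / 1000.0)

# 10,000 rows, 1 ms per upsert, 200 ms per commit (assumed figures)
print(round(observed_rate(10_000, 1.0, 200.0, 1)))     # commit every row: 5
print(round(observed_rate(10_000, 1.0, 200.0, 5000)))  # every 5000 rows: 962
```

Under these made-up numbers the per-row commit pays the commit cost 10,000 times instead of twice, which is why interval 1 is usually feared; the counterpoint in the next reply is about what the larger interval does to the observed rows/sec while a big commit is pending.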
ray.wurlod
Participant
Posts: 54607
Joined: Wed Oct 23, 2002 10:52 pm
Location: Sydney, Australia
Contact:

Post by ray.wurlod »

In an Upsert scenario, each individual update may take considerable time (in database terms). If you send 5000 rows before you commit, that wait time is multiplied and shows up as a performance degradation: for example, rows/sec goes down, because the clock keeps running while no more rows are processed during the wait.
IBM Software Services Group
Any contribution to this forum is my own opinion and does not necessarily reflect any position that IBM may hold.
kduke
Charter Member
Posts: 5227
Joined: Thu May 29, 2003 9:47 am
Location: Dallas, TX
Contact:

Post by kduke »

An ORDER BY in SQL is not always the same as a hash-partitioned sort in the job. But if it works, then don't worry. The beauty of DataStage is that you can easily try it both ways and tell us which performs best. It doesn't take long to create two versions of the job.
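The distinction above can be sketched in a few lines of Python (illustrative only; Python's built-in `hash()` stands in for the engine's hash partitioner, and round-robin stands in for a default partitioning of already-sorted input). Sorted input that is then partitioned round-robin can still scatter one business key across nodes, while hash partitioning confines each key to a single node.

```python
# Illustrative sketch (not DataStage): why an ORDER BY alone does not
# guarantee key colocation across nodes, while hash partitioning does.
def hash_partition(keys, nodes):
    parts = [[] for _ in range(nodes)]
    for k in keys:
        parts[hash(k) % nodes].append(k)  # same key -> same partition
    return parts

def round_robin(keys, nodes):
    parts = [[] for _ in range(nodes)]
    for i, k in enumerate(sorted(keys)):  # sorted, then dealt out in turn
        parts[i % nodes].append(k)
    return parts

keys = [1, 2, 2, 2]                       # business keys from the example
print(round_robin(keys, 2))               # key 2 lands on both nodes
print(hash_partition(keys, 2))            # key 2 confined to one node
```

This is why a hash partition (and sort) on the business key inside the job is the safe choice when the same key must be processed on one node, whatever the source SQL does.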
Mamu Kim