Page 1 of 1

Slowly changing dimension stage and commit interval

Posted: Tue Apr 19, 2011 7:00 pm
by abc123
If I have 3 records for the same business key in the incoming source and I have APT_ORAUPSERT_COMMIT_ROW_INTERVAL of 5000, how would the SCD stage handle such a situation? Example:

BkCol ValueCol1 ValueCol2
1 A Ab
2 C Ba
2 C Ca
2 C Cb

The first record should be an insert and the rest should be updates. Will this be handled automatically(I don't think so but would like to hear your opinions). Two questions:

1)Setting APT_ORAUPSERT_COMMIT_ROW_INTERVAL to 1 would make insertion/updation extremely slow.
How would you handle it?

2)How would you handle the multi-node situation?

Posted: Tue Apr 19, 2011 8:16 pm
by kduke
All of the same keys have to be on one node. So you are force to sort by the key and keep it in change date order otherwise you final result not accurate. You still have to commit one row at a time.

The only other solution is to remove dupes. Bring in only one at a time. This means you have to make multiple passes by you can commit 5000 rows at a time.

Re: Slowly changing dimension stage and commit interval

Posted: Tue Apr 19, 2011 8:38 pm
by ray.wurlod
abc123 wrote:Setting APT_ORAUPSERT_COMMIT_ROW_INTERVAL to 1 would make insertion/updation extremely slow.
Curiously this is not always the case.

Posted: Wed Apr 20, 2011 7:56 am
by abc123
Kim, I am doing the ORDER BY in the source SQL. I had hash partitioning on the Input tab of the next stage but I removed it for performance reasons. Would doing the ORDER BY put all same keys in the same node without specifying hash partitioning? I wouldn't think so.

Ray, that's an interesting piece of information. What do you think the best peformance percentage difference might be, between 5000 commits and 1 commit? Just curious.

Posted: Wed Apr 20, 2011 4:35 pm
by ray.wurlod
In an Upsert scenario, each individual update may take some considerable time (in database terms). If you send 5000 before you commit, that wait time is multiplied and seen as a performance degradation (for example because rows/sec goes down - the clock keeps running but no more rows are being processed while waiting).

Posted: Wed Apr 20, 2011 7:18 pm
by kduke
Order By in SQL not always same as hash sort in job. But if it works then don't worry. The beauty of DataStage is you can easily try it both ways and tell us what performs best. Doesn't take long to create 2 versions of the job.