Performing database commits based on data - can it be done

Post questions here relative to DataStage Enterprise/PX Edition for such areas as Parallel job design, Parallel datasets, BuildOps, Wrappers, etc.


ArndW
Participant
Posts: 16318
Joined: Tue Nov 16, 2004 9:08 am
Location: Germany
Contact:

Performing database commits based on data - can it be done

Post by ArndW »

I have a PX job that reads a named pipe as its input stream and processes a lot of data during its lengthy run. One of the columns in this file is "FILE_ID", which references the actual source file. Because of the large data volume (over a hundred million rows per day) I would like to set the commit interval to Oracle as high as practicable, but for restartability I would also like to perform a commit whenever the FILE_ID column value changes (this could be anywhere from 10 to 250,000 rows), so that we can design the restart procedures around the FILE_ID.

I've not found a way to implement this type of commit policy, since explicitly issuing a commit from another database stage in the same job will only commit that stage's own session, not the session holding all the data.
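Outside of DataStage, the commit policy described above is straightforward to express. The following is a minimal, hypothetical Python sketch (using sqlite3 as a stand-in for Oracle, and an in-memory row list as a stand-in for the named pipe) of committing on every FILE_ID change rather than on a fixed row count; the table and column names are invented:

```python
import sqlite3

def load_with_file_id_commits(conn, rows):
    """Insert (file_id, payload) rows, committing whenever FILE_ID changes."""
    cur = conn.cursor()
    prev_file_id = None
    for file_id, payload in rows:
        if prev_file_id is not None and file_id != prev_file_id:
            conn.commit()  # boundary between source files: make the previous file durable
        cur.execute("INSERT INTO calls (file_id, payload) VALUES (?, ?)",
                    (file_id, payload))
        prev_file_id = file_id
    conn.commit()  # flush the final file's rows

# Demo with three rows from two source files
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE calls (file_id INTEGER, payload TEXT)")
load_with_file_id_commits(conn, [(1, "a"), (1, "b"), (2, "c")])
row_count = conn.execute("SELECT COUNT(*) FROM calls").fetchone()[0]
```

On a crash, at most the rows of the FILE_ID currently in flight would be uncommitted, which is exactly the restartability property described above.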

I'm pretty certain that this type of commit triggering cannot be done in DataStage EE but would be overjoyed if someone could prove me wrong.
ray.wurlod
Participant
Posts: 54595
Joined: Wed Oct 23, 2002 10:52 pm
Location: Sydney, Australia
Contact:

Post by ray.wurlod »

No joy from me. I'm also convinced that there's no mechanism in DataStage - apart, perhaps, from a Build stage - that can do what you seek.
IBM Software Services Group
Any contribution to this forum is my own opinion and does not necessarily reflect any position that IBM may hold.
chulett
Charter Member
Posts: 43085
Joined: Tue Nov 12, 2002 4:34 pm
Location: Denver, CO

Post by chulett »

:? Could you perhaps split or partition your data on this FILE_ID field? Or build some sort of iterative process? I'd also be curious how many discrete values you'd see in that field in a typical load.
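As an illustration of the split suggested here: since the pipe delivers one source file at a time, the stream is already ordered by FILE_ID, so it could be broken into per-file batches that each form their own unit of work. A hypothetical Python sketch (the row shape is invented):

```python
from itertools import groupby
from operator import itemgetter

def batches_by_file_id(rows):
    """Group an ordered stream of (file_id, payload) rows into per-file batches.

    Relies on rows arriving ordered by FILE_ID, i.e. one file at a time
    on the pipe; each yielded batch could then be loaded and committed
    as its own unit of work.
    """
    for file_id, group in groupby(rows, key=itemgetter(0)):
        yield file_id, [payload for _, payload in group]

batches = list(batches_by_file_id([(1, "a"), (1, "b"), (2, "c")]))
```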
-craig

"You can never have too many knives" -- Logan Nine Fingers
ArndW
Participant
Posts: 16318
Joined: Tue Nov 16, 2004 9:08 am
Location: Germany
Contact:

Post by ArndW »

We have a number of files coming in (from the phone switches) with call information. Some files are just a couple of rows, others have hundreds of thousands. The PX job that processes these files uses a lot of lookup filesets and lookups against database tables, and the startup time to load this information into memory can be 15 minutes.
A 15-minute startup for a small file that gets processed in less than 1 second is wasteful, so the design is to have the PX job start up once and then feed it the files as they become ready, hence the named pipe as input.
As part of this stream we pass the FILE_ID, which changes a couple of thousand times per hour. For restartability we would like the COMMIT to the database to be triggered whenever the FILE_ID changes, so that if the process dies we know at most one file's contents are incomplete, namely the FILE_ID it was processing at the time. Without this contents-based commit we could potentially have several files needing reprocessing (if the commit interval is 50,000 rows and the last files loaded are all smaller, we could have many files whose data isn't in the table).

I wonder if this type of scenario is common enough to make it worthwhile putting in an enhancement request for data-driven commits in the target database stages? Perhaps the ability to add a condition statement, similar in syntax to Transformer stage constraints, that would trigger a commit. It shouldn't be too complex to implement - but is it worthwhile?
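To make the enhancement idea concrete: the condition would be evaluated per row against the current and previous row, much like a Transformer constraint. A hypothetical Python sketch of the semantics (all names here are invented, not any actual DataStage API):

```python
def should_commit(prev_row, row, condition):
    """Evaluate a per-row commit condition against the previous and current row.

    condition: callable(prev_row, row) -> bool. The first row of the
    stream never triggers a commit, since there is nothing to flush yet.
    """
    return prev_row is not None and condition(prev_row, row)

# e.g. the FILE_ID-change policy discussed in this thread:
file_id_changed = lambda prev, cur: prev["FILE_ID"] != cur["FILE_ID"]

first = should_commit(None, {"FILE_ID": 1}, file_id_changed)
changed = should_commit({"FILE_ID": 1}, {"FILE_ID": 2}, file_id_changed)
same = should_commit({"FILE_ID": 2}, {"FILE_ID": 2}, file_id_changed)
```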