I have a PX job which has a named pipe as its input stream and which processes a lot of data during its lengthy run. One of the columns in this stream is FILE_ID, which holds a reference to the actual source file. Because of the large data volume (over a hundred million rows per day) I would like to make the commit interval to Oracle as long as practicable, but for restartability I would also like to perform an additional commit whenever the FILE_ID column value changes (this would be anywhere from 10 to 250,000 rows apart), so that we can design the restart procedures around FILE_ID.
I've not found a way to implement this type of commit policy, since explicitly issuing a commit from another database stage in the same job will only commit that stage's session, not the session holding all the data.
I'm pretty certain that this type of commit triggering cannot be done in DataStage EE but would be overjoyed if someone could prove me wrong.
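Outside of DataStage itself, the commit policy being described is a standard "commit on key change" loop. The sketch below illustrates the idea in Python, using sqlite3 purely as a stand-in for the Oracle target; the `calls` table, column names, and the row format are hypothetical, not anything from the job above.

```python
import sqlite3

def load_with_key_change_commits(conn, rows, max_rows_per_commit=50_000):
    """Insert rows, committing whenever FILE_ID changes, with an ordinary
    size-based commit as a safety net for very large files.

    `rows` is an iterable of (file_id, payload) tuples; the `calls` table
    and its columns are illustrative stand-ins for the real target table.
    """
    cur = conn.cursor()
    prev_file_id = None
    pending = 0
    for file_id, payload in rows:
        # Data-driven commit: the key column changed, so everything
        # belonging to the previous file is now durable.
        if prev_file_id is not None and file_id != prev_file_id:
            conn.commit()
            pending = 0
        cur.execute(
            "INSERT INTO calls (file_id, payload) VALUES (?, ?)",
            (file_id, payload),
        )
        pending += 1
        prev_file_id = file_id
        # Conventional row-count commit, for files larger than the interval.
        if pending >= max_rows_per_commit:
            conn.commit()
            pending = 0
    conn.commit()  # flush the final file's rows
```

With this policy, a crash can leave at most one FILE_ID partially loaded: the one that was in flight when the process died.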
Performing database commits based on data - can it be done?
ray.wurlod (Participant, Sydney, Australia - joined Wed Oct 23, 2002)
We have a number of files coming in (from the phone switches) with call information. Some files are just a couple of rows, others have hundreds of thousands. The PX job that processes these files uses a lot of lookup filesets and lookups to database tables - the startup time to load this information into memory can be 15 minutes.
A startup time of 15 minutes for a small file that gets processed in less than 1 second is wasteful, so the design of these jobs is to have the PX job start up once and then feed it the files as they become ready to be processed - hence the named pipe as the input.
As part of this stream we pass it the FILE_ID, which changes a couple of thousand times per hour. For restartability we would like to ensure that the COMMIT to the database is triggered when the FILE_ID changes, so that if the process dies we know we have at most one incomplete file's contents in the table, namely those of the FILE_ID it was currently processing. Without this contents-based commit we could potentially have several files that need reprocessing (if the commit interval is 50,000 rows and the last files loaded are all smaller, we could have many files whose data isn't in the table).
I wonder if this type of scenario is common enough to make it worthwhile submitting an enhancement request for data-driven commits in the target database stages - perhaps the ability to add a condition expression, similar in syntax to Transformer stage constraints, that triggers a commit when it evaluates true. It shouldn't be too complex to implement - but is it worthwhile?
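The restart procedure that this commit policy enables can also be sketched briefly. Assuming a manifest of expected row counts per source file is available (a hypothetical input, not something described above), finding the files that need reprocessing is a matter of comparing it against what was actually committed. Again sqlite3 stands in for Oracle, and the `calls` table is illustrative.

```python
import sqlite3

def files_to_reprocess(conn, expected_counts):
    """Given {file_id: expected_row_count} from a source-side manifest,
    return the FILE_IDs whose committed row count falls short.

    With commit-on-FILE_ID-change, at most one file can be partially
    loaded, so this list is the complete restart work queue.
    """
    loaded = dict(conn.execute(
        "SELECT file_id, COUNT(*) FROM calls GROUP BY file_id"))
    return sorted(fid for fid, expected in expected_counts.items()
                  if loaded.get(fid, 0) < expected)
```

Partially loaded files would also need their committed rows deleted (or the load made idempotent) before being re-fed through the pipe.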