Designing a Generic Datastage job for multiple input sources
Posted: Fri Oct 13, 2006 6:18 am
The following requirements are there in our project:
Multiple suppliers send files at various times of the day to defined locations. The formats of the files sent by the suppliers are all different. These have to be processed and ultimately ingested into a set of tables, as and when they arrive at the specified location.
The easy way is of course to design a job for each file, plus a job sequence with a wait-for-file activity for the arrival of that file. But this increases the development time, makes the jobs repetitive and creates maintenance hassles. So we are trying to design the jobs as follows:
1. Create a sequence that takes all the files from all suppliers and creates a uniform staging file, which is basically the universe of all the fields sent by the suppliers.
2. Take the fields from the staging file and map them into the database tables as required.
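To make step 1 concrete, here is a minimal sketch of the mapping idea outside DataStage. All the supplier names, column names and staging fields below are made up for illustration: each supplier's columns are mapped into one uniform record that is the superset of everyone's fields, with the fields a supplier does not send left blank.

```python
# Hypothetical field mappings: supplier column name -> uniform staging
# column. The uniform staging record is the superset of all suppliers'
# fields; columns a supplier does not send are left empty.
STAGING_FIELDS = ["cust_id", "name", "amount", "order_date", "region"]

SUPPLIER_MAPS = {
    "SUPPLIER_A": {"id": "cust_id", "customer": "name", "amt": "amount"},
    "SUPPLIER_B": {"custno": "cust_id", "total": "amount",
                   "dt": "order_date", "zone": "region"},
}

def to_staging_row(supplier, record):
    """Map one supplier record (a dict) into the uniform staging layout."""
    mapping = SUPPLIER_MAPS[supplier]
    row = dict.fromkeys(STAGING_FIELDS, "")  # blank out every field first
    for src, dst in mapping.items():
        if src in record:
            row[dst] = record[src]
    return row
```

In a generic DataStage job the same effect is usually achieved with a schema/mapping lookup driven by a job parameter, so one job body serves every supplier.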
We hit some problems with scheduling. The job we design either waits for all the supplier files to arrive before it can start, or, if we make the sequence start on the arrival of any one file, then files that arrive while the sequence is already running do not get picked up.
I am aware that DS jobs can be scheduled from UNIX scripts, but I don't know how. I actually want to run the script from the Windows scheduler, so that the script can execute the generic job (under a different alias, depending on which input file it has been scheduled to process, passing a list of arguments to be used by the job as parameters). I am looking for some advice/help on how to implement this. If anyone has sample scripts, C++ programs etc. which can be used for the scheduling, or can direct me to the same, I will be most obliged.
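One way to sketch this, assuming a multiple-instance job so that each supplier gets its own invocation ID (the standard `dsjob -run -param ... -jobstatus project job.invocationid` command line): a small wrapper that the Windows scheduler invokes with a supplier alias and a file path. The project name, job name and parameter name below are placeholders, not your real ones.

```python
import subprocess
import sys

def build_dsjob_run(project, job, invocation_id, params):
    """Build the dsjob command line for one instance of a
    multiple-instance job, identified as job.invocation_id."""
    cmd = ["dsjob", "-run"]
    for name, value in params.items():
        cmd += ["-param", f"{name}={value}"]
    cmd += ["-jobstatus", project, f"{job}.{invocation_id}"]
    return cmd

if __name__ == "__main__" and len(sys.argv) >= 3:
    # Scheduled e.g. as: run_supplier.py SUPPLIER_A D:\landing\supplier_a.dat
    supplier_alias, input_file = sys.argv[1], sys.argv[2]
    cmd = build_dsjob_run("MyProject", "GenericLoad", supplier_alias,
                          {"InputFile": input_file})
    subprocess.run(cmd, check=True)  # -jobstatus makes dsjob wait and
                                     # return the job's exit status
```

You would register one scheduled task per supplier (or one polling task that detects which file arrived), each passing a different alias, so several instances of the same generic job can run independently.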
Then there is the second problem, of course! If there are multiple instances of the job that creates the staging file, does anything get overwritten? Basically, all the instances are trying to write into the same file concurrently, so there are bound to be concurrency issues. Further, the staging file has to be read to load the tables, again as and when the staging file is not empty. So essentially it is a problem of how to schedule the same job(s) repeatedly, but to process different sets of data. Please advise.
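One common way around the shared-file contention, sketched below under assumed names: have each instance write its own staging file, keyed by the invocation ID, so no two instances ever hold the same file. The loader job then sweeps up whatever non-empty staging files exist.

```python
from pathlib import Path

def staging_path(staging_dir, job, invocation_id):
    """Per-instance staging file: concurrent instances never share a
    file, because the invocation ID is part of the name."""
    return Path(staging_dir) / f"staging_{job}_{invocation_id}.dat"

def pending_staging_files(staging_dir):
    """What the loader job would pick up: every non-empty staging file
    currently sitting in the staging directory."""
    return sorted(p for p in Path(staging_dir).glob("staging_*.dat")
                  if p.stat().st_size > 0)
```

The loader can rename or delete each file after a successful load so a rerun does not ingest the same data twice; the equivalent inside DataStage is simply to parameterise the staging file name with the same invocation ID used to run the job.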