Running Multiple Instances - Partition By Date

Post questions here relative to DataStage Server Edition for such areas as Server job design, DS Basic, Routines, Job Sequences, etc.

Moderators: chulett, rschirm, roy

pradkumar
Charter Member
Posts: 393
Joined: Wed Oct 18, 2006 1:09 pm

Running Multiple Instances - Partition By Date

Post by pradkumar »

Hi

I am trying to run my job as multiple instances. I have developed a job sequence with six instances of the job.

I am trying to partition the data on a timestamp; I have a create_date column in my source data.
So I am trying to use a job parameter to which I can assign a year, and in the main job's WHERE clause I am using this year, as in "WHERE creation_date = #YEAR#".

This looks promising for a FULL LOAD.

But for an incremental load, I don't think this is going to work. Please let me know any ideas or comments on this.
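Since creation_date is a date and #YEAR# holds a year, I believe the predicate needs to pull the year out of the column first. A minimal sketch of what I mean, in ANSI SQL (source_table is just a placeholder, and some databases spell the year extraction differently):

Code:

-- One instance's extraction query; #YEAR# is that instance's job parameter
SELECT *
FROM   source_table
WHERE  EXTRACT(YEAR FROM creation_date) = #YEAR#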
Pradeep Kumar
wnogalski
Charter Member
Posts: 54
Joined: Thu Jan 06, 2005 10:49 am
Location: Warsaw

Post by wnogalski »

Do you want to have the same dates (date ranges) in each file? E.g. 20070101-20070105 into file1, 20070106-20070110 into file2, etc.?
If not, then simply use a Link Partitioner stage to put the data into sequential files and then use those files in your multiple-instance job.
If yes, then this will require some calculation to derive the proper splitting parameter(s) (according to your needs), which you would then use in a Transformer stage to route rows into the proper files.
Regards,
Wojciech Nogalski
ray.wurlod
Participant
Posts: 54607
Joined: Wed Oct 23, 2002 10:52 pm
Location: Sydney, Australia

Post by ray.wurlod »

If your incremental load contains creation dates from all years, then your approach (assuming it works for full load) should work OK for incremental load.

I'm dubious about comparing something called "...date" against a job parameter called "#YEAR#" - maybe I'm more finicky about meaningful names than you are.

If your incremental loads contain only dates from the current year, then five of your six job invocations will process zero rows. While not wrong, it seems inefficient to start them at all.

You may want to clone the job for incremental loads and divide the work at a finer level of granularity - perhaps by month, perhaps by date range. Whatever you choose, you would implement it through job parameters, just as you have with the #YEAR# parameter for the full load.
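A minimal sketch of a range-based predicate, assuming two hypothetical job parameters #RangeStart# and #RangeEnd# and ANSI date literals (adjust to your database):

Code:

-- Each invocation gets its own non-overlapping range via job parameters
WHERE creation_date >= DATE '#RangeStart#'
  AND creation_date <  DATE '#RangeEnd#'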
IBM Software Services Group
Any contribution to this forum is my own opinion and does not necessarily reflect any position that IBM may hold.
loveojha2
Participant
Posts: 362
Joined: Thu May 26, 2005 12:59 am

Post by loveojha2 »

What about:

creation_date = #YEAR# and creation_date > #Last Run Date#

#Last Run Date# can be another parameter passed in from the sequence, and you need to maintain it for all the partitions. You would write routine(s) to fetch and store the last run date for each instance/partition.
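One way to maintain those per-instance dates is a small control table that the routines read and update. A sketch only, with made-up table, column, and instance names:

Code:

-- Hypothetical watermark table: one row per instance/partition
CREATE TABLE etl_last_run (
    instance_name VARCHAR(30) PRIMARY KEY,
    last_run_date DATE
);

-- Before the run: fetch the value to pass in as #Last Run Date#
SELECT last_run_date FROM etl_last_run WHERE instance_name = 'YEAR_2006';

-- After a successful run: advance the watermark
UPDATE etl_last_run SET last_run_date = CURRENT_DATE
 WHERE instance_name = 'YEAR_2006';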
Success consists of getting up just one more time than you fall.
pradkumar
Charter Member
Posts: 393
Joined: Wed Oct 18, 2006 1:09 pm

Post by pradkumar »

Thanks for all your ideas.
I am actually using two date job parameters, LoadDate and LastLoadDate.
My WHERE clause is like:

Code:

Creation_Date BETWEEN '#LastLoadDate#' AND '#LoadDate#'
So now, using a sequence with six job activities, the WHERE clause in the main job will be something like:

Code:

-- Extract the year from Creation_Date and compare it to the Year parameter from the job activity
EXTRACT(YEAR FROM Creation_Date) = #Year#

Code:

JobActivity1:  LoadDate = 2006-12-31; LastLoadDate = 2006-01-01; Year = '2006'
JobActivity2:  LoadDate = 2005-12-31; LastLoadDate = 2005-01-01; Year = '2005'
JobActivity3:  LoadDate = 2004-12-31; LastLoadDate = 2004-01-01; Year = '2004'
...and so on until 2000
The above is for the full load. By the way, I am using a LoadDate column in the final table to record the date each row was loaded, so the scheme above is going to make a mess of LoadDate. I have therefore decided to use another column, RunDate, and update it to the date the row is loaded into the data warehouse.
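To illustrate what I mean (table and column names here are invented; the ANSI DATE literal resolves once the #LoadDate# parameter is substituted):

Code:

-- Hypothetical target insert: load_date comes from the job parameter,
-- run_date is stamped with the actual warehouse load date
INSERT INTO dw_orders (order_id, creation_date, load_date, run_date)
VALUES (?, ?, DATE '#LoadDate#', CURRENT_DATE);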

Any comments on my idea, please?
Pradeep Kumar
wnogalski
Charter Member
Posts: 54
Joined: Thu Jan 06, 2005 10:49 am
Location: Warsaw

Post by wnogalski »

Do you want to extract the data in parallel jobs? IMHO it would be better to extract the data in one job, split it using a Transformer with constraints (e.g. link1: loadyear = 2006, link2: loadyear = 2005, etc.), and land the data into sequential files which are then transformed in parallel.
When all the needed transformations, data cleansing activities, etc. are done, use a Link Collector to put the data back together into one sequential file and then insert it (or use a bulk loader) into the destination table in your DW.
Regards,
Wojciech Nogalski
pradkumar
Charter Member
Posts: 393
Joined: Wed Oct 18, 2006 1:09 pm

Post by pradkumar »

I think running multiple instances will be better in terms of performance, so I am opting for that approach.
Pradeep Kumar
wnogalski
Charter Member
Posts: 54
Joined: Thu Jan 06, 2005 10:49 am
Location: Warsaw

Post by wnogalski »

It depends on what your source is.
Regards,
Wojciech Nogalski