Running Multiple Instances - Partition By Date

Post questions here relative to DataStage Server Edition for such areas as Server job design, DS Basic, Routines, Job Sequences, etc.

Moderators: chulett, rschirm, roy

pradkumar
Charter Member
Posts: 393
Joined: Wed Oct 18, 2006 1:09 pm

Running Multiple Instances - Partition By Date

Post by pradkumar »

Hi

I am trying to run my job as multiple instances. I have developed a job sequence with six instances of the job.

I am trying to partition the data on a timestamp; I have a create_date column in my source data.
So I am trying to use a job parameter to which I can assign a year, and in the main job's WHERE clause I am using this year, as in "WHERE creation_date = #YEAR#".

This looks promising for a FULL LOAD.

But for an incremental load, I don't think this is going to work. Please let me know any ideas or comments on this.
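Since creation_date is a date and #YEAR# holds a year, I believe the predicate needs to pull the year out of the column first. A minimal sketch of what I mean, in ANSI SQL (source_table is just a placeholder, and some databases spell the year extraction differently):

Code:

-- One instance's extraction query; #YEAR# is that instance's job parameter
SELECT *
FROM   source_table
WHERE  EXTRACT(YEAR FROM creation_date) = #YEAR#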
Pradeep Kumar
wnogalski
Charter Member
Posts: 54
Joined: Thu Jan 06, 2005 10:49 am
Location: Warsaw

Post by wnogalski »

Do you want to have the same dates (date ranges) in each file? E.g. 20070101-20070105 into file1, 20070106-20070110 into file2, etc.?
If not, then simply use a Link Partitioner stage to put the data into sequential files and then use those files in your multiple-instance job.
If yes, then this will require some calculation to derive the proper splitting parameter(s) (according to your needs), which you would then use in a Transformer stage to route rows into the proper files.
Regards,
Wojciech Nogalski
ray.wurlod
Participant
Posts: 54607
Joined: Wed Oct 23, 2002 10:52 pm
Location: Sydney, Australia

Post by ray.wurlod »

If your incremental load contains creation dates from all years, then your approach (assuming it works for full load) should work OK for incremental load.

I'm dubious about comparing something called "...date" against a job parameter called "#YEAR#" - maybe I'm more finicky about meaningful names than you are.

If your incremental loads contain only dates from the current year, then five of your six job invocations will process zero rows. While not wrong, it seems inefficient to start them at all.

You may want to clone the job for incremental loads and divide the work at a finer level of granularity - perhaps by month, perhaps by date range. Whatever you choose, you would implement it through job parameters, just as you have with the #YEAR# parameter for the full load.
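A minimal sketch of a range-based predicate, assuming two hypothetical job parameters #RangeStart# and #RangeEnd# and ANSI date literals (adjust to your database):

Code:

-- Each invocation gets its own non-overlapping range via job parameters
WHERE creation_date >= DATE '#RangeStart#'
  AND creation_date <  DATE '#RangeEnd#'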
IBM Software Services Group
Any contribution to this forum is my own opinion and does not necessarily reflect any position that IBM may hold.
loveojha2
Participant
Posts: 362
Joined: Thu May 26, 2005 12:59 am

Post by loveojha2 »

What about:

creation_date = #YEAR# and creation_date > #Last Run Date#

#Last Run Date# can be another parameter passed in from the sequence, and you need to maintain it for all the partitions. You would write routine(s) to fetch and store the last run date for each instance/partition.
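One way to maintain those per-instance dates is a small control table that the routines read and update. A sketch only, with made-up table, column, and instance names:

Code:

-- Hypothetical watermark table: one row per instance/partition
CREATE TABLE etl_last_run (
    instance_name VARCHAR(30) PRIMARY KEY,
    last_run_date DATE
);

-- Before the run: fetch the value to pass in as #Last Run Date#
SELECT last_run_date FROM etl_last_run WHERE instance_name = 'YEAR_2006';

-- After a successful run: advance the watermark
UPDATE etl_last_run SET last_run_date = CURRENT_DATE
 WHERE instance_name = 'YEAR_2006';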
Success consists of getting up just one more time than you fall.
pradkumar
Charter Member
Posts: 393
Joined: Wed Oct 18, 2006 1:09 pm

Post by pradkumar »

Thanks for all your ideas.
I am actually using two date job parameters, LoadDate and LastLoadDate.
My WHERE clause is like:

Code:

Creation_Date BETWEEN '#LastLoadDate#' AND '#LoadDate#'
So now, using a sequence with six job activities, the WHERE clause in the main job will be something like:

Code:

-- Extract the year from Creation_Date and compare it to the Year parameter from the job activity
EXTRACT(YEAR FROM Creation_Date) = #Year#

Code:

JobActivity1:  LoadDate = 2006-12-31; LastLoadDate = 2006-01-01; Year = '2006'
JobActivity2:  LoadDate = 2005-12-31; LastLoadDate = 2005-01-01; Year = '2005'
JobActivity3:  LoadDate = 2004-12-31; LastLoadDate = 2004-01-01; Year = '2004'
...and so on until 2000
The above is for the full load. By the way, I am using a LoadDate column in the final table to record the date each row was loaded, so the scheme above is going to make a mess of LoadDate. I have therefore decided to use another column, RunDate, and update it to the date the row is loaded into the data warehouse.
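To illustrate what I mean (table and column names here are invented; the ANSI DATE literal resolves once the #LoadDate# parameter is substituted):

Code:

-- Hypothetical target insert: load_date comes from the job parameter,
-- run_date is stamped with the actual warehouse load date
INSERT INTO dw_orders (order_id, creation_date, load_date, run_date)
VALUES (?, ?, DATE '#LoadDate#', CURRENT_DATE);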

Any comments on my idea, please?
Pradeep Kumar
wnogalski
Charter Member
Posts: 54
Joined: Thu Jan 06, 2005 10:49 am
Location: Warsaw

Post by wnogalski »

Do you want to extract the data in parallel jobs? IMHO it would be better to extract the data in one job, split it using a Transformer with constraints (e.g. link1: loadyear = 2006, link2: loadyear = 2005, etc.), and land the data into sequential files which are then transformed in parallel.
When all the needed transformations, data cleansing activities, etc. are done, use a Link Collector to put the data back together into one sequential file and then insert it (or use a bulk loader) into the destination table in your DW.
Regards,
Wojciech Nogalski
pradkumar
Charter Member
Posts: 393
Joined: Wed Oct 18, 2006 1:09 pm

Post by pradkumar »

I think running multiple instances will be better in terms of performance, so I am opting for that approach.
Pradeep Kumar
wnogalski
Charter Member
Posts: 54
Joined: Thu Jan 06, 2005 10:49 am
Location: Warsaw

Post by wnogalski »

It depends on what your source is.
Regards,
Wojciech Nogalski