General design principle for DataStage

A forum for discussing DataStage<sup>®</sup> basics. If you're not sure where your question goes, start here.

Moderators: chulett, rschirm, roy

wfkurtz1
Premium Member
Posts: 17
Joined: Sat Oct 30, 2010 6:59 am
Location: Philadelphia

General design principle for DataStage

Post by wfkurtz1 »

In general (don't you hate it when someone begins like that?) ... is it better to include as much logic as possible in one big job ... or ... to break the logic up into as many simple jobs as possible?

We have an ETL application with over 20 simple jobs, each not much more than one extract stage, one or two data-manipulation stages, and one load stage. These are sequenced together, and the sequence is run by dsjob, which is run by AutoSys. A very modular design, borrowed from other software development paradigms.
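For what it's worth, the AutoSys side of a setup like this is typically just a command job wrapping dsjob. A hedged sketch (JIL attribute spellings vary slightly by AutoSys version, and the project, sequence, machine, owner, and install path here are all invented):

```
insert_job: etl_daily_seq   job_type: c
command: /opt/IBM/InformationServer/Server/DSEngine/bin/dsjob -run -jobstatus MyProject seq_DailyLoad
machine: etl_server
owner: dsadm
alarm_if_fail: 1
```

The -jobstatus option makes dsjob wait for the sequence to finish and return the job's status as its exit code, which is what lets AutoSys detect a failed run.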

The reason I ask is that a veteran DSer took a look at this and asked "why so many jobs?" He said "make as few connections to databases as possible and do as much as you can on the DS server."

OK ... what do you think? And "It depends" is not the right answer ... just joking :) :)

Everybody says IT people give that answer all the time!
"The price of freedom is eternal vigilance."
-- Thomas Jefferson
ray.wurlod
Participant
Posts: 54607
Joined: Wed Oct 23, 2002 10:52 pm
Location: Sydney, Australia

Post by ray.wurlod »

It's always a compromise. The advice to minimize connections to databases is sound, but some of the work may involve no database connections and so could be a candidate for a modular design. Environmental factors, such as limited time windows for accessing source and target systems, may also contribute to the design decisions. You can still do the "T" part of the ETL even though you have to do the "E" and "L" parts at different times - here again a modular design is indicated. As noted, it depends.
IBM Software Services Group
Any contribution to this forum is my own opinion and does not necessarily reflect any position that IBM may hold.
greggknight
Premium Member
Posts: 120
Joined: Thu Oct 28, 2004 4:24 pm

Post by greggknight »

Well, my 2 cents would go like this.
I have been building DataStage jobs since version 3; I am currently using 8.5 32-bit and 8.5 64-bit.
That said, simpler is better. Why? When it comes time to determine why a job takes so long and that job has 50 stages in it, you will understand.
Second, jobs are not only parallel; pipelining is also used. This means that when a job starts, all stages start, so a continuous flow is available with no waiting on stages.
Depending on the configuration you're using (1 node or n nodes), osh processes are established (a lot of them) for a job which contains a lot of stages. Memory and CPU come into play.
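The node configuration mentioned above lives in a parallel configuration file (pointed to by APT_CONFIG_FILE). A minimal two-node sketch, with hypothetical host and path names:

```
{
    node "node1"
    {
        fastname "etl_server"
        pools ""
        resource disk "/data/ds/disk1" {pools ""}
        resource scratchdisk "/data/ds/scratch1" {pools ""}
    }
    node "node2"
    {
        fastname "etl_server"
        pools ""
        resource disk "/data/ds/disk2" {pools ""}
        resource scratchdisk "/data/ds/scratch2" {pools ""}
    }
}
```

Roughly speaking, each stage in the job can spawn an osh process per node, which is why a many-stage job on a many-node configuration multiplies the process count.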
I guess what I am saying is you need to look at the big picture. Today 8 jobs and a sequence, tomorrow 8,000 jobs.

It's too late then.
I just started migrating a data warehouse, which was coded on an AS/400, to DataStage. I have 50 dimensions and 6 facts so far, and 2,000 jobs. 60 of those jobs run 17 instances apiece at the same time. And we have just started; we will have a lot more facts before it's over. That's not including all the other jobs and projects that I will be migrating to 8.5. Like I said, you need to look at the big picture, and decide from there.
"Don't let the bull between you and the fence"

Thanks
Gregg J Knight

"Never Never Never Quit"
Winston Churchill
ray.wurlod
Participant
Posts: 54607
Joined: Wed Oct 23, 2002 10:52 pm
Location: Sydney, Australia

Post by ray.wurlod »

In my current engagement, I am using fewer than 12 jobs, but they are generic, dynamic and multi-instance. (This was the customer's requirement, as well as an interesting technical challenge.) For example, SQL statements are generated "on the fly" from information in the system tables. Many things are parameterised.
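As an illustration of the "SQL on the fly" idea (a generic sketch, not Ray's actual implementation — the table, columns, and date-normalisation rule here are invented), a generic extract job's driver might build its statement from column metadata queried out of the database's system tables:

```python
# Sketch: generate a SELECT statement from column metadata, as a
# generic extract job might after querying e.g. the RDBMS catalog.
# The hard-coded metadata below stands in for that catalog query.

def build_select(table, columns):
    """Build a SELECT over (name, type) pairs, normalising DATE
    columns to a canonical string format so every generated
    extract produces consistent output."""
    exprs = []
    for name, dtype in columns:
        if dtype == "DATE":
            exprs.append(f"TO_CHAR({name}, 'YYYY-MM-DD') AS {name}")
        else:
            exprs.append(name)
    return f"SELECT {', '.join(exprs)} FROM {table}"

# Hypothetical metadata for one source table
cols = [("CUST_ID", "NUMBER"), ("CUST_NAME", "VARCHAR2"), ("CREATED", "DATE")]
print(build_select("CUSTOMERS", cols))
```

The generated statement would then be passed into the multi-instance job as a parameter, so one job design serves every source table.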
IBM Software Services Group
Any contribution to this forum is my own opinion and does not necessarily reflect any position that IBM may hold.
kduke
Charter Member
Posts: 5227
Joined: Thu May 29, 2003 9:47 am
Location: Dallas, TX

Post by kduke »

I think there is a trade-off between performance and keeping it simple. The fewer stages in a job, the easier it is to follow and therefore to modify. The more times you land the data, the longer it takes to run from start to finish. There are always exceptions to the rule, and business requirements that force you to do things you might not prefer doing. A lot of source systems are stretched to the limit, so you have to extract the data as quickly as possible and land it, then offload as much processing as possible. This changes your design.

I have seen a lot of jobs be just as hard to follow as one big job, so 8,000 jobs with 3 stages may be worse than 800 jobs with 30 stages on average. I find that consistency is more important than either of these. I have seen 8,000 jobs work just as smoothly as 800 jobs, especially if the jobs are very similar and the naming conventions are good.

Quality comes in many shapes and sizes. Try to be open to new ideas. Sometimes you might get surprised.
Mamu Kim
swapnilverma
Participant
Posts: 135
Joined: Tue Aug 14, 2007 4:27 am
Location: Mumbai

Post by swapnilverma »

You can decide how to break up your jobs based on the factors below:

Complexity - At present you have simple jobs, which is good. Too simple, though, is not a good choice.

Restartability - In case of any issue or error, from which point can you resume processing? (Can you restart the job without manual changes?)

Processing time - If, after combining a few jobs, you are not getting enough of a performance benefit, is there a point to combining them? If you have a few complex jobs, this will be tricky to achieve.

So decide based on the nature of your jobs.
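The restartability point can be illustrated with a small checkpoint sketch (generic Python, not DataStage's own mechanism — sequence jobs offer a built-in restartable-checkpoint option that does the equivalent natively): completed steps are recorded, and a rerun skips them.

```python
# Sketch of checkpoint-based restartability: each completed step is
# persisted, so a rerun after a failure resumes at the failed step
# instead of repeating work. Step names here are illustrative.
import json
import os

CHECKPOINT = "run_checkpoint.json"

def load_done():
    """Return the set of step names completed on previous runs."""
    if os.path.exists(CHECKPOINT):
        with open(CHECKPOINT) as f:
            return set(json.load(f))
    return set()

def run_steps(steps):
    """Run each (name, fn) step, skipping those already checkpointed."""
    done = load_done()
    for name, fn in steps:
        if name in done:
            continue  # already completed on a previous run
        fn()
        done.add(name)
        with open(CHECKPOINT, "w") as f:
            json.dump(sorted(done), f)  # persist after every step

# Demo: a fresh run executes all three steps exactly once.
if os.path.exists(CHECKPOINT):
    os.remove(CHECKPOINT)
results = []
steps = [("extract", lambda: results.append("E")),
         ("transform", lambda: results.append("T")),
         ("load", lambda: results.append("L"))]
run_steps(steps)
```

Calling run_steps(steps) a second time does nothing, because every step is already checkpointed — which is exactly the "restart without manual changes" property above.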

Hope it helps.
Thanks
Swapnil

"Whenever you find whole world against you just turn around and Lead the world"
FranklinE
Premium Member
Posts: 739
Joined: Tue Nov 25, 2008 2:19 pm
Location: Malvern, PA

Post by FranklinE »

The one concept not mentioned yet is reusability. I have a small app (one that will quadruple in size by the time we're done) that has the identical initial input stage for every job. It's parameterized according to the source data configuration (z/OS mainframe datasets generated by COBOL modules). After working out some design issues with the first few jobs, I've been reusing that basic stage and will continue to do so for the rest of the project.

That's a small-scale example. I would assume -- with some confidence -- that a large application expected to have hundreds of jobs will have several opportunities along those lines.
Franklin Evans
"Shared pain is lessened, shared joy increased. Thus do we refute entropy." -- Spider Robinson

Using mainframe data FAQ: viewtopic.php?t=143596 Using CFF FAQ: viewtopic.php?t=157872
kduke
Charter Member
Charter Member
Posts: 5227
Joined: Thu May 29, 2003 9:47 am
Location: Dallas, TX

Post by kduke »

Excellent point.
Mamu Kim
evee1
Premium Member
Posts: 96
Joined: Tue Oct 06, 2009 4:17 pm
Location: Melbourne, AU

Post by evee1 »

ray.wurlod wrote:In my current engagement, I am using fewer than 12 jobs, but they are generic, dynamic and multi-instance.
I have just started a project (early in design yet) that has the same goal in mind. It is a rewrite of an existing DWH using DataStage instead of the bunch of varying technologies currently used for the ETL.

Ray,
I was wondering whether you would be willing to share more thoughts on the approach you implemented. I'm particularly interested in the issues that posed the major challenges for you (if there were any :wink:).

I was also wondering about the nature and size of this project, although I understand if that information is too sensitive to share.
In addition, was there any estimate of how much development effort is saved using such an approach, as opposed to the conventional (thousands of jobs) one?
chulett
Charter Member
Posts: 43085
Joined: Tue Nov 12, 2002 4:34 pm
Location: Denver, CO

Post by chulett »

Here is Ray's post on his technical challenge:

viewtopic.php?t=138403
-craig

"You can never have too many knives" -- Logan Nine Fingers