Does the number of Transformer stages affect performance?

kaps · Post by **kaps** » Thu Jul 24, 2014 9:11 am

I read somewhere that using too many Transformer stages in parallel jobs affects performance because they need to be compiled into C++, but I later remember reading that this does not affect performance in newer versions. Can someone tell me if that's true? Why is it not a problem now? Basically, we can use either a Filter stage or a Transformer stage to filter rows, but does the Transformer stage affect performance in this case?

Thanks

PaulVL · Post by **PaulVL** » Thu Jul 24, 2014 9:20 am

Well, I would think that what you are doing within the transformers is really going to drive the answer to your question.

More stuff in a job will always lend itself to affecting performance vs less stuff in a job.

chulett · Post by **chulett** » Thu Jul 24, 2014 9:33 am

However, as a general statement, while transformers were a 'performance issue' in the beginning, that hasn't been true for quite some time. So simply using transformers in a job should no longer be a concern, but it is best to use them only when actually needed - if the work can be done by something native to the framework, like a Filter or Modify stage, use those instead.

ray.wurlod · Post by **ray.wurlod** » Thu Jul 24, 2014 11:20 pm

Two issues are being confused here. One is the performance of ONE Transformer stage, which was a problem in earlier versions but isn't any more. The other is the performance of TOO MANY stages (of whatever type). Every stage (ignoring operator combination) will generate an additional process; too many processes will overload your server. How many is too many depends on what they are doing; you can monitor their resource consumption with various tools, including Monitor view in Director, the Performance Analyzer in Designer, and/or the DataStage Operations Console.
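As a rough illustration of the process-count point, here is a back-of-the-envelope sketch in Python. It assumes, as a simplification, that each stage not removed by operator combination runs as one process per partition. The function name and the numbers are hypothetical and do not come from any DataStage tool.

```python
def estimated_processes(stages: int, partitions: int, combined_away: int = 0) -> int:
    """Rough count of stage processes for one parallel job.

    Assumption (simplified): every stage that survives operator
    combination runs as one process on each partition.
    """
    if stages < 0 or partitions < 1 or not 0 <= combined_away <= stages:
        raise ValueError("inconsistent stage/partition counts")
    return (stages - combined_away) * partitions


# Hypothetical job: source -> three chained Transformers -> target,
# compared with the same job using one consolidated Transformer,
# both on a 4-partition configuration.
chained = estimated_processes(stages=5, partitions=4)
consolidated = estimated_processes(stages=3, partitions=4)
print(chained, consolidated)
```

The estimate only shows how the count scales with stages and partitions; whether a given count is "too many" still depends on what each process does, so the monitoring tools above remain the real answer.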

chulett · Post by **chulett** » Fri Jul 25, 2014 7:33 am

Not confused... which is why I specifically noted I was making a general statement.

ray.wurlod · Post by **ray.wurlod** » Fri Jul 25, 2014 4:29 pm

Conflated, then.

chulett · Post by **chulett** » Fri Jul 25, 2014 8:54 pm

That's it, conflated. Definitely... conflated. Yah.

IBM Analytics Champion 2009 - 2020 · Post by **asorrell** » Sat Jul 26, 2014 8:57 am

Right now I'm seeing lots of poorly designed jobs with multiple connected transformers:

transformer -> transformer -> transformer

Usually this is because the developer had a poor understanding of what could be done in a single transformer. Much of this is related to them not understanding the execution order that occurs in a transformer (which got a bit more complex with transformer looping).

So far I haven't encountered a single case that couldn't be consolidated into one transformer. Not only does it simplify the job, it also improves performance. As Ray said, most of the performance improvement comes from the fact that you are reducing resource usage (processes, memory) and eliminating several in-memory transfers of data.

Worst one I ever encountered was a job with a mess of about eight transformer stages connected with copy stages and funnels. The whole job was so bad I just trashed it and re-wrote it. Only required one transformer in the end and it ran orders of magnitude faster.

Note: conflation++

kaps · Post by **kaps** » Sat Jul 26, 2014 10:26 pm

Thanks for all the valuable replies.

Ray, can you tell me how the problem with a single Transformer in a job was resolved in newer versions?
Also, my original question asked about comparing the Filter stage and the Transformer stage. If I use a Transformer stage just to filter records, is it going to affect performance, since it's not native to DataStage?

eostic · Post by **eostic** » Sun Jul 27, 2014 7:39 am

Waaaay back (like in 6.x), Transformer Stages weren't as efficient in generating C++ code directly, or at all [memory is fading over time ; ) ]. That was a big part of it. ...at that time, using a Modify was probably the best way to go. That is OLD history. Great examples above about why you don't want a "ton" of transformers, but the pure idea of "using a Transformer" being a problem is long gone. The overhead is mostly just lots of stages, etc. as noted in the excellent points already made.

Ernie

ray.wurlod · Post by **ray.wurlod** » Sun Jul 27, 2014 2:46 pm

kaps wrote: If I use transformer stage to just filter records, Is it going to affect the performance as it's not native to DataStage ?

On the other hand, the transform operator is a directly compiled component, whereas the filter operator is more like interpreted (not strictly correct, as it uses a pre-built object for its actual work).

May I suggest that you build two jobs to compare the performance, and make sure that you use a statistically significant volume of data?
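A minimal sketch of how such a comparison might be judged, assuming each job is run several times and the elapsed seconds are recorded by hand. The timings below are made-up placeholders, and Welch's t statistic is just one reasonable way to ask whether a gap is bigger than run-to-run noise.

```python
from math import sqrt
from statistics import mean, variance


def welch_t(a, b):
    """Welch's t statistic for two independent samples of run times."""
    if len(a) < 2 or len(b) < 2:
        raise ValueError("need at least two runs per job")
    # Standard error of the difference between the two sample means.
    se = sqrt(variance(a) / len(a) + variance(b) / len(b))
    if se == 0:
        raise ValueError("no variation in either sample")
    return (mean(a) - mean(b)) / se


# Hypothetical elapsed times in seconds for repeated runs of each job.
transformer_runs = [301.0, 298.5, 303.2, 299.8, 300.6]
filter_runs = [302.1, 300.9, 304.0, 299.5, 303.3]

t = welch_t(transformer_runs, filter_runs)
print(f"mean difference: {mean(transformer_runs) - mean(filter_runs):.2f} s, t = {t:.2f}")
```

A |t| well below about 2 suggests the difference is within the noise of repeated runs; a single run of each job cannot tell you that.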

qt_ky · Post by **qt_ky** » Fri Aug 01, 2014 11:06 pm

Not to confound the matter, but the Transformer stage has been native to DataStage since day 1; just not native to the Orchestrate operators. Confuted?

ray.wurlod · Post by **ray.wurlod** » Sun Aug 03, 2014 5:45 pm

Confuted indeed.

Parallel jobs use the transform operator - the parallel Transformer stage is merely a convenient (= GUI) way of setting it up.

kaps · Post by **kaps** » Tue Aug 05, 2014 4:12 pm

Ray

I tested the performance as you suggested and did not find much difference between the two stages. Up to 10 million records with 4 small columns, the time taken is the same for both, and when I made it 100 million records, the job with the Transformer stage actually finished 2 secs earlier than the job with the Filter stage.

Basically, the job design is Row Generator -> Transformer (or Filter) -> Sequential File.

So, can we conclude that using a Filter stage instead of a Transformer stage does not improve performance, or that using a Transformer stage does not adversely affect performance?

Thanks

priyadarshikunal · Post by **priyadarshikunal** » Wed Aug 06, 2014 8:30 am

Read the Redbook; it already mentions that using a Transformer is suggested instead of Filter and Switch wherever it can be used.

Ray and others have already mentioned that Transformers are not much of an overhead nowadays. The discussion went from there to using too many stages in general and then to the history of DataStage.

So I do not understand the point you are trying to make here. Am I missing anything?