Performance issue

Post questions here relative to DataStage Server Edition for such areas as Server job design, DS Basic, Routines, Job Sequences, etc.

Moderators: chulett, rschirm, roy

vcannadevula
Charter Member
Posts: 143
Joined: Thu Nov 04, 2004 6:53 am

Performance issue

Post by vcannadevula »

Hi,
I have a job that uses 7 stages: 4 DB2 UDB stages (1 for the source, 2 for reference links, 1 for the target database) and 1 hash file, plus a Transformer and an Aggregator. Input data comes from the 4 different sources into the Transformer, the output of the Transformer goes to the Aggregator, and the Aggregator writes to the target database. In the Aggregator I group by two columns and calculate the sum of another 2 columns. I ran this job for the previous month and it took 1 hour; I ran the same job for the current month and it took 10 hours!

The number of rows in the source increased by 170,000 from the previous month to this month, and the performance statistics show 112 rows/sec for all the links.

935004 rows read from stream Link
935004 rows read from Reference Link1
935004 rows read from Reference Link2
744255 rows read from Reference Link3 (hash file)
744255 rows written to Aggregator
730890 rows written to Target
I am already using user-defined SQL to filter rows in the source itself to improve performance. Can anybody see any reason why this job is taking so long?
Please respond if this makes sense to anybody!
kcbland
Participant
Posts: 5208
Joined: Wed Jan 15, 2003 8:56 am
Location: Lutz, FL

Post by kcbland »

You are having scaling issues because your job is not designed to be scalable. It is designed to be easy to see everything happening. I bet next month your job never finishes in time.

A scalable job design would utilize bulk loading, would not use direct lookups against the database during transformation, and would not directly tie a source system select to a target system insert-else-update.

You're using DataStage Server release 5.x. This means that your job is a single-threaded process, so you're only using one CPU. Even with the aggregator, data will stall while it attempts to group: if your data is not sorted, it has to receive all of the data and then spool the results. You're also not taking advantage of hash file technology to cache lookups, nor of the bulk loading mentioned above.
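For example (the table and column names here are invented purely for illustration), sorting the source extract on your two grouping columns lets the Aggregator emit each group as it completes instead of spooling the whole set first:

   SELECT GRP_COL1, GRP_COL2, AMT_1, AMT_2
   FROM   MYSCHEMA.SOURCE_TABLE
   ORDER BY GRP_COL1, GRP_COL2

(keep whatever WHERE filter you already have) and then mark those two columns as sorted ascending on the Aggregator's input link so it knows it can group on the fly.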
Kenneth Bland

Rank: Sempai
Belt: First degree black
Fight name: Captain Hook
Signature knockout: right upper cut followed by left hook
Signature submission: Crucifix combined with leg triangle
vcannadevula
Charter Member
Posts: 143
Joined: Thu Nov 04, 2004 6:53 am

Post by vcannadevula »

Hi
Thank you for your quick response. I will consider your recommendations.
What I don't understand is that when I look at the performance statistics, they show the same number of rows per second for both the previous month and this month. But when I look at the ratio of the increase in the number of rows to the number of hours the job took to complete, that's unbelievable.
kcbland
Participant
Posts: 5208
Joined: Wed Jan 15, 2003 8:56 am
Location: Lutz, FL

Post by kcbland »

It's not unbelievable that a 20% row count increase has a 10X impact on the total runtime. The time spent on round-trip lookups against the database grows in a non-linear fashion, because you are spending far more time asking the database, row by row, to go and find data. This increases your presence in the database, making you more susceptible to delay as your queries are queued. In addition, you're encountering larger rollback segments, as well as longer-held source query snapshots. The longer you take to load, the longer you hold those segments open. You're in a vicious cycle that is not helping you scale.
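Just as a back-of-the-envelope illustration (the per-lookup time is my assumption, not something from your post): 935,004 rows with two database reference lookups each is roughly 1.87 million round trips. At even 5 milliseconds per round trip, that is about 2.6 hours spent doing nothing but waiting on the database, before any query queuing, rollback segment growth, or contention is counted.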

The only way to scale is to build a design that allows you to seamlessly utilize more CPUs. The easiest method is through job instantiation. By partitioning the source data (separating it into logical groupings), you can run multiple copies of your job, each handling its own portion of the data. Each job copy (instance) is identical; it's just working on its own slice of the source data. When they are done, you simply collate or concatenate the results and then load. As your volume grows, you can simply increase the number of instances. By breaking your job design into separate jobs and using hash file lookups, you'll be able to use this technique and take advantage of cached lookups.
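For example, each instance could be handed its own slice of the source with user-defined SQL along these lines (the table, column, and parameter names are invented for illustration; MOD is available in DB2 UDB):

   SELECT CUST_ID, GRP_COL1, GRP_COL2, AMT_1, AMT_2
   FROM   MYSCHEMA.SOURCE_TABLE
   WHERE  MOD(CUST_ID, #NumInstances#) = #InstanceNum#

(keep your existing filtering predicates in addition to the MOD clause). Run one invocation per value of #InstanceNum# from 0 to #NumInstances# - 1, then concatenate the instance outputs and bulk load the combined result.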
Kenneth Bland

Rank: Sempai
Belt: First degree black
Fight name: Captain Hook
Signature knockout: right upper cut followed by left hook
Signature submission: Crucifix combined with leg triangle