Performance Tuning needed in my Job

kashif007
Premium Member
Posts: 216
Joined: Wed Jun 07, 2006 5:48 pm
Location: Teaneck

Performance Tuning needed in my Job

Post by kashif007 »

Hi Guys

I have built a parallel job that reads data from two tables residing on an Oracle 8 database. I used the Dynamic RDBMS stage to read the two tables. Table 1 has around 2.5 million records (15 columns) and Table 2 has around 5.3 million records (10 columns). I use a Lookup stage (Auto partitioning) for the lookup logic; I was not sure whether a join would improve performance. After the lookup I have a Transformer stage that does some business transformation. This Transformer runs sequentially within the parallel job, since I have logic that generates sequential numbers (1, 2, 3, 4, ...) for one of the columns. Finally, I write the data to a flat file. The job takes around 18 minutes to process 5.1 million records from the two sources into one file.

Is there any way I can tweak the job and reduce the runtime to less than 18 minutes? I am hoping to bring it down to 10 minutes at most. Please advise.

Thanks
Regards
Kashif Khan
DSguru2B
Charter Member
Posts: 6854
Joined: Wed Feb 09, 2005 3:44 pm
Location: Houston, TX

Post by DSguru2B »

Replace the lookup with a join, and make sure you sort and hash-partition both inputs on the join keys.
Lose the sequential mode in the Transformer that generates the running numbers and use vmcburney's post in the FAQ to generate the running numbers in parallel.
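For reference, the parallel numbering technique from that FAQ is usually implemented as a single derivation in the parallel Transformer, something along these lines (a minimal sketch; the output column name SEQ_NO is hypothetical):

    Output column SEQ_NO (hypothetical name) derivation:
    (@INROWNUM - 1) * @NUMPARTITIONS + @PARTITIONNUM + 1

@INROWNUM is the row number within the current partition, @NUMPARTITIONS is the number of partitions the stage runs on, and @PARTITIONNUM is the zero-based partition number, so each partition produces a non-overlapping, increasing set of numbers without the stage having to run sequentially. The numbers are unique, though gaps can appear if the partitions end up with unequal row counts.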
Creativity is allowing yourself to make mistakes. Art is knowing which ones to keep.
samsuf2002
Premium Member
Posts: 397
Joined: Wed Apr 12, 2006 2:28 pm
Location: Tennessee

Post by samsuf2002 »

A couple of options:

1. You can try joining the two tables in the same source query, making sure the join columns are properly indexed (test the query in Toad first if you are using it); see the SQL sketch after this list.

2. Try using a Join stage if your reference data is larger than your source data.

3. I am not sure, but you may be able to use the Surrogate Key Generator stage to generate your sequence number.

4. Use hash partitioning on the lookup keys if you want to keep the Lookup stage.
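To illustrate option 1, the join can be pushed into the source stage as user-defined SQL roughly like this (a sketch only; the table and column names are hypothetical placeholders, and the WHERE-clause join is used because Oracle 8 predates ANSI join syntax):

    SELECT t1.acct_id,
           t1.acct_name,
           t2.txn_amt,
           t2.txn_dt
    FROM   table_one t1,
           table_two t2
    WHERE  t1.acct_id = t2.acct_id

That way the database does the join and the job reads a single, already-joined stream, so the Join/Lookup stage disappears from the design altogether.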

Did you actually check which part of your job is taking the most time?
hi sam here
kashif007
Premium Member
Posts: 216
Joined: Wed Jun 07, 2006 5:48 pm
Location: Teaneck

Post by kashif007 »

OK, I have used the Join stage (hash partitioned and sorted on the key fields) and set the Transformer back to its default parallel mode, using vmcburney's logic for generating the surrogate key. The run time drops from 18 minutes 30 seconds to 15 minutes. The throughput drops as data moves from the Join stage to the Transformer and then to the file. In the Transformer I have null-handling logic, a couple of concatenations and a couple of timestamp conversions; the total number of columns processed in the Transformer is around 70. Is this the best time I can get for my job?
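For context, derivations of that sort typically look something like the following in a parallel Transformer (a sketch only; the link and column names are hypothetical):

    Null handling:         If IsNull(lnk_in.MIDDLE_NAME) Then '' Else lnk_in.MIDDLE_NAME
    Concatenation:         lnk_in.FIRST_NAME : ' ' : lnk_in.LAST_NAME
    Timestamp conversion:  StringToTimestamp(lnk_in.ORDER_DT, "%yyyy-%mm-%dd %hh:%nn:%ss")

Individually these are cheap, but across roughly 70 columns and 5.1 million rows the cost adds up.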
Regards
Kashif Khan