DSXchange: DataStage and IBM Websphere Data Integration Forum
View next topic
View previous topic
Add To Favorites
Author Message
vmcburney

Premium Poster
Participant

Group memberships:
Premium Members, Inner Circle, Australia Usergroup

Joined: 23 Jan 2003
Posts: 3076
Location: Australia, Melbourne
Points: 23292

Post Posted: Sat Mar 07, 2009 7:02 pm Reply with quote    Back to top    

DataStage® Release: 7x
Job Type: Parallel
OS: Unix
A French firm called ManApps is running a public benchmark comparing DataStage to Talend, Pentaho and Informatica. They don't know much about DataStage and have some design flaws. Have a look at this job and comment on how you would improve it and I will email feedback back to ManApps.

(For more background you can read my December blog post ETL Benchmark Favours DataStage and Talend and my March update ManApps ETL Benchmark Redux: Informatica Surges to Lead .)

Scenario:
Reading X lines from a file input delimited, looking up to another file input delimited, for 4 fields using id_client column. Writing the jointure result into a file output delimited.

Parallel Job Design:
Image

File_input_delimited has three fields, file_lookup_delimited has nine fields. There are four test cases based on increasing the row volume, in all four the lookup has 100K rows and the input has 100K, 1M, 5M and 20M rows.

How can this job be improved?

_________________
Focus Strategies IM Practice Lead
We are hiring - come work for the Australian IBM Information Management Partner of the Year 2010
sima79



Group memberships:
Premium Members

Joined: 16 Jul 2007
Posts: 38
Location: Melbourne, Australia
Points: 230

Post Posted: Sat Mar 07, 2009 8:41 pm Reply with quote    Back to top    

Definately as you have posted in your blog, change the join to a lookup stage. I would also go with a 1 node configuration as well. I have had a look at the other parallel job designs and have no idea why they have chosen to use a join stage which requires a sorted input, given the number of rows in the reference file.

Another possible option is to separate the file I/O from the parsing of the columns using a sequential file and a column import stage.
Rate this response:  
Not yet rated
ray.wurlod

Premium Poster
Participant

Group memberships:
Premium Members, Inner Circle, Australia Usergroup, Server to Parallel Transition Group

Joined: 23 Oct 2002
Posts: 40572
Location: Melbourne, Australia
Points: 226407

Post Posted: Sun Mar 08, 2009 6:31 am Reply with quote    Back to top    

Lose the repartitioning by moving the partitioning to the inputs of the Modify stage especially in a multiple-machine environment. For larger volumes, reading the second file into a Lookup File S ...

_________________
Zenith Solutions (Australia) Pty Ltd
-- from obscurity to clarity
Rate this response:  
Not yet rated
stewarthanna
Participant



Joined: 09 Mar 2009
Posts: 1

Points: 6

Post Posted: Mon Mar 09, 2009 8:21 am Reply with quote    Back to top    

They should change to lookup.

Use Configuration files that match the volume, scaling from 1node to 2, 3, 4....using a larger config file will only incur startup time when processing small volumes

Remove unnecessary partitioning and sorting, lookup does not require it

Add -readers to Sequential File Stage to improve performance reading data, but I doubt this will be a bottleneck

_________________
Stewart
Rate this response:  
Not yet rated
rwierdsm


since March 2009

Group memberships:
Premium Members

Joined: 09 Jan 2004
Posts: 192
Location: Toronto, Canada
Points: 1654

Post Posted: Mon Mar 09, 2009 11:08 am Reply with quote    Back to top    

Thought I read somewhere that reading a sequential file as a single wide column and then doing a column split as the second stage makes the reads faster. Allows the column determination to happen in parallel instead of sequetial.

Rob

_________________
Toronto, Canada
bartonbishop.com
Rate this response:  
Not yet rated
priyadarshikunal
Participant



Joined: 01 Mar 2007
Posts: 784
Location: Dublin
Points: 4368

Post Posted: Mon Mar 09, 2009 12:24 pm Reply with quote    Back to top    

While using joiner also (Not a good option), modifying it a bit outperform informatica as there is marginal differnece. As Ray mentioned moving the partioning to the first link will help because its first partitioning and then repartioning.

Wonder if they are using stable sort or performing any thing inside modify stage.

Get rid of extra logging information and switch off job monitor.


and the best option would be to use lookup because it will love to handle 7 MB or 34MB or 68 MB (Test 12) of lookup file in whatever partition and with whatever number of nodes.

As they have mentioned that they changed job for informatica to use 1 partition for less number of records. why not used single node for less number of records. I think the startup time consists of around half of the time taken to process 100,000 records.

I would love to see the timings of all tools when running on data 10 times more than they are using now. but i can't see their machine much capable of handling that.

_________________
Priyadarshi Kunal

Genius may have its limitations, but stupidity is not thus handicapped. Wink
Rate this response:  
Not yet rated
ray.wurlod

Premium Poster
Participant

Group memberships:
Premium Members, Inner Circle, Australia Usergroup, Server to Parallel Transition Group

Joined: 23 Oct 2002
Posts: 40572
Location: Melbourne, Australia
Points: 226407

Post Posted: Mon Mar 09, 2009 3:08 pm Reply with quote    Back to top    

Are you permitted to use a server job for small volumes?

_________________
Zenith Solutions (Australia) Pty Ltd
-- from obscurity to clarity
Rate this response:  
Not yet rated
vmcburney

Premium Poster
Participant

Group memberships:
Premium Members, Inner Circle, Australia Usergroup

Joined: 23 Jan 2003
Posts: 3076
Location: Australia, Melbourne
Points: 23292

Post Posted: Mon Mar 09, 2009 4:31 pm Reply with quote    Back to top    

Good question - the benchmark has a set of results for server edition and parallel edition. The server job version has a hash file lookup. I will post a couple more job designs later including a ser ...

_________________
Focus Strategies IM Practice Lead
We are hiring - come work for the Australian IBM Information Management Partner of the Year 2010
Rate this response:  
Not yet rated
balajisr


since October 2006

Group memberships:
Premium Members

Joined: 28 Jul 2005
Posts: 785

Points: 3770

Post Posted: Tue Mar 10, 2009 1:16 am Reply with quote    Back to top    

Interesting Post.

What does modify_1 do? Does it drops certain columns or renames column name? If it only drop columns we can use sequential file property "Drop on Input" instead of modify stage modify_1.

Setting Env variable APT_TSORT_STRESS_BLOCKSIZE variable to an appropriate value will speed up join for huge volumes.

Another option is to use lookup stage which is already mentioned.
Rate this response:  
Not yet rated
bobyon



Group memberships:
Premium Members

Joined: 02 Mar 2004
Posts: 127
Location: Mooresville, NC
Points: 1222

Post Posted: Wed Jun 24, 2009 3:55 pm Reply with quote    Back to top    

I see this post is a little old and maybe I'm to late but how about putting in explicit sort stages so you can specify the amount of memory to use.

_________________
Bob
Senior Systems Analyst
Lowe's Home Improvement
Rate this response:  
Not yet rated
Display posts from previous:       

Add To Favorites
View next topic
View previous topic
You cannot post new topics in this forum
You cannot reply to topics in this forum
You cannot edit your posts in this forum
You cannot delete your posts in this forum
You cannot vote in polls in this forum



Powered by phpBB © 2001, 2002 phpBB Group
Theme & Graphics by Daz :: Portal by Smartor
All times are GMT - 6 Hours