Hi Performance job design

Post questions here relative to DataStage Enterprise/PX Edition for such areas as Parallel job design, Parallel datasets, BuildOps, Wrappers, etc.

Moderators: chulett, rschirm, roy

dsuser_cai
Premium Member
Posts: 151
Joined: Fri Feb 13, 2009 4:19 pm

Hi Performance job design

Post by dsuser_cai »

Hi

I have a requirement to create jobs that populate a table (insert only). Here are the table details.

The source table has nearly 50 columns, and the target table is something like this: col1_pk, col2, col3_fk, col4_fk, ... col59. I have indicated the primary key and foreign keys with PK and FK. The target has only one primary key and two foreign keys.

So I designed a parallel job that extracts data from the source (ODBC stage), uses Lookup stages to look up against the foreign key tables on the target side, merges the results with a Merge stage, then uses another Lookup against the target to check whether each row already exists, then generates keys and loads the target (insert only).

But my manager said this is a poor design and that I need to remove the Lookup stages, or use far fewer of them (ultimately reducing the number of stages), and then redesign the job. Also, the source table doesn't have a creation date or update date to drive a delta extract, so my design uses a full extract (assuming the source database is kept up to date). One more thought: to remove the Lookup stages I could write a SQL query that fetches the foreign keys. The data volume is going to be huge (millions of rows). Can anybody suggest a design approach or some points where I can make improvements? Any help would be appreciated.

I used Auto partitioning everywhere.
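To illustrate the SQL idea: resolve the foreign keys with joins in the extract query instead of separate Lookup stages. This is only a minimal sketch using SQLite with made-up table and column names (`dim_a`, `dim_b`, `src` are hypothetical, not the real schema):

```python
import sqlite3

# In-memory database standing in for the target-side tables (hypothetical names).
conn = sqlite3.connect(":memory:")
cur = conn.cursor()
cur.executescript("""
CREATE TABLE dim_a (a_key INTEGER PRIMARY KEY, a_code TEXT UNIQUE);
CREATE TABLE dim_b (b_key INTEGER PRIMARY KEY, b_code TEXT UNIQUE);
CREATE TABLE src   (row_id INTEGER, a_code TEXT, b_code TEXT, col4 TEXT);

INSERT INTO dim_a VALUES (1, 'A1'), (2, 'A2');
INSERT INTO dim_b VALUES (10, 'B1'), (20, 'B2');
INSERT INTO src   VALUES (100, 'A1', 'B2', 'x'), (101, 'A2', 'B1', 'y');
""")

# Resolve both foreign keys in one pass with joins, instead of two Lookup stages.
rows = cur.execute("""
    SELECT s.row_id, da.a_key AS col3_fk, db.b_key AS col4_fk, s.col4
    FROM src s
    JOIN dim_a da ON da.a_code = s.a_code
    JOIN dim_b db ON db.b_code = s.b_code
    ORDER BY s.row_id
""").fetchall()

for r in rows:
    print(r)  # each row comes back with its surrogate keys already attached
```

In the real job this join would live in the source SQL of the ODBC stage, pushing the lookup work down to the database.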

Thanks
Karthick
jcthornton
Premium Member
Posts: 79
Joined: Thu Mar 22, 2007 4:58 pm
Location: USA

Post by jcthornton »

FYI - Feel free to use returns and whitespace to make your explanation and question easier to read.

For your design, you have a number of questions that you indirectly asked. I am going to try to sketch out the job the way I understand your explanation. Then I'll put down some things to think about.

Unfortunately, ETL is not simply a recipe. There are too many variables and conflicting design parameters. For the best performance, you will need to test different adjustments to find what offers the best balance of performance measurements within your environment.

ODBC -> Copy => Multiple Lookups (using ODBC reference link) => Merge -> Lookup (ODBC reference link) -> Keygen -> Ins ODBC

1. What do you do with Update records? If you drop the update records, move the lookup that checks for the existence of a record to the beginning. Best practice is to retrieve as few records as you can and to do as little processing as possible before discarding a record. In other words - as few records as possible as early as possible.

2. On all the lookups, you have to choose the strategy you will use. Your options include the Merge, Join, or Lookup stages. You also have to decide among a database lookup (sparse or normal), a Lookup File Set, a Sequential File, or a Data Set for your reference/update link(s).

The choices on lookups are significant, and have been discussed many times here. Do a search on the forums and talk with anybody local that has some DataStage experience for pointers. If you have a specific question about strategy and cannot find it in the forums, then post a follow-up, specific question. The answer will probably be "It Depends", but a specific question with more details does have a better chance of being able to be given a usable answer.

3. As I understand your explanation, you can do the same job without the Merge stage. Move the lookups in-line.

4. If you have not done it already, you can also combine the lookups into a single lookup with multiple reference links. Doing this has a potential downside (can be worked around) of not being able to tell which lookup failed if you use a reject link.
dsuser_cai
Premium Member
Posts: 151
Joined: Fri Feb 13, 2009 4:19 pm

Post by dsuser_cai »

Hi jcthornton

Thank you so much. Your suggestions were very useful.

1) "What do you do with update records?" This really made me think. I will move the target lookup to the beginning, so I will be pulling and processing far less data. Also, we are going to add a creationdate and updatedate to the source, so that I can pull only the differential (delta) data.
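Once those columns exist, the delta extract amounts to a filter on the last run time. A tiny sketch with made-up rows (in the real job this would be a WHERE clause on updatedate in the source SQL):

```python
from datetime import datetime

# Timestamp of the previous successful run (would normally be persisted).
last_run = datetime(2009, 6, 1)

rows = [
    {'id': 1, 'updatedate': datetime(2009, 5, 20)},  # unchanged since last run
    {'id': 2, 'updatedate': datetime(2009, 6, 3)},   # changed -> pull it
]

# Keep only rows touched after the last run: the delta.
delta = [r for r in rows if r['updatedate'] > last_run]
print([r['id'] for r in delta])  # [2]
```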
Thanks
Karthick
mansoor_nb
Participant
Posts: 48
Joined: Wed Jun 01, 2005 7:10 am

Post by mansoor_nb »

Instead of a lookup, one can use the Change Capture stage to identify the insert and update records. The Change Capture stage is also efficient performance-wise.
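As a rough illustration in plain Python (with hypothetical rows), Change Capture effectively compares the "before" (existing target) and "after" (incoming) row sets on the key and tags each row:

```python
# Before: existing target rows, keyed on the primary key. After: incoming rows.
before = {1: {'name': 'alice', 'city': 'NY'},
          2: {'name': 'bob',   'city': 'LA'}}
after  = {2: {'name': 'bob',   'city': 'SF'},   # changed -> update
          3: {'name': 'carol', 'city': 'CHI'}}  # new     -> insert

inserts, updates = [], []
for key, row in after.items():
    if key not in before:
        inserts.append((key, row))
    elif row != before[key]:
        updates.append((key, row))
    # identical rows are dropped ("copy" records, in Change Capture terms)

print(inserts)  # [(3, {'name': 'carol', 'city': 'CHI'})]
print(updates)  # [(2, {'name': 'bob', 'city': 'SF'})]
```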
ray.wurlod
Participant
Posts: 54607
Joined: Wed Oct 23, 2002 10:52 pm
Location: Sydney, Australia

Post by ray.wurlod »

Does "Hi" performance mean your job generates a "Hello" message?

Please strive for a professional standard of written English on DSXchange. It helps those whose first language is not English.
IBM Software Services Group
Any contribution to this forum is my own opinion and does not necessarily reflect any position that IBM may hold.
chulett
Charter Member
Posts: 43085
Joined: Tue Nov 12, 2002 4:34 pm
Location: Denver, CO

Post by chulett »

I guess so, because if you speed the job up you can say "Hello performance!". :wink:
-craig

"You can never have too many knives" -- Logan Nine Fingers