Partitioning in SMP

Post questions here relative to DataStage Enterprise/PX Edition for such areas as Parallel job design, Parallel datasets, BuildOps, Wrappers, etc.

Moderators: chulett, rschirm, roy

lstsaur
Participant
Posts: 1139
Joined: Thu Oct 21, 2004 9:59 pm

Post by lstsaur »

Please read the chapter on the configuration file for an SMP in the Parallel Job Developer's Guide; it will answer your doubt.
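For reference, a minimal two-node SMP configuration file might look like the sketch below. The hostname and all paths are assumptions; adjust them to your installation. Note that both logical nodes carry the same fastname, because on SMP they run on the same physical machine:

```
{
  node "node1"
  {
    fastname "dsserver"
    pools ""
    resource disk "/data/datasets/d1" {pools ""}
    resource scratchdisk "/data/scratch/s1" {pools ""}
  }
  node "node2"
  {
    fastname "dsserver"
    pools ""
    resource disk "/data/datasets/d2" {pools ""}
    resource scratchdisk "/data/scratch/s2" {pools ""}
  }
}
```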
ray.wurlod
Participant
Posts: 54607
Joined: Wed Oct 23, 2002 10:52 pm
Location: Sydney, Australia
Contact:

Post by ray.wurlod »

1) Not necessary, but wise. You get the same benefits as on MPP.

2) Your result is an artifact of using a small data volume. For large data volumes, you will get a quicker completion time using two nodes versus using one.

3) Partitioning works in exactly the same way on an SMP system, with just two exceptions: Entire partitioning is managed by creating one Data Set in shared memory, and all Section Leader processes can be started with fork(), so there is no requirement for configuring rsh.
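As a rough illustration of that last point: on SMP the conductor can create its Section Leader processes directly with fork() instead of launching them on remote hosts via rsh. A minimal Python sketch of the idea (this is not DataStage code; the function and its structure are invented for illustration, Unix only):

```python
import os

def start_section_leaders(num_nodes):
    """Spawn one child process per logical node via fork() (Unix only)."""
    pids = []
    for node in range(num_nodes):
        pid = os.fork()
        if pid == 0:
            # Child: a real Section Leader would set up its player
            # processes for this logical node here.
            os._exit(0)
        pids.append(pid)  # parent records each child's pid
    for pid in pids:
        os.waitpid(pid, 0)  # wait for all section leaders to finish
    return pids
```

Because fork() duplicates the conductor locally, no remote-shell configuration is needed; on MPP, by contrast, leaders on other machines must be started through rsh or an equivalent remote launcher.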
IBM Software Services Group
Any contribution to this forum is my own opinion and does not necessarily reflect any position that IBM may hold.
mavrick21
Premium Member
Posts: 335
Joined: Sun Apr 23, 2006 11:25 pm

Post by mavrick21 »

Ray,

Using a 2-node config file results in more processes being spawned, but how does that guarantee faster execution when my job, while running, consumes 98 to 100% of both CPUs? Running on a 2-node config would add context-switching overhead among the processes, and maybe that's why my job took longer to complete on 2 nodes than on 1 node.
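That context-switching argument can be demonstrated outside DataStage with a small CPU-bound experiment. The sketch below (invented for illustration; actual timings vary by machine, so none are claimed here) runs the same fixed amount of work with different pool sizes; on a 2-CPU box, going past 2 workers only adds scheduling overhead:

```python
import multiprocessing as mp
import time

def burn(n):
    # Purely CPU-bound work: sum of squares 0..n-1.
    return sum(i * i for i in range(n))

def timed_run(workers, tasks=8, n=2_000_000):
    """Run `tasks` CPU-bound jobs on a pool of `workers` processes."""
    with mp.Pool(workers) as pool:
        start = time.perf_counter()
        results = pool.map(burn, [n] * tasks)
        elapsed = time.perf_counter() - start
    return elapsed, results

if __name__ == "__main__":
    # With 2 CPUs, expect 2 workers to beat 1, and 8 workers to be no
    # better than 2 (the extra processes just context-switch).
    for w in (1, 2, 8):
        elapsed, _ = timed_run(w)
        print(f"{w} workers: {elapsed:.2f}s")
```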

A brief layout of my job, on which I experimented with 1-node and 2-node config files (with appropriate partitioning):

Code: Select all

                 DataSet
                 [ 5.5 mil records,
                 250 fields]
                       |
                       |
Sequential file  --> Lookup --> Transformer --> Filter ----> Funnel --> Dataset
[1 record,                                     [9 links out
2 fields]                                        from filter]

Since I didn't find any improvement in execution time with 2 nodes, I reverted to the 1-node config file.

Thanks for your time.
chulett
Charter Member
Posts: 43085
Joined: Tue Nov 12, 2002 4:34 pm
Location: Denver, CO

Post by chulett »

ray.wurlod wrote:2) Your result is an artifact of using a small data volume. For large data volumes, you will get a quicker completion time using two nodes versus using one.
I would say that one input record counts as a 'small data volume'. And what kind of parallel processing do you think would be going on in a job that processes a single record? How many rows come from the lookup to the target? I'm wondering if the answer is 1 or 5.5 million.

Out of curiosity, is that just a testing volume and it will be a great deal larger in reality, or is that all it will ever do? :?
-craig

"You can never have too many knives" -- Logan Nine Fingers
mavrick21
Premium Member
Posts: 335
Joined: Sun Apr 23, 2006 11:25 pm

Post by mavrick21 »

How many rows come from the lookup to the target? I'm wondering if the answer is 1 or 5.5 million.
5.5 million records get populated in the target.
Out of curiousity, is that just a testing volume and it will be a great deal larger in reality or is that all it will ever do?
It's running in production, so the number of records may be between 5 and 6 million.
priyadarshikunal
Premium Member
Posts: 1735
Joined: Thu Mar 01, 2007 5:44 am
Location: Troy, MI

Post by priyadarshikunal »

when it runs, consumes 98 to 100% of both the CPUs.
Since you have 2 CPUs and resource utilization is high, increasing the number of nodes will not give you better results.

If you are increasing the number of nodes, you need to make sure that there are enough resources available for the extra processes created.
You get the same benefits as on MPP.
Ray is correct on this point, but it should be extended to include resource availability.
Priyadarshi Kunal

Genius may have its limitations, but stupidity is not thus handicapped. :wink:
stefanfrost1
Premium Member
Posts: 99
Joined: Mon Sep 03, 2007 7:49 am
Location: Stockholm, Sweden

Post by stefanfrost1 »

Code: Select all

                 DataSet 
                 [ 5.5 mil records, 
                 250 fields] 
                       | 
                       | 
Sequential file  --> Lookup --> Transformer --> Filter ----> Funnel --> Dataset
[1 record,                                     [9 links out
2 fields]                                        from filter]
I would suggest flipping this job so that you can enable streaming of data. The current approach must buffer all 5.5 million lookup records before any further processing can happen. Streaming data is very important.
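To make the buffering point concrete: a Lookup stage must materialise its entire reference input in memory before the first stream row can be processed. A hypothetical Python sketch (invented names, not DataStage internals) of the current layout versus the flipped one, where only the 1-record input is buffered:

```python
def lookup_stage(stream, reference):
    """Lookup-style processing: the ENTIRE reference input is loaded
    into memory before the first stream row can flow through."""
    ref = dict(reference)        # blocks until all reference rows are read
    for key, row in stream:
        yield row, ref.get(key)  # only then does the stream flow row by row

def flipped_lookup(big_stream, small_reference):
    """Flipped layout: the big input streams; only the small input
    (here, 1 record with 2 fields) is buffered."""
    ref = dict(small_reference)  # trivially small buffer
    for key, row in big_stream:
        yield row, ref.get(key)
```

In the flipped version the 5.5-million-row input never has to be held in memory at once, so rows can stream through the downstream stages as soon as they are read.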
-------------------------------------
http://it.toolbox.com/blogs/bi-aj
my blog on delivering business intelligence using agile principles
goutam
Premium Member
Posts: 109
Joined: Thu Jul 26, 2007 6:53 am

Post by goutam »

I guess one can see the real performance difference between 1 node and 2 nodes if the Lookup stage is replaced by a Join stage.

In the current situation, most of the execution time is spent in the Lookup stage, no matter which config file is used.
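For comparison, a join over sorted, identically partitioned inputs proceeds as a merge: each partition consumes both sides sequentially without fully buffering either. A simplified Python sketch of an inner merge join (invented for illustration; assumes both inputs are sorted on the key with unique keys on each side):

```python
def merge_join(left, right):
    """Inner join of two inputs sorted on the join key.
    Each side is consumed sequentially; nothing is fully buffered."""
    li, ri, out = 0, 0, []
    while li < len(left) and ri < len(right):
        lk, lrow = left[li]
        rk, rrow = right[ri]
        if lk == rk:
            out.append((lk, lrow, rrow))
            li += 1          # assumes unique keys on both sides
            ri += 1
        elif lk < rk:
            li += 1          # advance the side with the smaller key
        else:
            ri += 1
    return out
```

Because each partition joins only its own slice of both inputs, the work genuinely splits across nodes, which is why a Join stage tends to show the 1-node vs 2-node difference more clearly than a Lookup against a large reference.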
Goutam Sahoo
mavrick21
Premium Member
Posts: 335
Joined: Sun Apr 23, 2006 11:25 pm

Post by mavrick21 »

@ Everyone,

Thanks for all your inputs.