Partitioning in SMP

Post questions here relative to DataStage Enterprise/PX Edition for such areas as Parallel job design, Parallel datasets, BuildOps, Wrappers, etc.

Moderators: chulett, rschirm, roy

lstsaur
Participant
Posts: 1139
Joined: Thu Oct 21, 2004 9:59 pm

Post by lstsaur »

Please read the chapter on the configuration file for an SMP in the Parallel Job Developer's Guide; it will answer your doubt.
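For reference, a minimal two-node SMP configuration file might look like the sketch below. The hostname and all paths are assumptions; adjust them to your installation. Note that both logical nodes carry the same fastname, because on SMP they run on the same physical machine:

```
{
  node "node1"
  {
    fastname "dsserver"
    pools ""
    resource disk "/data/datasets/d1" {pools ""}
    resource scratchdisk "/data/scratch/s1" {pools ""}
  }
  node "node2"
  {
    fastname "dsserver"
    pools ""
    resource disk "/data/datasets/d2" {pools ""}
    resource scratchdisk "/data/scratch/s2" {pools ""}
  }
}
```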
ray.wurlod
Participant
Posts: 54607
Joined: Wed Oct 23, 2002 10:52 pm
Location: Sydney, Australia
Contact:

Post by ray.wurlod »

1) Not necessary, but wise. You get the same benefits as on MPP.

2) Your result is an artifact of using a small data volume. For large data volumes, you will get a quicker completion time using two nodes versus using one.

3) Partitioning works in exactly the same way on an SMP system, with just two exceptions: Entire partitioning is managed by creating one Data Set in shared memory, and all Section Leader processes can be started with fork(), so there is no requirement for configuring rsh.
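As a rough illustration of that last point: on SMP the conductor can create its Section Leader processes directly with fork() instead of launching them on remote hosts via rsh. A minimal Python sketch of the idea (this is not DataStage code; the function and its structure are invented for illustration, Unix only):

```python
import os

def start_section_leaders(num_nodes):
    """Spawn one child process per logical node via fork() (Unix only)."""
    pids = []
    for node in range(num_nodes):
        pid = os.fork()
        if pid == 0:
            # Child: a real Section Leader would set up its player
            # processes for this logical node here.
            os._exit(0)
        pids.append(pid)  # parent records each child's pid
    for pid in pids:
        os.waitpid(pid, 0)  # wait for all section leaders to finish
    return pids
```

Because fork() duplicates the conductor locally, no remote-shell configuration is needed; on MPP, by contrast, leaders on other machines must be started through rsh or an equivalent remote launcher.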
IBM Software Services Group
Any contribution to this forum is my own opinion and does not necessarily reflect any position that IBM may hold.
mavrick21
Premium Member
Posts: 335
Joined: Sun Apr 23, 2006 11:25 pm

Post by mavrick21 »

Ray,

Using a 2-node config file results in more processes being spawned, but how does that guarantee faster execution when my job, while running, consumes 98 to 100% of both CPUs? Running on a 2-node config would add context-switching overhead among the processes, and maybe that's why my job took longer to complete on 2 nodes than on 1 node.
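That context-switching argument can be demonstrated outside DataStage with a small CPU-bound experiment. The sketch below (invented for illustration; actual timings vary by machine, so none are claimed here) runs the same fixed amount of work with different pool sizes; on a 2-CPU box, going past 2 workers only adds scheduling overhead:

```python
import multiprocessing as mp
import time

def burn(n):
    # Purely CPU-bound work: sum of squares 0..n-1.
    return sum(i * i for i in range(n))

def timed_run(workers, tasks=8, n=2_000_000):
    """Run `tasks` CPU-bound jobs on a pool of `workers` processes."""
    with mp.Pool(workers) as pool:
        start = time.perf_counter()
        results = pool.map(burn, [n] * tasks)
        elapsed = time.perf_counter() - start
    return elapsed, results

if __name__ == "__main__":
    # With 2 CPUs, expect 2 workers to beat 1, and 8 workers to be no
    # better than 2 (the extra processes just context-switch).
    for w in (1, 2, 8):
        elapsed, _ = timed_run(w)
        print(f"{w} workers: {elapsed:.2f}s")
```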

A brief layout of my job, on which I experimented with 1-node and 2-node config files (with appropriate partitioning):

Code: Select all

                 DataSet
                 [ 5.5 mil records,
                 250 fields]
                       |
                       |
Sequential file  --> Lookup --> Transformer --> Filter ----> Funnel --> Dataset
[1 record,                                     [9 links out
2 fields]                                        from filter]

Since I didn't find any improvement in execution time with 2 nodes, I reverted to the 1-node config file.

Thanks for your time.
chulett
Charter Member
Posts: 43085
Joined: Tue Nov 12, 2002 4:34 pm
Location: Denver, CO

Post by chulett »

ray.wurlod wrote:2) Your result is an artifact of using a small data volume. For large data volumes, you will get a quicker completion time using two nodes versus using one.
I would say that one input record counts as a 'small data volume'. And what kind of parallel processing do you think would be going on in a job that processes a single record? How many rows come from the lookup to the target? I'm wondering if the answer is 1 or 5.5 million.

Out of curiosity, is that just a testing volume and it will be a great deal larger in reality, or is that all it will ever do? :?
-craig

"You can never have too many knives" -- Logan Nine Fingers
mavrick21
Premium Member
Posts: 335
Joined: Sun Apr 23, 2006 11:25 pm

Post by mavrick21 »

How many rows come from the lookup to the target? I'm wondering if the answer is 1 or 5.5 million.
5.5 million records get populated in the target.
Out of curiousity, is that just a testing volume and it will be a great deal larger in reality or is that all it will ever do?
It's running in production, so the number of records may be between 5 and 6 million.
priyadarshikunal
Premium Member
Posts: 1735
Joined: Thu Mar 01, 2007 5:44 am
Location: Troy, MI

Post by priyadarshikunal »

when it runs, consumes 98 to 100% of both the CPUs.
Since you have 2 CPUs and resource utilization is high, increasing the number of nodes will not give you better results.

If you are increasing the number of nodes, you need to make sure that there are enough resources available for the extra processes created.
You get the same benefits as on MPP.
Ray is correct on this point, but it should be extended to include resource availability.
Priyadarshi Kunal

Genius may have its limitations, but stupidity is not thus handicapped. :wink:
stefanfrost1
Premium Member
Posts: 99
Joined: Mon Sep 03, 2007 7:49 am
Location: Stockholm, Sweden

Post by stefanfrost1 »

Code: Select all

                 DataSet 
                 [ 5.5 mil records, 
                 250 fields] 
                       | 
                       | 
Sequential file  --> Lookup --> Transformer --> Filter ----> Funnel --> Dataset
[1 record,                                     [9 links out
2 fields]                                        from filter]
I would suggest flipping this job so that you can enable streaming of data. The current approach must buffer all 5.5 million lookup records before any further processing can happen. Streaming data is very important.
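To make the buffering point concrete: a Lookup stage must materialise its entire reference input in memory before the first stream row can be processed. A hypothetical Python sketch (invented names, not DataStage internals) of the current layout versus the flipped one, where only the 1-record input is buffered:

```python
def lookup_stage(stream, reference):
    """Lookup-style processing: the ENTIRE reference input is loaded
    into memory before the first stream row can flow through."""
    ref = dict(reference)        # blocks until all reference rows are read
    for key, row in stream:
        yield row, ref.get(key)  # only then does the stream flow row by row

def flipped_lookup(big_stream, small_reference):
    """Flipped layout: the big input streams; only the small input
    (here, 1 record with 2 fields) is buffered."""
    ref = dict(small_reference)  # trivially small buffer
    for key, row in big_stream:
        yield row, ref.get(key)
```

In the flipped version the 5.5-million-row input never has to be held in memory at once, so rows can stream through the downstream stages as soon as they are read.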
-------------------------------------
http://it.toolbox.com/blogs/bi-aj
my blog on delivering business intelligence using agile principles
goutam
Premium Member
Posts: 109
Joined: Thu Jul 26, 2007 6:53 am

Post by goutam »

I guess one can see the real performance difference between 1 node and 2 nodes if the Lookup stage is replaced by a Join stage.

In the current situation, most of the execution time is spent in the Lookup stage, no matter which config file is used.
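For comparison, a join over sorted, identically partitioned inputs proceeds as a merge: each partition consumes both sides sequentially without fully buffering either. A simplified Python sketch of an inner merge join (invented for illustration; assumes both inputs are sorted on the key with unique keys on each side):

```python
def merge_join(left, right):
    """Inner join of two inputs sorted on the join key.
    Each side is consumed sequentially; nothing is fully buffered."""
    li, ri, out = 0, 0, []
    while li < len(left) and ri < len(right):
        lk, lrow = left[li]
        rk, rrow = right[ri]
        if lk == rk:
            out.append((lk, lrow, rrow))
            li += 1          # assumes unique keys on both sides
            ri += 1
        elif lk < rk:
            li += 1          # advance the side with the smaller key
        else:
            ri += 1
    return out
```

Because each partition joins only its own slice of both inputs, the work genuinely splits across nodes, which is why a Join stage tends to show the 1-node vs 2-node difference more clearly than a Lookup against a large reference.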
Goutam Sahoo
mavrick21
Premium Member
Posts: 335
Joined: Sun Apr 23, 2006 11:25 pm

Post by mavrick21 »

@ Everyone,

Thanks for all your inputs.