Job design for n-way outer join??

andrewn · Post by **andrewn** » Thu Sep 27, 2007 5:13 am

Hi there DSXCHANGE :D ,

Can anyone offer any advice on a job design to perform a n-way outer join?

I have a job to build a datamart FACT table using various other tables. It does an 8-way outer join like this:

Code: Select all

  
DB2 UDB -->
               -> JOIN  -->XFORM    --> 
DB2 UDB -->                            > JOIN   --> XFORM  -- >
                            DB2 UDB -->                       > JOIN -->
                                                     DB2UDB -->
                                                                                                 etc...

Each JOIN is an outer join and every Transformer is fixing the null keys from the join output: if isnotnull(leftkey) then leftkey else rightkey. There are three keys.

The net result of the job is about 3 million keys, but there are far fewer keys coming from the source tables on the input to some of the joins.

I'm wondering if there is any way round having all the transformers because they add to the CPU cost of running the job?

The job is being run on a single node configuration, DataStage is combining the joins and transformers into a single process, but using APT_DISABLE_COMBINATION I can see that the transformers take almost as much elapsed time and CPU as the joins.

Maveric · Post by **Maveric** » Thu Sep 27, 2007 5:22 am

"if isnotnull(leftkey) then leftkey else rightkey". how would this work after the join stage? Your output data will have only one key filed. So you dont get any leftkey and rightkey field values. Am i missing something here?

andrewn · Post by **andrewn** » Thu Sep 27, 2007 5:45 am

Sorry, I meant "leftkey" as the key from the left side of the join and "rightkey" as the key from the right side of the join. Left and right being the two inputs to the join.

The job is using common keys in all joins so I don't want any rows with null keys at the start of the next join.

Assuming there is just one key field I thought I had to output input0.keyA AND input1.keyA from the join and then because one could be null use a transformer to pick a none null one which is output to the keyA column? The transformer output has less columns than on input.

Maveric · Post by **Maveric** » Thu Sep 27, 2007 6:24 am

Still not sure about your leftkey, rightkey logic. Since it is a full outer join, If there is a match then the both the values will be same. If one is null then there wont be a corresponding record from the second stream. Say Leftkey is null, no corresponding value from right stream, so rightkey value will be null. At the end of the transformer the key will still be null. What i am trying to say is that i don't see the necessity for the logic. Hence all those transformers.

andrewn · Post by **andrewn** » Thu Sep 27, 2007 7:54 am

I don't think I've explained myself very well

.

...Since it is a full outer join, If there is a match then the both the values will be same. If one is null then there wont be a corresponding record from the second stream. Say Leftkey is null, no corresponding value from right stream, so rightkey value will be null...

I thought if you had two inputs, with fields KEY1, FACT1 on the first and KEY1, FACT2 on the second, then the output of the JOIN stage using KEY1 as the key and set to "outer join" is a schema like this:

leftRec_KEY1, FACT1, rightRec_KEY1, FACT2.

So I can have a row with null leftRec_KEY1 and null FACT1 but populated rightRec_KEY1 and FACT2 fields. Which would happen when there is a row on the second inupt with no match on the first input ( I wouldn't get this row when using a "left outer" join). Isn't that how it works?

So how do I join this with a third input with fields KEY1,FACT3 ?

Don't I need to get it back to to a schema KEY1, FACT1, FACT2 - so that I can again join using KEY1 with my third input.

The transformer needs to work out if it should use leftRec_KEY1 or rightRec_KEY1 when creating the KEY1,FACT1,FACT2 schema.

But I was wondering if there is an alternative somehow, mabe not.

Maveric · Post by **Maveric** » Thu Sep 27, 2007 8:11 am

Got it. Doubt there is any better way of doing it. Nothing that i can think of. You will have to wait for the gurus to answer.

ray.wurlod · Post by **ray.wurlod** » Thu Sep 27, 2007 4:33 pm

The Join stage does support more than two inputs. However, internally it still produces pairwise joins (as would a database server) joining the next-rightmost input to the intermediate result of joining the previous pair.

I think your design makes it clearer what's going on, and will probably be easier on memory/scratch space simply by virtue of the fact that the pairwise joins are spread out in time.

andrewn · Post by **andrewn** » Fri Sep 28, 2007 2:45 am

OK thanks. I didn't think of using more than two inputs into a join, I might play with that and see how the job behaves. Good to know there isn't a magic bullet hidden away though!