Which one is best: Unix join or DataStage Join?

Post by sachin1 »

Hi Team,

I have two input files on the Unix file system that need to be joined. Should I use the Unix join command, or should I use a DataStage Join stage with these files as input?

Please assist.

Thanks,
Sachin.

Post by chulett »

Seems to me, the typical answer in cases like this is "depends". In your shoes, if I really wanted to answer that question, I would try both. Compare and contrast with your data on your systems, then decide which one to stick with.
-craig

"You can never have too many knives" -- Logan Nine Fingers

Post by ArndW »

The join operation in both UNIX and in DataStage is a very simple one which takes two inputs sorted on the join key and does a Group-Change comparison on them.
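
To make that concrete, here is a minimal Python sketch of the idea: walk two key-sorted inputs in step and emit output whenever the key groups match. It is only an illustration of the general technique, not the actual code of UNIX join or of the DataStage Join stage; the function name merge_join and the sample data are made up for the example.

```python
def merge_join(left, right):
    """Inner-join two lists of (key, value) pairs that are already sorted by key."""
    out = []
    i = j = 0
    while i < len(left) and j < len(right):
        lkey, rkey = left[i][0], right[j][0]
        if lkey < rkey:
            i += 1          # left key has no partner yet, advance left
        elif lkey > rkey:
            j += 1          # right key has no partner yet, advance right
        else:
            # Keys match: find where the key changes on each side
            # (the "group change"), then pair up every row in both groups.
            i_end = i
            while i_end < len(left) and left[i_end][0] == lkey:
                i_end += 1
            j_end = j
            while j_end < len(right) and right[j_end][0] == lkey:
                j_end += 1
            for _, lval in left[i:i_end]:
                for _, rval in right[j:j_end]:
                    out.append((lkey, lval, rval))
            i, j = i_end, j_end
    return out


# Hypothetical sample data, both sorted on the first field.
customers = [("1", "alice"), ("2", "bob"), ("4", "dave")]
orders = [("1", "book"), ("1", "pen"), ("3", "lamp"), ("4", "desk")]
joined = merge_join(customers, orders)
print(joined)
```

Because each input is read once from front to back, the cost is linear in the input size, but only if both sides really are sorted on the key. Feed it unsorted data and it quietly drops matches instead of raising an error.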

The UNIX join requires both inputs to be sorted on the join key. If your data is not sorted, then you would need to sort it first.

If you were to read those files into DataStage you could sort there, which may make a difference on big files when using a parallel configuration with several nodes.

If the files are already sorted, then I'd use an External Source stage which calls the UNIX join and outputs straight to DataStage; that way you wouldn't need to write the join result to disk and then read it back in DataStage.
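
The same "no intermediate file" idea can be sketched outside DataStage. The Python below is only a stand-in for what the stage would do: it starts the UNIX join as a child process and reads the joined rows straight from its standard output. The function name stream_unix_join, the comma delimiter, and the sample files are assumptions for the example, and it presumes a join command is on the PATH.

```python
import os
import subprocess
import tempfile


def stream_unix_join(left_path, right_path, sep=","):
    """Yield joined rows from the UNIX join command, one list of fields per line."""
    # LC_ALL=C makes join expect plain byte-order sorting on the key.
    env = dict(os.environ, LC_ALL="C")
    proc = subprocess.Popen(
        ["join", "-t", sep, left_path, right_path],
        stdout=subprocess.PIPE,
        text=True,
        env=env,
    )
    with proc.stdout:
        for line in proc.stdout:
            # Rows are handed to the caller as they arrive; the join
            # result itself is never written to disk.
            yield line.rstrip("\n").split(sep)
    if proc.wait() != 0:
        raise RuntimeError("join exited with status %d" % proc.returncode)


# Hypothetical sample files, already sorted on the first field.
with tempfile.TemporaryDirectory() as tmp:
    left_path = os.path.join(tmp, "customers.csv")
    right_path = os.path.join(tmp, "orders.csv")
    with open(left_path, "w") as f:
        f.write("1,alice\n2,bob\n4,dave\n")
    with open(right_path, "w") as f:
        f.write("1,book\n3,lamp\n4,desk\n")
    rows = list(stream_unix_join(left_path, right_path))

print(rows)
```

In a real job the command line would go into the stage's source program property instead, and the downstream stages would consume the rows as they arrive.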