join stage for huge data

Post questions here relative to DataStage Enterprise/PX Edition for such areas as Parallel job design, Parallel datasets, BuildOps, Wrappers, etc.

Moderators: chulett, rschirm, roy

Post Reply
wuruima
Participant
Posts: 65
Joined: Mon Nov 04, 2013 10:15 pm

join stage for huge data

Post by wuruima »

In business requirement, we need to join 2 file by 2 keys: key_a and key_b,
each file contains more than 10,000,000 records.

is there any good idea to tune the performance ?
wuruimao
priyadarshikunal
Premium Member
Premium Member
Posts: 1735
Joined: Thu Mar 01, 2007 5:44 am
Location: Troy, MI

Post by priyadarshikunal »

hash partition, sort and join. Do you have any question in particular regarding join? 10 million is not huge IMO.
Priyadarshi Kunal

Genius may have its limitations, but stupidity is not thus handicapped. :wink:
ArndW
Participant
Posts: 16318
Joined: Tue Nov 16, 2004 9:08 am
Location: Germany
Contact:

Post by ArndW »

The join stage is a small and efficient stage that is very, very fast. What can take time when processing large amounts of data is the sorting and partitioning that needs to take place in order for the join to do its job.

If possible, do your sorting where it is quickest - that can be in the initial data select or within DataStage.
wuruima
Participant
Posts: 65
Joined: Mon Nov 04, 2013 10:15 pm

Post by wuruima »

Yes, it takes long time to sort the input links. So is there a solution to tune the sorting
wuruimao
ArndW
Participant
Posts: 16318
Joined: Tue Nov 16, 2004 9:08 am
Location: Germany
Contact:

Post by ArndW »

If your database is on another server and has spare capacity, then sort on your SELECT; otherwise use the sort stage and look into the settings you can specify on that stage.
chulett
Charter Member
Charter Member
Posts: 43085
Joined: Tue Nov 12, 2002 4:34 pm
Location: Denver, CO

Post by chulett »

If sorting large files is something you need to do a lot and the DataStage sort doesn't seem fast enough for you, you could look into leveraging a 3rd party "high speed" sort package like SyncSort or CoSORT. From what I recall the latter has a DataStage module.
-craig

"You can never have too many knives" -- Logan Nine Fingers
rkashyap
Premium Member
Premium Member
Posts: 532
Joined: Fri Dec 02, 2011 12:02 pm
Location: Richmond VA

Post by rkashyap »

If you decide to sort in the database, then to prevent DataStage from adding tsort operator ... either add sort stage with the option 'Don't sort if previously sorted' or add environment variable APT_SORT_INSERTION_CHECK_ONLY(and set it to True) in the job.
ArndW
Participant
Posts: 16318
Joined: Tue Nov 16, 2004 9:08 am
Location: Germany
Contact:

Post by ArndW »

Adding the environment variable APT_SORT_INSERTION_CHECK_ONLY is only one of the options available, the preferred method is indeed to add what I call a "dummy" sort stage which adds the appropriate "don't sort" option. One still needs to make sure that the data is correctly partitioned when running in parallel.
Post Reply