join stage for huge data

wuruima · Post by **wuruima** » Mon Jun 29, 2015 12:32 am

In business requirement, we need to join 2 file by 2 keys: key_a and key_b,
each file contains more than 10,000,000 records.

is there any good idea to tune the performance ?

priyadarshikunal · Post by **priyadarshikunal** » Mon Jun 29, 2015 1:31 am

hash partition, sort and join. Do you have any question in particular regarding join? 10 million is not huge IMO.

ArndW · Post by **ArndW** » Mon Jun 29, 2015 1:54 am

The join stage is a small and efficient stage that is very, very fast. What can take time when processing large amounts of data is the sorting and partitioning that needs to take place in order for the join to do its job.

If possible, do your sorting where it is quickest - that can be in the initial data select or within DataStage.

wuruima · Post by **wuruima** » Mon Jun 29, 2015 8:18 pm

Yes, it takes long time to sort the input links. So is there a solution to tune the sorting

ArndW · Post by **ArndW** » Tue Jun 30, 2015 1:40 am

If your database is on another server and has spare capacity, then sort on your SELECT; otherwise use the sort stage and look into the settings you can specify on that stage.

chulett · Post by **chulett** » Tue Jun 30, 2015 6:43 am

If sorting large files is something you need to do a lot and the DataStage sort doesn't seem fast enough for you, you could look into leveraging a 3rd party "high speed" sort package like SyncSort or CoSORT. From what I recall the latter has a DataStage module.

rkashyap · Post by **rkashyap** » Tue Jun 30, 2015 10:51 am

If you decide to sort in the database, then to prevent DataStage from adding tsort operator ... either add sort stage with the option 'Don't sort if previously sorted' or add environment variable APT_SORT_INSERTION_CHECK_ONLY(and set it to True) in the job.

ArndW · Post by **ArndW** » Wed Jul 01, 2015 2:06 am

Adding the environment variable APT_SORT_INSERTION_CHECK_ONLY is only one of the options available, the preferred method is indeed to add what I call a "dummy" sort stage which adds the appropriate "don't sort" option. One still needs to make sure that the data is correctly partitioned when running in parallel.