Is Sorting in DataStage Required when Selecting from DB

saikir · Post by **saikir** » Wed Sep 21, 2011 7:52 am

Hi all,

I know this may be a primitive question and i think i know the answer.

Scenario: Joining data between a DataSet and a table that exists in the database.

In this scenario if i sort the data in the databse using order by, do i need to still sort the data in DataStage? As i far i understand it is a "NO" meaning sorting in DB is good enough. Sort the data in DB and specify the input partitioning as AUTO

I guess this is prettu much a stanadrd to harness the DB sorting power. Let me know if any of you think otherwise?

arvind_ds · Post by **arvind_ds** » Wed Sep 21, 2011 8:15 am

Order by is enough, no need to sort it again.

suse_dk · Post by **suse_dk** » Wed Sep 21, 2011 8:18 am

Yes... you can harness the power of DB sorting in your DS job... HOWEVER... you'll need to consider...

#1. If you have to join to sources then you need to have your data key partitioned, and a re-partitioning can destroy your sort order.

#2. Database collation vs. DS project collation. The sorting order on the database might be different than from the one in DS

qt_ky · Post by **qt_ky** » Wed Sep 21, 2011 10:23 pm

Assumption is that the dataset is already sorted? Or else you need to sort it too.

Using ORDER BY may be better if your source database sorts in parallel and has more processing power than your DataStage server, otherwise consider DataStage Sort stage and avoiding ORDER BY.

vmcburney · Post by **vmcburney** » Wed Sep 21, 2011 11:11 pm

There are two things to consider. 1) Where should sorting occur? 2) How do you avoid sorting it twice? The problem is that even if you put a SORT BY statement into your job DataStage does not know it is there and may try to sort the data again.

If the database table has a good key and you can sort using that key and the performance of the SQL statement with SORT BY in it is fast then a database is a good place to do a sort. If the sort places too much load on the databases or it makes the SQL select statement too slow then you can move the sort into DataStage.

DataStage will automatically sort data if you pass data through a stage that needs sorting such as Join or Remove Duplicate. It inserts "tsort" operators automatically into the compiled job. You can see these operators if you turn on additional job reporting the DataStage environment variables to see a score dump. This can mean sorting the data twice, once in the database and once in the job.

There are three ways to remove automatic sort insertion from your job.
1) Project level: in DataStage Administrator set the variable APT_NO_SORT_INSERTION to TRUE and this will stop sorts being added automatically to all jobs in a project. This is a dangerous thing to do as some of those jobs may need sort insertions.
2) Job level: add APT_NO_SORT_INSERTION as a local job property and set it to TRUE to avoid any sorting inside that job. This can also be dangerous if the developer has sorted the input data but forgotten to sort reference data.
3) Link level: You can add a Sort stage into your DataStage job after the database input stage and set it to the value of "Don't Sort Already Sorted". This may seem silly but it is the way to tell DataStage (and developers) that this particular stream of data is already sorted.

SURA · Post by **SURA** » Wed Sep 21, 2011 11:25 pm

Perfect answer.

Even thoug sort is happened outside the datastage, still need to inform DS about how it should behave.

DS User

ray.wurlod · Post by **ray.wurlod** » Thu Sep 22, 2011 12:57 am

One more thing.

A sort supported by a B-tree index will outperform anything that DataStage can do, because a B-tree index is already stored in sorted order and simply has to be scanned.

saikir · Post by **saikir** » Thu Sep 22, 2011 6:43 am

Thanks to Vincent, Ray and all others. I am glad on all the answers and far more assured about the way foward. Thanks again

This has been a long standing discussion of mine(with my Colleagues) about understanding and equally leveraging the DB and DataStage