Is Sorting in DataStage Required when Selecting from DB

Post questions here relative to DataStage Enterprise/PX Edition for such areas as Parallel job design, Parallel datasets, BuildOps, Wrappers, etc.

Moderators: chulett, rschirm, roy

Post Reply
saikir
Participant
Posts: 92
Joined: Wed Nov 08, 2006 12:25 am
Location: Minneapolis
Contact:

Is Sorting in DataStage Required when Selecting from DB

Post by saikir »

Hi all,

I know this may be a primitive question and i think i know the answer.

Scenario: Joining data between a DataSet and a table that exists in the database.

In this scenario if i sort the data in the databse using order by, do i need to still sort the data in DataStage? As i far i understand it is a "NO" meaning sorting in DB is good enough. Sort the data in DB and specify the input partitioning as AUTO

I guess this is prettu much a stanadrd to harness the DB sorting power. Let me know if any of you think otherwise?
arvind_ds
Participant
Posts: 428
Joined: Thu Aug 16, 2007 11:38 pm
Location: Manali

Post by arvind_ds »

Order by is enough, no need to sort it again.
Arvind
suse_dk
Participant
Posts: 93
Joined: Thu Aug 11, 2011 6:18 am
Location: Denmark

Post by suse_dk »

Yes... you can harness the power of DB sorting in your DS job... HOWEVER... you'll need to consider...

#1. If you have to join to sources then you need to have your data key partitioned, and a re-partitioning can destroy your sort order.

#2. Database collation vs. DS project collation. The sorting order on the database might be different than from the one in DS
_________________
- Susanne
qt_ky
Premium Member
Premium Member
Posts: 2895
Joined: Wed Aug 03, 2011 6:16 am
Location: USA

Post by qt_ky »

Assumption is that the dataset is already sorted? Or else you need to sort it too.

Using ORDER BY may be better if your source database sorts in parallel and has more processing power than your DataStage server, otherwise consider DataStage Sort stage and avoiding ORDER BY.
Choose a job you love, and you will never have to work a day in your life. - Confucius
vmcburney
Participant
Posts: 3593
Joined: Thu Jan 23, 2003 5:25 pm
Location: Australia, Melbourne
Contact:

Post by vmcburney »

There are two things to consider. 1) Where should sorting occur? 2) How do you avoid sorting it twice? The problem is that even if you put a SORT BY statement into your job DataStage does not know it is there and may try to sort the data again.

If the database table has a good key and you can sort using that key and the performance of the SQL statement with SORT BY in it is fast then a database is a good place to do a sort. If the sort places too much load on the databases or it makes the SQL select statement too slow then you can move the sort into DataStage.

DataStage will automatically sort data if you pass data through a stage that needs sorting such as Join or Remove Duplicate. It inserts "tsort" operators automatically into the compiled job. You can see these operators if you turn on additional job reporting the DataStage environment variables to see a score dump. This can mean sorting the data twice, once in the database and once in the job.

There are three ways to remove automatic sort insertion from your job.
1) Project level: in DataStage Administrator set the variable APT_NO_SORT_INSERTION to TRUE and this will stop sorts being added automatically to all jobs in a project. This is a dangerous thing to do as some of those jobs may need sort insertions.
2) Job level: add APT_NO_SORT_INSERTION as a local job property and set it to TRUE to avoid any sorting inside that job. This can also be dangerous if the developer has sorted the input data but forgotten to sort reference data.
3) Link level: You can add a Sort stage into your DataStage job after the database input stage and set it to the value of "Don't Sort Already Sorted". This may seem silly but it is the way to tell DataStage (and developers) that this particular stream of data is already sorted.
SURA
Premium Member
Premium Member
Posts: 1229
Joined: Sat Jul 14, 2007 5:16 am
Location: Sydney

Post by SURA »

Perfect answer.

Even thoug sort is happened outside the datastage, still need to inform DS about how it should behave.

DS User
ray.wurlod
Participant
Posts: 54607
Joined: Wed Oct 23, 2002 10:52 pm
Location: Sydney, Australia
Contact:

Post by ray.wurlod »

One more thing.

A sort supported by a B-tree index will outperform anything that DataStage can do, because a B-tree index is already stored in sorted order and simply has to be scanned.
IBM Software Services Group
Any contribution to this forum is my own opinion and does not necessarily reflect any position that IBM may hold.
saikir
Participant
Posts: 92
Joined: Wed Nov 08, 2006 12:25 am
Location: Minneapolis
Contact:

Post by saikir »

Thanks to Vincent, Ray and all others. I am glad on all the answers and far more assured about the way foward. Thanks again

This has been a long standing discussion of mine(with my Colleagues) about understanding and equally leveraging the DB and DataStage
Post Reply