Hi all,
I know this may be a primitive question and i think i know the answer.
Scenario: Joining data between a DataSet and a table that exists in the database.
In this scenario if i sort the data in the databse using order by, do i need to still sort the data in DataStage? As i far i understand it is a "NO" meaning sorting in DB is good enough. Sort the data in DB and specify the input partitioning as AUTO
I guess this is prettu much a stanadrd to harness the DB sorting power. Let me know if any of you think otherwise?
Is Sorting in DataStage Required when Selecting from DB
Moderators: chulett, rschirm, roy
Yes... you can harness the power of DB sorting in your DS job... HOWEVER... you'll need to consider...
#1. If you have to join to sources then you need to have your data key partitioned, and a re-partitioning can destroy your sort order.
#2. Database collation vs. DS project collation. The sorting order on the database might be different than from the one in DS
#1. If you have to join to sources then you need to have your data key partitioned, and a re-partitioning can destroy your sort order.
#2. Database collation vs. DS project collation. The sorting order on the database might be different than from the one in DS
_________________
- Susanne
- Susanne
Assumption is that the dataset is already sorted? Or else you need to sort it too.
Using ORDER BY may be better if your source database sorts in parallel and has more processing power than your DataStage server, otherwise consider DataStage Sort stage and avoiding ORDER BY.
Using ORDER BY may be better if your source database sorts in parallel and has more processing power than your DataStage server, otherwise consider DataStage Sort stage and avoiding ORDER BY.
Choose a job you love, and you will never have to work a day in your life. - Confucius
-
- Participant
- Posts: 3593
- Joined: Thu Jan 23, 2003 5:25 pm
- Location: Australia, Melbourne
- Contact:
There are two things to consider. 1) Where should sorting occur? 2) How do you avoid sorting it twice? The problem is that even if you put a SORT BY statement into your job DataStage does not know it is there and may try to sort the data again.
If the database table has a good key and you can sort using that key and the performance of the SQL statement with SORT BY in it is fast then a database is a good place to do a sort. If the sort places too much load on the databases or it makes the SQL select statement too slow then you can move the sort into DataStage.
DataStage will automatically sort data if you pass data through a stage that needs sorting such as Join or Remove Duplicate. It inserts "tsort" operators automatically into the compiled job. You can see these operators if you turn on additional job reporting the DataStage environment variables to see a score dump. This can mean sorting the data twice, once in the database and once in the job.
There are three ways to remove automatic sort insertion from your job.
1) Project level: in DataStage Administrator set the variable APT_NO_SORT_INSERTION to TRUE and this will stop sorts being added automatically to all jobs in a project. This is a dangerous thing to do as some of those jobs may need sort insertions.
2) Job level: add APT_NO_SORT_INSERTION as a local job property and set it to TRUE to avoid any sorting inside that job. This can also be dangerous if the developer has sorted the input data but forgotten to sort reference data.
3) Link level: You can add a Sort stage into your DataStage job after the database input stage and set it to the value of "Don't Sort Already Sorted". This may seem silly but it is the way to tell DataStage (and developers) that this particular stream of data is already sorted.
If the database table has a good key and you can sort using that key and the performance of the SQL statement with SORT BY in it is fast then a database is a good place to do a sort. If the sort places too much load on the databases or it makes the SQL select statement too slow then you can move the sort into DataStage.
DataStage will automatically sort data if you pass data through a stage that needs sorting such as Join or Remove Duplicate. It inserts "tsort" operators automatically into the compiled job. You can see these operators if you turn on additional job reporting the DataStage environment variables to see a score dump. This can mean sorting the data twice, once in the database and once in the job.
There are three ways to remove automatic sort insertion from your job.
1) Project level: in DataStage Administrator set the variable APT_NO_SORT_INSERTION to TRUE and this will stop sorts being added automatically to all jobs in a project. This is a dangerous thing to do as some of those jobs may need sort insertions.
2) Job level: add APT_NO_SORT_INSERTION as a local job property and set it to TRUE to avoid any sorting inside that job. This can also be dangerous if the developer has sorted the input data but forgotten to sort reference data.
3) Link level: You can add a Sort stage into your DataStage job after the database input stage and set it to the value of "Don't Sort Already Sorted". This may seem silly but it is the way to tell DataStage (and developers) that this particular stream of data is already sorted.
Certus Solutions
Blog: Tooling Around in the InfoSphere
Twitter: @vmcburney
LinkedIn:Vincent McBurney LinkedIn
Blog: Tooling Around in the InfoSphere
Twitter: @vmcburney
LinkedIn:Vincent McBurney LinkedIn
-
- Participant
- Posts: 54607
- Joined: Wed Oct 23, 2002 10:52 pm
- Location: Sydney, Australia
- Contact:
One more thing.
A sort supported by a B-tree index will outperform anything that DataStage can do, because a B-tree index is already stored in sorted order and simply has to be scanned.
A sort supported by a B-tree index will outperform anything that DataStage can do, because a B-tree index is already stored in sorted order and simply has to be scanned.
IBM Software Services Group
Any contribution to this forum is my own opinion and does not necessarily reflect any position that IBM may hold.
Any contribution to this forum is my own opinion and does not necessarily reflect any position that IBM may hold.