sort key problem

Post questions here relative to DataStage Enterprise/PX Edition for such areas as Parallel job design, Parallel datasets, BuildOps, Wrappers, etc.

Moderators: chulett, rschirm, roy

bart12872
Participant
Posts: 82
Joined: Fri Jan 19, 2007 5:38 pm

sort key problem

Post by bart12872 »

Hi,

One of my jobs is causing me a lot of grief because of its sorts.

Let me explain:

The job's input has a huge number of rows, ordered by col1.
I then do an inner join and an aggregation on col1, col2, col3, col4.
For that I use a Sort stage (sort keys col1, col2, col3, col4), with col1 already sorted.

At this point the data is sorted by col1, col2, col3, col4.

After that, I need to sort by col2, col3, col4 only.

Is there a way to do this without breaking the data flow?
Or do I have to write all the data to a dataset and then sort it?

Thanks,
Martin.
mk_ds09
Participant
Posts: 72
Joined: Sun Jan 25, 2009 4:50 pm
Location: Pune

Post by mk_ds09 »

To improve the design of the job:

1. There is a Join stage in the job. If you have a huge amount of unsorted data, it is advisable to use a Lookup stage instead. (Of course, the other link, the one you are joining to, should have few enough rows to fit in physical memory, or performance will degrade again.)

2. Do not use a stable sort, which is much more expensive.

3. Use the restrict memory usage option in the Sort stage, which can improve performance.

You mentioned writing to a dataset and then sorting.
Are you currently using database stages?

-------
MK
shamshad
Premium Member
Posts: 147
Joined: Wed Aug 25, 2004 1:39 pm
Location: Detroit,MI

Post by shamshad »

Martin,

This might not be the answer you are looking for, but whenever we have to sort and rearrange huge amounts of data, we do it in a UNIX script rather than in the ETL tool.

UNIX does these operations quickly and efficiently, and we have never had any memory issues. The only catch is that you have to add a few extra steps to your sequence, such as calling the shell script from the master sequence.

After all, no ETL tool is built to handle every situation efficiently.
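
For readers wondering why an external sort copes with inputs larger than memory, here is a minimal Python sketch of a disk-backed merge sort, which is broadly the strategy command-line sort utilities rely on. It is an illustration only, not the script described above: the function names, the comma-delimited layout and the chunk size are all assumptions.

```python
import heapq
import os
import tempfile


def col234_key(line):
    """Sort key on col2, col3, col4 of a comma-delimited line (assumed layout).

    Fields are compared as strings; convert them if numeric order is needed.
    """
    parts = line.rstrip("\n").split(",")
    return tuple(parts[1:4])


def external_sort(lines, key, chunk_size=100_000):
    """Yield newline-terminated lines in key order using bounded memory."""
    run_paths = []
    chunk = []

    def flush():
        # Sort the in-memory chunk and spill it to disk as one sorted run.
        chunk.sort(key=key)
        fd, path = tempfile.mkstemp(suffix=".run", text=True)
        with os.fdopen(fd, "w") as handle:
            handle.writelines(chunk)
        run_paths.append(path)
        chunk.clear()

    for line in lines:
        chunk.append(line)
        if len(chunk) >= chunk_size:
            flush()
    if chunk:
        flush()

    # Merge the sorted runs lazily; only one line per run is held in memory.
    files = [open(path) for path in run_paths]
    try:
        yield from heapq.merge(*files, key=key)
    finally:
        for handle in files:
            handle.close()
        for path in run_paths:
            os.remove(path)


# Hypothetical rows: col1,col2,col3,col4
sample = ["1,b,2,9\n", "1,a,5,1\n", "1,a,3,7\n", "2,c,1,1\n", "2,a,4,2\n"]
resorted = list(external_sort(sample, key=col234_key, chunk_size=2))
```

The tiny `chunk_size` is only there to force several runs in the example; a real value would be sized to the available memory.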
Datawarehouse Consultant
bart12872
Participant
Posts: 82
Joined: Fri Jan 19, 2007 5:38 pm

Post by bart12872 »

mk_ds09 wrote:To improve the design of the job:

1. There is a Join stage in the job. If you have a huge amount of unsorted data, it is advisable to use a Lookup stage instead. (Of course, the other link, the one you are joining to, should have few enough rows to fit in physical memory, or performance will degrade again.)

2. Do not use a stable sort, which is much more expensive.

3. Use the restrict memory usage option in the Sort stage, which can improve performance.

You mentioned writing to a dataset and then sorting.
Are you currently using database stages?

-------
MK
Thanks for your response.
1. Well, I didn't describe the join key for my Join stage. The key is col1, col2. Since my inputs are sorted by col1, col2, the data flow is not broken.
2. I didn't use a stable sort. In fact, I have never come across a situation that needed one.
3. I must admit I have never considered this parameter. I always leave it at 20 MB, the default value. Can you tell me how you set it?

No, I don't use a database, except to extract data.
ray.wurlod
Participant
Posts: 54607
Joined: Wed Oct 23, 2002 10:52 pm
Location: Sydney, Australia

Post by ray.wurlod »

In the Sort stage, set the sort key mode for Col1 to "Don't Sort (Previously Sorted)" and sort normally by Col2, Col3 and Col4.
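
The reason this is cheaper than a full sort can be sketched in a few lines of Python. This is only a conceptual illustration, not DataStage code: because the rows already arrive ordered by Col1, the remaining keys only need to be sorted inside each run of equal Col1 values, so one group at a time is held in memory. The function name and the sample rows are hypothetical.

```python
from itertools import groupby
from operator import itemgetter


def sort_within_groups(rows, group_key, sub_key):
    """Yield rows sorted by sub_key inside each run of equal group_key.

    Assumes rows already arrive ordered by group_key, so only one group
    has to be buffered and sorted at a time.
    """
    for _, group in groupby(rows, key=group_key):
        yield from sorted(group, key=sub_key)


# Hypothetical rows (col1, col2, col3, col4), already ordered by col1.
rows = [
    (1, "b", 2, 9),
    (1, "a", 5, 1),
    (1, "a", 3, 7),
    (2, "c", 1, 1),
    (2, "a", 4, 2),
]

result = list(sort_within_groups(rows, itemgetter(0), itemgetter(1, 2, 3)))
```

The output is ordered by col1, col2, col3, col4, the same as a full four-key sort would give, but each sort only touches the rows of a single col1 group.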
IBM Software Services Group
Any contribution to this forum is my own opinion and does not necessarily reflect any position that IBM may hold.