Effective partition type for sorted input in Transformer


apraman
Participant
Posts: 47
Joined: Mon Sep 12, 2005 5:26 am

Post by apraman »

I have a simple job

CFF (EBCDIC) stage -----> Transformer -------> Sequential (ASCII)

The number of readers per node on the CFF stage is set to 1.
I'm using a Transformer because some conversions are needed, and at the same time I would like to sort the input.
To sort the input link, I need to choose a partition type.
I have checked the option to preserve the sort order on the output link.

Which partitioning technique will be most effective?

Thanks in advance
Last edited by apraman on Mon Oct 17, 2005 10:21 pm, edited 2 times in total.
Arun
ArndW
Participant
Posts: 16318
Joined: Tue Nov 16, 2004 9:08 am
Location: Germany

Post by ArndW »

Instead of a direct response, let me ask a question that has a direct bearing: what is the partitioning method of a sequential output file?
kumar_s
Charter Member
Posts: 5245
Joined: Thu Jun 16, 2005 11:00 pm

Post by kumar_s »

Hi,
Where have you planned to introduce the sort stage??

regards
kumar
apraman
Participant
Posts: 47
Joined: Mon Sep 12, 2005 5:26 am

Post by apraman »

Kumar,

I will not use any extra sort stage.

Why should I use a Sort stage if I can sort the input within the Transformer and preserve the same sort order on the Transformer's output?

Can you give me any valid reason for including a Sort stage in the posted scenario?
Arun
apraman
Participant
Posts: 47
Joined: Mon Sep 12, 2005 5:26 am

Post by apraman »

Hi Arnd,
ArndW wrote: What is the partitioning method of a sequential output file?
I am not sure how to answer your question..... :oops:

I think it all depends on the 'Number of Readers per Node' / 'Read from Multiple Nodes' option that you set.

case 1: 1 reader/node
CFF/SEQ File stage ------------------------> Entire Partitioning Type

case 2: Multiple reader/node or Multiple nodes
CFF/SEQ File stage -----------------------> Range Partitioning Type

Please correct me if I am wrong .......... :D

But how will that help me? If I select any valid partitioning type on the INPUT link of the TRANSFORMER, it will repartition the data coming from the OUTPUT of the CFF stage.
Arun
ray.wurlod
Participant
Posts: 54607
Joined: Wed Oct 23, 2002 10:52 pm
Location: Sydney, Australia

Post by ray.wurlod »

You have answered a different question than the one asked.

When you are writing to a sequential file, what partitioning method is mandated?
IBM Software Services Group
Any contribution to this forum is my own opinion and does not necessarily reflect any position that IBM may hold.
apraman
Participant
Posts: 47
Joined: Mon Sep 12, 2005 5:26 am

Post by apraman »

Hi Ray,
ray.wurlod wrote:You have answered a different question than the one asked.

When you are writing to a sequential file, what partitioning method is mandated?
:? :? :? :?

I do not know of any use for partitioning (for sequential data transfer)
when writing to a Sequential File stage, and there should not be any.
Arun
ArndW
Participant
Posts: 16318
Joined: Tue Nov 16, 2004 9:08 am
Location: Germany

Post by ArndW »

Apraman,

what Ray and I are subtly trying to point out to you is that a sequential file can only be written to by one process; so the partitioning information you are asking for is irrelevant.
apraman
Participant
Posts: 47
Joined: Mon Sep 12, 2005 5:26 am

Post by apraman »

ArndW wrote:Apraman,

what Ray and I are subtly trying to point out to you is that a sequential file can only be written to by one process; so the partitioning information you are asking for is irrelevant.
Thanks,

But I need to use a partition type for sorting the input of the Transformer.
For sorting any data, 'HASH' partitioning is the best.

DS Gurus, please correct me if it is wrong to use HASH partitioning in this context.
Arun
ray.wurlod
Participant
Posts: 54607
Joined: Wed Oct 23, 2002 10:52 pm
Location: Sydney, Australia

Post by ray.wurlod »

It's wrong. You're wrong. (You did request that.)

It's illegal to have more than one process write to the same sequential file.

This is not a DataStage rule; it's an operating system rule.

So what you want doesn't come into it - it's not possible.
apraman
Participant
Posts: 47
Joined: Mon Sep 12, 2005 5:26 am

Post by apraman »

ray.wurlod wrote:It's wrong. You're wrong. (You did request that.)

It's illegal to have more than one process write to the same sequential file.
:evil: :evil: :evil: :evil:

I am getting your points, but I think I am unable to make you understand mine.

I have a single CFF stage with an EBCDIC file as the source.
I have a single Sequential File stage with an ASCII file as the target.

I need to do certain conversions, hence the Transformer between them, and through it I am also sorting the Transformer's input, which is retrieved from the CFF stage.

<pre>
sorted input
CFF (EBCDIC) stage -----> Transformer -------> Sequential (ASCII)
</pre>

To sort the data on the input link of the Transformer stage, DataStage requires you to select a partition type. I prefer HASH.

Now tell me what is wrong here?
8) 8) 8) 8)
Arun
ArndW
Participant
Posts: 16318
Joined: Tue Nov 16, 2004 9:08 am
Location: Germany

Post by ArndW »

The partitioning algorithm, in the sense we are talking about here, is how the individual records are distributed across the PX node processes. So if you have 4 nodes and choose round-robin, each process gets a row in turn; if you choose hash on a field, then a number is generated from that field using a hashing algorithm and mod(nodes) of that number is used to distribute the rows.
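The two methods described above can be sketched roughly as follows. This is only an illustration in Python, not DataStage code: the function names are made up for the example, and CRC32 stands in for whatever hashing algorithm the engine really uses, which the thread does not specify.

```python
import zlib

def round_robin_partition(rows, nodes):
    """Deal rows out to the node partitions one at a time, in turn."""
    parts = [[] for _ in range(nodes)]
    for i, row in enumerate(rows):
        parts[i % nodes].append(row)
    return parts

def hash_partition(rows, key, nodes):
    """Send each row to partition hash(key value) mod nodes.

    CRC32 is an assumption used for illustration only; it is not
    claimed to be the hash function DataStage uses.
    """
    parts = [[] for _ in range(nodes)]
    for row in rows:
        h = zlib.crc32(str(row[key]).encode("utf-8"))
        parts[h % nodes].append(row)
    return parts
```

With either function, rows carrying the same key value only end up together in the hash case; round-robin spreads them evenly but ignores the key.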

What Ray and I are saying is that your partitioning mode of choice is absolutely irrelevant since the file is sequential {variable length, since fixed length would make parallel reads possible} and thus you only have 1 process. No matter what method you choose your result will be the same.

Instead of going back and forth in this thread, why don't you just try it with different methods and if the result is different from what we've told you to expect then we can progress from there.
thompsonp
Premium Member
Posts: 205
Joined: Tue Mar 01, 2005 8:41 am

Post by thompsonp »

Apraman,

I may have misunderstood your question, but are you saying that the logic in your transformer requires the input data to be sorted?

If this is the case you can either place a sort before the transformer or use the partitioning / sorting on the input tab of the transformer.
In either case your choice of partitioning of the input data will depend on the logic in the transformer and the data. Presumably there are one or more fields in the input which you have to sort by and therefore records with identical values of this sort key should be processed on the same node. If this is the case you can hash partition on one or more of these keys. Do check that the data partitions evenly across the nodes rather than being skewed with many records on one node and not many on the others.
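A quick way to reason about the skew check mentioned above is to count how many rows each partition would receive for a sample of key values. The sketch below is plain Python, not DataStage; `partition_counts` and `skew_ratio` are hypothetical helper names, and CRC32 is again only a stand-in for the real hash function.

```python
import zlib
from collections import Counter

def partition_counts(keys, nodes):
    """Count how many key values land on each of `nodes` partitions."""
    counts = Counter(
        zlib.crc32(str(k).encode("utf-8")) % nodes for k in keys
    )
    return [counts.get(n, 0) for n in range(nodes)]

def skew_ratio(counts):
    """Largest partition divided by the mean partition size.

    1.0 means perfectly even; a value near the node count means
    almost everything sits on one node.
    """
    mean = sum(counts) / len(counts)
    return max(counts) / mean if mean else 0.0
```

If most rows share one key value, they all hash to the same partition, so the ratio climbs toward the number of nodes however good the hash function is.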

As you are then writing to a sequential file this is a single process as has been stated earlier in the thread. Therefore the data from the transformer running on several nodes will need to be collected on a single node and you must also decide if it needs to be sorted as well.
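The collection step can be pictured as a merge of per-node streams that are each already sorted. The following Python sketch assumes every partition is sorted on the same key; `sort_merge_collect` is a name chosen for this example and is not a DataStage API.

```python
import heapq
from operator import itemgetter

def sort_merge_collect(partitions, key):
    """Merge partitions that are each sorted on `key` into a single
    globally sorted list, as one writer process would receive them."""
    return list(heapq.merge(*partitions, key=itemgetter(key)))
```

Simply appending one partition after another would keep each partition's internal order but not the overall order, which is why the choice of collection method matters when the file has to be sorted.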
kumar_s
Charter Member
Posts: 5245
Joined: Thu Jun 16, 2005 11:00 pm

Post by kumar_s »

Hi,
Option 1:
Make the Transformer operate in sequential mode, since both the input and the output are sequential files.

Option 2:
Insert a Sort stage in front of the Transformer, partition and sort the data at that stage, and perform the logic in parallel (which does not happen in the previous case). The output then gets written sequentially.

Won't option 2 be better? All the data will wait at the Sort stage until it is fully sorted; once it is sorted, it will be processed in parallel. :roll:
Or, as Ray said, won't the parallel option of the Transformer work for the sequential output?

regards
kumar
ray.wurlod
Participant
Posts: 54607
Joined: Wed Oct 23, 2002 10:52 pm
Location: Sydney, Australia

Post by ray.wurlod »

Partitioning is completely irrelevant for output links.

Partitioning (and, perhaps, sorting) is relevant on input links. If you're writing to a sequential file, it must run in sequential mode.

If you don't need to sort, don't. You don't need to sort to write rows into a sequential file, unless the consumer of that file requires it to be sorted.

If you do need to sort, it doesn't matter whether you use a Sort stage or sorting as a property of the input link; it will block rows. It must block rows. Think about it.
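To see why a sort has to block rows, consider this small Python sketch (an illustration only, with made-up names): the sort cannot emit its first row until it has read the last input row, because that last row might be the smallest.

```python
def counting_source(rows, consumed):
    """Yield rows one by one, recording each row as it is read."""
    for row in rows:
        consumed.append(row)
        yield row

def blocking_sort(rows, key=None):
    """Sort a stream: every input row is buffered before any output."""
    buffered = list(rows)      # reads the whole input first
    buffered.sort(key=key)
    yield from buffered

consumed = []
stream = blocking_sort(counting_source([3, 1, 2], consumed))
first = next(stream)           # ask for just one sorted row
rows_read_before_first_output = len(consumed)
```

Asking for a single output row already forces all three input rows to be read, and the same holds whether the sort sits in its own stage or on an input link.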

The Sort stage gives a little more flexibility and control over consumption of memory than is available for input link sorting.