Effective partition type for sorted input in Transformer

apraman · Post by **apraman** » Thu Oct 06, 2005 6:35 am

I have a simple job

CFF (EBCDIC) stage -----> Transformer -------> Sequential (ASCII)

The number of readers per node in the CFF stage is set to 1.
I'm using a Transformer because some conversions are needed, and at the same time I would like to sort the input.
To sort the input link, I need to choose a partition type.
I have checked the option to preserve the sort order on the output link.

Which partitioning technique will be most effective?

Thanks in advance

ArndW · Post by **ArndW** » Thu Oct 06, 2005 7:02 am

Instead of a direct response, let me ask a question that has a direct bearing: what is the partitioning method of a sequential output file?

kumar_s · Post by **kumar_s** » Thu Oct 06, 2005 10:14 pm

Hi,
Where have you planned to introduce the sort stage??

regards
kumar

apraman · Post by **apraman** » Thu Oct 06, 2005 10:23 pm

Kumar,

I will not use any extra sort stage.

Why should I use a Sort stage if I can sort the input within the Transformer and preserve the same sort order on the output of the Transformer?

Can you give me any valid reason for including a Sort stage in the posted scenario?

apraman · Post by **apraman** » Thu Oct 06, 2005 10:55 pm

Hi Arnd,

ArndW wrote: What is the partitioning method of a sequential output file?

I am not sure how to answer your question...

I think it all depends on the 'Number of Readers per Node' / 'Read from Multiple Nodes' option which you set.

case 1: 1 reader/node
CFF/SEQ File stage ------------------------> Entire Partitioning Type

case 2: Multiple reader/node or Multiple nodes
CFF/SEQ File stage -----------------------> Range Partitioning Type

Please correct me if I am wrong .......... :D

But how will that help me? If I select any valid partitioning type on the INPUT link of the TRANSFORMER, it will repartition the data as partitioned on the OUTPUT of the CFF stage.

ray.wurlod · Post by **ray.wurlod** » Fri Oct 07, 2005 12:10 am

You have answered a different question than the one asked.

When you are writing to a sequential file, what partitioning method is mandated?

apraman · Post by **apraman** » Fri Oct 07, 2005 2:49 am

Hi Ray,

ray.wurlod wrote:You have answered a different question than the one asked.

When you are writing to a sequential file, what partitioning method is mandated?

I do not know of any use for partitioning (for sequential data transfer)
while writing to a Sequential File stage, and there should not be any.

ArndW · Post by **ArndW** » Fri Oct 07, 2005 6:03 am

Apraman,

what Ray and I are subtly trying to point out to you is that a sequential file can only be written to by one process; so the partitioning information you are asking for is irrelevant.

apraman · Post by **apraman** » Sun Oct 16, 2005 9:55 pm

ArndW wrote:Apraman,

what Ray and I are subtly trying to point out to you is that a sequential file can only be written to by one process; so the partitioning information you are asking for is irrelevant.

Thanks,

But I need to choose a partition type in order to sort the input of the Transformer.
For sorting any data, 'HASH' partitioning is the best.

DS gurus, please correct me if it is wrong to use HASH partitioning in this context.

ray.wurlod · Post by **ray.wurlod** » Mon Oct 17, 2005 12:33 am

It's wrong. You're wrong. (You did request that.)

It's illegal to have more than one process write to the same sequential file.

This is not a DataStage rule; it's an operating system rule.

So what you want doesn't come into it - it's not possible.

apraman · Post by **apraman** » Mon Oct 17, 2005 2:33 am

ray.wurlod wrote:It's wrong. You're wrong. (You did request that.)

It's illegal to have more than one process write to the same sequential file.

I am getting your points, but I think I am unable to make you understand mine.

I have a single CFF stage with an EBCDIC file as the source.
I have a single Sequential File stage with an ASCII file as the target.

I need to do certain conversions, hence the Transformer between them, and through it I am also sorting the Transformer's input, which is retrieved from the CFF stage.

<pre>
sorted input
CFF (EBCDIC) stage -----> Transformer -------> Sequential (ASCII)
</pre>

To sort the data on the input link of the Transformer stage, DataStage requires you to select a partition type. I prefer HASH.

Now tell me, what is wrong here?

ArndW · Post by **ArndW** » Mon Oct 17, 2005 2:54 am

The partitioning algorithm, in the sense we are talking about here, is how the individual records are distributed across the PX node processes. So if you have 4 nodes and choose round-robin, each process gets a row in turn; if you choose hash on a field, then a number is generated from that field using a hashing algorithm and then mod(nodes) is used to distribute the rows.
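As a rough illustration of the two methods described above, here is a minimal Python sketch. It is not DataStage code: the function names, the sample rows, and the use of CRC32 as the hashing algorithm are all assumptions made for the example, and the hash actually used by the parallel engine is not stated in this thread.

```python
import zlib

def round_robin(rows, nodes):
    """Deal rows out in turn: row i goes to partition i % nodes."""
    parts = [[] for _ in range(nodes)]
    for i, row in enumerate(rows):
        parts[i % nodes].append(row)
    return parts

def hash_partition(rows, nodes, key):
    """Send each row to partition hash(key field) % nodes.

    CRC32 is only a stand-in for whatever hash the engine really uses.
    """
    parts = [[] for _ in range(nodes)]
    for row in rows:
        h = zlib.crc32(str(row[key]).encode("utf-8"))
        parts[h % nodes].append(row)
    return parts

# Hypothetical sample data: eight rows, three distinct customer keys.
rows = [{"id": i, "cust": c} for i, c in enumerate("ABCABCAB")]

rr = round_robin(rows, 4)            # rows 0 and 4 land in partition 0, and so on
hp = hash_partition(rows, 4, "cust") # all rows for one customer share a partition
single = hash_partition(rows, 1, "cust")  # one process: everything in one partition
```

With a single partition, both functions return the rows in their original order in one list, which is the sense in which the choice of method makes no difference when only one process is running.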

What Ray and I are saying is that your partitioning mode of choice is absolutely irrelevant since the file is sequential {variable length, since fixed length would make parallel reads possible} and thus you only have 1 process. No matter what method you choose your result will be the same.

Instead of going back and forth in this thread, why don't you just try it with different methods and if the result is different from what we've told you to expect then we can progress from there.

thompsonp · Post by **thompsonp** » Mon Oct 17, 2005 6:27 am

Apraman,

I may have misunderstood your question, but are you saying that the logic in your transformer requires the input data to be sorted?

If this is the case you can either place a sort before the transformer or use the partitioning / sorting on the input tab of the transformer.
In either case your choice of partitioning of the input data will depend on the logic in the transformer and the data. Presumably there are one or more fields in the input which you have to sort by and therefore records with identical values of this sort key should be processed on the same node. If this is the case you can hash partition on one or more of these keys. Do check that the data partitions evenly across the nodes rather than being skewed with many records on one node and not many on the others.
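The skew check suggested above can be sketched in a few lines of Python. This is an illustration only: CRC32 stands in for the real partitioning hash, and the helper names and the sample key distribution are hypothetical.

```python
import zlib
from collections import Counter

def partition_of(key_value, nodes):
    """Partition number for a key (CRC32 is an assumed stand-in hash)."""
    return zlib.crc32(str(key_value).encode("utf-8")) % nodes

def partition_sizes(keys, nodes):
    """Number of rows that would land on each partition."""
    counts = Counter(partition_of(k, nodes) for k in keys)
    return [counts.get(p, 0) for p in range(nodes)]

def skew_ratio(sizes):
    """Largest partition divided by the mean size; 1.0 means perfectly even."""
    mean = sum(sizes) / len(sizes)
    return max(sizes) / mean if mean else 0.0

# Hypothetical input: one dominant key value plus a handful of others.
keys = ["ACME"] * 90 + ["cust%d" % i for i in range(10)]
sizes = partition_sizes(keys, 4)
ratio = skew_ratio(sizes)
```

Because every row with the key `ACME` must go to the same partition, at least 90 of the 100 rows end up on one node here, whichever node that is. A ratio well above 1.0 is the sign that the chosen hash key distributes the data unevenly.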

As you are then writing to a sequential file this is a single process as has been stated earlier in the thread. Therefore the data from the transformer running on several nodes will need to be collected on a single node and you must also decide if it needs to be sorted as well.
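To show why the collection step matters for sort order, here is a small Python sketch, assuming each partition is already sorted on the key. The sample rows and function names are hypothetical; `heapq.merge` is used as a stand-in for a sort-merge style of collection, and plain concatenation as a stand-in for reading the partitions one after another.

```python
import heapq
from itertools import chain

def collect_in_partition_order(partitions):
    """Read partition 0 to the end, then partition 1, and so on."""
    return list(chain.from_iterable(partitions))

def collect_sort_merge(partitions, key):
    """Merge already-sorted partitions into one globally sorted stream."""
    return list(heapq.merge(*partitions, key=key))

# Hypothetical output of a Transformer on three nodes, each sorted on the key.
partitions = [
    [("A", 1), ("D", 4)],
    [("B", 2), ("E", 5)],
    [("C", 3)],
]
by_key = lambda row: row[0]

concatenated = collect_in_partition_order(partitions)
merged = collect_sort_merge(partitions, by_key)
# concatenated keys: A, D, B, E, C  (sorted within each partition only)
# merged keys:       A, B, C, D, E  (sorted overall)
```

Each partition being sorted does not make the collected file sorted; that depends on how the rows are brought together on the single writing node.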

kumar_s · Post by **kumar_s** » Mon Oct 17, 2005 7:22 am

Hi,
Option 1:
Make the Transformer operate in sequential mode, since both input and output are sequential files.

Option 2:
Insert a Sort stage in front of the Transformer, partition and sort the data at that stage, and perform the logic in parallel (which is not the case in the previous option). The output then gets written sequentially.

Won't option 2 be better? All data will wait at the Sort stage until it is fully sorted. Once it is sorted, it will be processed in parallel.

Or, as Ray said, won't the parallel option of the Transformer work for the sequential output?

regards
kumar

ray.wurlod · Post by **ray.wurlod** » Mon Oct 17, 2005 3:53 pm

Partitioning is completely irrelevant for output links.

Partitioning (and, perhaps, sorting) is relevant on input links. If you're writing to a sequential file, it must run in sequential mode.

If you don't need to sort, don't. You don't need to sort to write rows into a sequential file, unless the consumer of that file requires it to be sorted.

If you do need to sort, it doesn't matter whether you use a Sort stage or sorting as a property of the input link; it will block rows. It must block rows. Think about it.
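The reason a sort must block can be seen in a small Python sketch: the smallest row might be the last one to arrive, so nothing can be emitted until the whole input has been read. The stage functions below are hypothetical stand-ins written with generators, not DataStage operators.

```python
def counting_source(rows, counter):
    """Yield rows one at a time, counting how many have been read so far."""
    for row in rows:
        counter["read"] += 1
        yield row

def passthrough(source):
    """A non-blocking stage: each row is emitted as soon as it arrives."""
    for row in source:
        yield row

def sort_rows(source):
    """A sorting stage: sorted() has to drain the source before emitting anything."""
    for row in sorted(source):
        yield row

def rows_read_before_first_output(stage, rows):
    """How many input rows the stage consumed before producing its first row."""
    counter = {"read": 0}
    out = stage(counting_source(rows, counter))
    next(out)
    return counter["read"]

rows = [5, 3, 9, 1]
streamed = rows_read_before_first_output(passthrough, rows)  # 1
blocked = rows_read_before_first_output(sort_rows, rows)     # 4
```

The pass-through stage emits its first row after reading one input row, while the sorting stage has read all four before emitting anything. That holds whether the sort sits in its own stage or on an input link.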

The Sort stage gives a little more flexibility and control over consumption of memory than is available for input link sorting.