Effective partition type for sorted input in Transformer
I have a simple job:
CFF (EBCDIC) stage -----> Transformer -------> Sequential (ASCII)
The number of readers per node on the CFF stage is set to 1.
I'm using a Transformer because some conversions are needed, and at the same time I would like to sort the input.
To sort the input link, I need to choose a partition type.
I have checked the option to preserve the sorted order in the output link.
Which partitioning technique will be most effective?
Thanks in advance
Last edited by apraman on Mon Oct 17, 2005 10:21 pm, edited 2 times in total.
Arun
Instead of a direct response, let me ask a question that has a direct bearing: what is the partitioning method of a sequential output file?
Hi Arnd,
I may be too blunt to answer your question.....
ArndW wrote: What is the partitioning method of a sequential output file?
I think it all depends on the 'Number Of Readers Per Node' / 'Read From Multiple Nodes' option that you set.
case 1: 1 reader/node
CFF/SEQ File stage ------------------------> Entire Partitioning Type
case 2: Multiple readers/node or Multiple nodes
CFF/SEQ File stage -----------------------> Range Partitioning Type
Please correct me if I am wrong .......... :D
But how will that help me? If I select any valid partitioning type on the INPUT link of the TRANSFORMER, it will repartition the data as partitioned by the OUTPUT of the CFF stage.
Arun
Hi Ray,
ray.wurlod wrote: You have answered a different question than the one asked.
When you are writing to a sequential file, what partitioning method is mandated?
I do not know of any use for partitioning (the data transfer being sequential) when writing to a Sequential File stage. And there should not be any.
Arun
Apraman,
what Ray and I are subtly trying to point out to you is that a sequential file can only be written to by one process; so the partitioning information you are asking for is irrelevant.
ArndW wrote: Apraman,
what Ray and I are subtly trying to point out to you is that a sequential file can only be written to by one process; so the partitioning information you are asking for is irrelevant.
Thanks,
But I need to choose a partition type in order to sort the input of the Transformer.
For sorting any data, 'HASH' partitioning is the best.
DS gurus, please correct me if it is wrong to use HASH partitioning in this context.
Arun
It's wrong. You're wrong. (You did request that.)
It's illegal to have more than one process write to the same sequential file.
This is not a DataStage rule; it's an operating system rule.
So what you want doesn't come into it - it's not possible.
IBM Software Services Group
Any contribution to this forum is my own opinion and does not necessarily reflect any position that IBM may hold.
ray.wurlod wrote:It's wrong. You're wrong. (You did request that.)
It's illegal to have more than one process write to the same sequential file.
I get your points, but I think I am unable to make you understand mine.
I have a single CFF stage with an EBCDIC file as the source.
I have a single Sequential File stage with an ASCII file as the target.
I need to do certain conversions, hence the Transformer between them, and through it I am also sorting the Transformer's input, which is retrieved from the CFF stage.
<pre>
sorted input
CFF (EBCDIC) stage -----> Transformer -------> Sequential (ASCII)
</pre>
To sort the data on the input link of the Transformer stage, DataStage requires you to select a partition type. I prefer HASH.
Now tell me, what is wrong here?
Arun
The partitioning algorithm in the sense we are talking about here is how the individual records are distributed across the PX node processes. So if you have 4 nodes and choose round-robin, each process gets a row in turn; if you choose hash on a field, then a number is generated from that field using a hashing algorithm, and mod(nodes) of that number is used to pick the node.
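To make the two methods concrete, here is a minimal Python sketch. It is not DataStage code: the function names, the sample rows and the use of CRC32 as the hash are all assumptions made for illustration, and the engine's real hash function is not shown anywhere in this thread.

```python
import zlib


def round_robin(rows, nodes):
    """Deal rows out to the partitions in turn, one row per node."""
    parts = [[] for _ in range(nodes)]
    for i, row in enumerate(rows):
        parts[i % nodes].append(row)
    return parts


def hash_partition(rows, nodes, key):
    """Send each row to partition hash(row[key]) mod nodes."""
    parts = [[] for _ in range(nodes)]
    for row in rows:
        # CRC32 is only a stand-in hash for this sketch (an assumption).
        h = zlib.crc32(str(row[key]).encode("utf-8"))
        parts[h % nodes].append(row)
    return parts


# Hypothetical sample data: a customer key and an amount.
rows = [{"cust": c, "amt": a} for c, a in
        [("A", 1), ("B", 2), ("A", 3), ("C", 4), ("B", 5), ("A", 6)]]

rr = round_robin(rows, 4)             # rows 0 and 4 land on node 0, and so on
hp = hash_partition(rows, 4, "cust")  # all rows for one customer share a node

# With a single process there is only one partition, so both methods
# hand back the same rows in the same order.
single_rr = round_robin(rows, 1)
single_hp = hash_partition(rows, 1, "cust")
```

With one node, `single_rr` and `single_hp` are identical, which is the sense in which the choice of method makes no difference when only one process is involved.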
What Ray and I are saying is that your partitioning mode of choice is absolutely irrelevant since the file is sequential {variable length, since fixed length would make parallel reads possible} and thus you only have 1 process. No matter what method you choose your result will be the same.
Instead of going back and forth in this thread, why don't you just try it with different methods and if the result is different from what we've told you to expect then we can progress from there.
Apraman,
I may have misunderstood your question, but are you saying that the logic in your transformer requires the input data to be sorted?
If this is the case you can either place a sort before the transformer or use the partitioning / sorting on the input tab of the transformer.
In either case your choice of partitioning of the input data will depend on the logic in the transformer and the data. Presumably there are one or more fields in the input which you have to sort by and therefore records with identical values of this sort key should be processed on the same node. If this is the case you can hash partition on one or more of these keys. Do check that the data partitions evenly across the nodes rather than being skewed with many records on one node and not many on the others.
As you are then writing to a sequential file this is a single process as has been stated earlier in the thread. Therefore the data from the transformer running on several nodes will need to be collected on a single node and you must also decide if it needs to be sorted as well.
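The flow described above (hash partition on the sort key, sort within each partition, then collect onto one node for the sequential write) can be sketched in Python as follows. This is an illustration only, not DataStage code: the function names, the CRC32 stand-in hash and the sample rows are assumptions, and `heapq.merge` merely plays the role of a sorted-merge collection step.

```python
import heapq
import zlib


def hash_partition(rows, nodes, key):
    """Rows with the same key value always land in the same partition."""
    parts = [[] for _ in range(nodes)]
    for row in rows:
        h = zlib.crc32(str(row[key]).encode("utf-8"))  # stand-in hash (assumption)
        parts[h % nodes].append(row)
    return parts


def sort_partitions(parts, key):
    """Each node sorts only the rows it holds."""
    return [sorted(p, key=lambda r: r[key]) for p in parts]


def sort_merge_collect(parts, key):
    """Merge already-sorted partitions into one globally sorted stream."""
    return list(heapq.merge(*parts, key=lambda r: r[key]))


# Hypothetical input rows with a sort key named "key".
rows = [{"key": k, "seq": i} for i, k in enumerate("dbacabdc")]

parts = sort_partitions(hash_partition(rows, 3, "key"), "key")

# Skew check: how many rows did each node receive?
sizes = [len(p) for p in parts]

# Single-process collection, ready to be written to one sequential file.
collected = sort_merge_collect(parts, "key")
```

Simply concatenating the partitions instead of merging them would keep each partition's internal order but not, in general, a global one, which is why the collection method matters if the output file has to be sorted.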
Hi,
Option 1:
Make the Transformer operate in sequential mode, since both the input and the output are sequential files.
Option 2:
Insert a Sort stage in front of the Transformer, partition and sort the data at that stage, and perform the logic in parallel (which does not happen in the previous case). Then it gets written sequentially.
Won't option 2 be better? All the data will wait at the Sort stage until it is fully sorted. Once it is sorted, it will be processed in parallel.
Or, as Ray said, won't the parallel option of the Transformer work for the sequential output?
regards
kumar
Partitioning is completely irrelevant for output links.
Partitioning (and, perhaps, sorting) is relevant on input links. If you're writing to a sequential file, it must run in sequential mode.
If you don't need to sort, don't. You don't need to sort to write rows into a sequential file, unless the consumer of that file requires it to be sorted.
If you do need to sort, it doesn't matter whether you use a Sort stage or sorting as a property of the input link; it will block rows. It must block rows. Think about it.
The Sort stage gives a little more flexibility and control over consumption of memory than is available for input link sorting.
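Why a sort has to block can be seen in a small Python sketch (an illustration with made-up names, not DataStage code): the smallest row might be the very last one to arrive, so nothing can be emitted until the whole input has been read.

```python
def source(rows, log):
    """Upstream stage: hands rows over one at a time and logs each read."""
    for row in rows:
        log.append(("read", row))
        yield row


def blocking_sort(stream, log):
    """Sort step: must consume the entire input before emitting anything."""
    buffered = sorted(stream)  # reads every upstream row first
    for row in buffered:
        log.append(("emit", row))
        yield row


log = []
out = blocking_sort(source([3, 1, 2], log), log)

# Ask for just the first sorted row and look at what had to happen first.
first = next(out)
# log is now: read 3, read 1, read 2, emit 1
```

All three reads appear in `log` before the first emit, regardless of whether the sort is a separate stage or a property of an input link.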
IBM Software Services Group
Any contribution to this forum is my own opinion and does not necessarily reflect any position that IBM may hold.