Add column to input

kumarjit · Post by **kumarjit** » Tue Mar 24, 2015 11:14 pm

Hello All.

I've a source text file with data like below:

Field1
A
A
B
C
C
D

With this data, I have to generate a temporary dataset, as below:

Code:

Field1 Field2
A      Y 
A      Y
B      N
C      Y
C      Y
D      N

Field2 is populated based on the following logic:
1. If the corresponding Field1 value has duplicates, then Y
2. Else N
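
As a plain-Python illustration of this rule (not a DataStage job; the function name `flag_duplicates` is made up for the example), Field2 can be derived by counting how often each Field1 value occurs:

```python
from collections import Counter

def flag_duplicates(values):
    """Return (Field1, Field2) pairs: 'Y' if the value occurs more than once, else 'N'."""
    counts = Counter(values)  # occurrences of each Field1 value
    return [(v, "Y" if counts[v] > 1 else "N") for v in values]

rows = flag_duplicates(["A", "A", "B", "C", "C", "D"])
for field1, field2 in rows:
    print(field1, field2)
```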

The way I know to do this (maybe a bit crude) is:

Code:

1. Create a parallel job like the one below
             Seq file2
               |
               |
            Aggregator(record count, group by Field1)
               |
               |
Seq file1-----Join-----------------Seq File2(write the join output)
             based on Field1

2. Execute a Unix command as an after-job subroutine that adds a column as follows:
a. For records where the record count is > 1, add a new field with value Y
b. For records where the record count is = 1, add a new field with value N

But I need to create a dataset output, not a text file, and Unix commands do not work on dataset files.
Since I only need to introduce a new column, can't the Column Generator stage serve this purpose, without using any Transformer/Sort?

Please help.

Regards,
Kumarjit.

udayanguha · Post by **udayanguha** » Wed Mar 25, 2015 6:54 am

You can use a Sort stage to generate a key change column. Then, in the Transformer, use stage variables to check for a change in the key change column and assign the value accordingly.
If the key change column is '0', assign 'Y'. If the previous key change column was '0' and the current one is '1', assign 'Y', else 'N'.
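
A rough Python sketch of the key-change idea, assuming the usual Sort stage convention that the key change column is 1 on the first row of each group and 0 on the rest. Because the first row of a group carries 1 whether or not duplicates follow, this sketch also looks at the next row's flag. That is only one possible reading of the stage-variable logic and not necessarily what the poster had in mind. The helper names are hypothetical.

```python
def key_change(sorted_values):
    """Mimic a key change column: 1 on the first row of each group, else 0."""
    flags = []
    previous = object()  # sentinel that equals no real value
    for v in sorted_values:
        flags.append(1 if v != previous else 0)
        previous = v
    return flags

def flag_from_key_change(sorted_values):
    """A row is 'Y' if it repeats the previous key or the next row repeats it."""
    kc = key_change(sorted_values)
    result = []
    for i, v in enumerate(sorted_values):
        repeats_previous = kc[i] == 0
        next_repeats = i + 1 < len(kc) and kc[i + 1] == 0
        result.append((v, "Y" if repeats_previous or next_repeats else "N"))
    return result

flagged = flag_from_key_change(sorted(["A", "A", "B", "C", "C", "D"]))
```

The list-based lookahead is a convenience of the sketch; a stage that sees one row at a time would need some other way to learn whether a duplicate follows.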

kumarjit · Post by **kumarjit** » Wed Mar 25, 2015 7:34 am

If you had checked the last few lines of my post, you might remember that I'm trying to achieve this goal WITHOUT USING TRANSFORMER/SORT STAGES.

Anyway, thanks for your feedback.

Regards.

ray.wurlod · Post by **ray.wurlod** » Wed Mar 25, 2015 3:34 pm

Are you permitted to use a Modify stage?

If yes, use a Column Generator to generate "Y" for all rows, then use the Modify stage to convert the NULL from the left outer join into "N". And/or use a fork/join to split the streams based on the result of the join (or lookup).
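
One way to picture this suggestion in Python, under the assumption that the "Y" column is generated on the stream of duplicated keys (those with a count above 1) before a left outer join back to the source rows. The post does not spell out which stream carries the generated column, so that placement is a guess. `None` stands in for the NULL produced by the left outer join.

```python
from collections import Counter

source = ["A", "A", "B", "C", "C", "D"]

# Reference stream (assumed): keys that occur more than once, each given a
# constant "Y", which is the role a Column Generator could play in this reading.
dup_flags = {key: "Y" for key, n in Counter(source).items() if n > 1}

# Left outer join: unmatched source rows get None, standing in for NULL.
joined = [(key, dup_flags.get(key)) for key in source]

# Null handling (the Modify stage's role in this reading): replace None with "N".
output = [(key, flag if flag is not None else "N") for key, flag in joined]
```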

kumarjit · Post by **kumarjit** » Wed Mar 25, 2015 8:50 pm

Thanks Ray, but I was not able to view your full post as it's Premium Content.

However, I will try to change the design to the extent I was able to see in your post.

Regards,
Kumarjit.

AshishDevassy · Post by **AshishDevassy** » Thu Mar 26, 2015 8:41 am

Is there a reason that you don't wish to use a Transformer?

kumarjit · Post by **kumarjit** » Fri Mar 27, 2015 10:18 pm

AshishDevassy wrote: Is there a reason that you don't wish to use a Transformer?

I would rather not weigh down the job when the same can be achieved by other lightweight stages like Column Generator and/or Modify.

Regards.

kumarjit · Post by **kumarjit** » Fri Mar 27, 2015 11:02 pm

What I'm trying to do is:

Code:


           Seq file2 
               | 
               | 
            Aggregator(record count, group by Field1) 
               | 
               | 
Seq file1-----Join-----------------Column Generator Stage--------------------------Target Dataset
                                   (column to generate=F2, column method=Explicit)

In the Mapping tab of the Column Generator stage, add the following as the derivation for the output field F2:
If input.count = 1 Then 'N' Else 'Y'

But, can I create such derivations against an output column of the column generator stage?

Please advise.

Regards.

ray.wurlod · Post by **ray.wurlod** » Sun Mar 29, 2015 3:56 pm

kumarjit wrote: I would rather not weigh down the job when the same can be achieved by other lightweight stages like Column Generator and/or Modify.

You are relying on out-of-date knowledge. These days (since about version 8.7) the Transformer stage is no less efficient than most other stages; sometimes it's more efficient (for example, than the Filter stage).

kumarjit · Post by **kumarjit** » Tue Mar 31, 2015 12:57 am

ray.wurlod wrote:You are relying on out-of-date knowledge. ...

I'm afraid I have to admit that it's true in some sense. But if there are no more than 1K rows in the input, should I be trying something as time-consuming as a Transformer?

Please advise.

Regards.

priyadarshikunal · Post by **priyadarshikunal** » Tue Mar 31, 2015 1:48 am

What makes you think the Transformer is a time-consuming stage? The weight of the Transformer has decreased over time and it's not an expensive stage anymore. Now it's even lighter than the Filter and Switch stages. If you can combine the work of two or more stages in a Transformer, it may give you a better result as well.

I think you were not able to see the complete reply from Ray.

priyadarshikunal · Post by **priyadarshikunal** » Tue Mar 31, 2015 1:51 am

In addition, the Join and Aggregator stages need sorted as well as partitioned data, so a sort will be inserted under the covers anyway.

kumarjit · Post by **kumarjit** » Wed Apr 01, 2015 5:43 am

I'm not a premium member, and I'm not able to view Ray's posts.
Anyways, thanks to all of you for your time and suggestions.

Warm Regards.

qt_ky · Post by **qt_ky** » Wed Apr 01, 2015 8:01 am

Well by all means, sign up. It's incredibly affordable.