Duplicates/Sorting question
Hello,
How can I get rid of duplicates in columns using DataStage?
My file looks like this:
ArmConvOpt,ArmIndexCode,ArmNegAmrFlag,ArmRoundFlag,BorrSelfEmpFlag,BorrSex,CoBorrSex,DocumentCode,DwellTypeCode,LimitDocFlag
NONE,100,,N/A,,,,,AUCTION,WHO KNOWS
NONE,100,,N/A,,,,,AUCTION,0
NONE,0,,N/A,,,,,0,0
NONE,0,,0,,,,,FAM,0
NONE,0,,0,,,,,FAM,NO INCOME/ FULL ASSET
N/A,ERROR,,0,,,,,0,XX
N/A,ERROR,,0,,,,,0,0
Needs to look like this:
ArmConvOpt,ArmIndexCode,ArmNegAmrFlag,ArmRoundFlag,BorrSelfEmpFlag,BorrSex,CoBorrSex,DocumentCode,DwellTypeCode,LimitDocFlag
NONE,100,,N/A,,,,,AUCTION,WHO KNOWS
N/A,ERROR,,,,,,,FAM,NO INCOME/ FULL ASSET
,,,,,,,,,100
,,,,,,,,,XX
The row order does not matter; I just need to get rid of the duplicates in each column.
Your help is much appreciated!!
Thanks
Hi Emilio,
I didn't exactly follow your sample data. You can do a search on the forum to find lots of info on this topic.
Couple of options:
1) Pass through a hash file taking advantage of its destructive overwrite behavior (last in wins)
2) Use the aggregator stage
If the data is sorted by key:
3) Use stage variables to test for a key change combined with output constraints
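For option 3, here's a minimal sketch of the Transformer stage variables (illustrative only; it assumes the input is sorted on a single key column, and the names svIsNewKey/svPrevKey are mine):
Code: Select all
Stage variables (evaluated top to bottom on every row):
    svIsNewKey : If In.KeyCol <> svPrevKey Then 1 Else 0
    svPrevKey  : In.KeyCol

Output link constraint (passes only the first row of each key group):
    svIsNewKey = 1
Give svPrevKey an initial value that cannot occur in the data so the first row is treated as a new key.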
If the data volume is very large, you will probably want to feed an aggregator stage with sorted data.
I personally like option 1.
Mike
Hi Emilio,
There are several methods to remove the duplicates. Below are some sample approaches:
1) You can define all the columns that you want to remove duplicates from as key columns and write them into a hash file.
2) You can use the aggregator and the count > 1 to filter duplicates
3) You can also use stage variables to achieve this
This topic has been discussed several times in this forum. Please use the search facility for more details.
Thanks,
Naveen
Emilio,
Are you looking to remove duplicates on just the first column? That is what your sample seems to be doing. In that case, if your data is sorted, you can do it with a stage variable. The prerequisite for doing this in a normal stage is that the data is sorted by the columns you want to use as your criteria.
Thank you for your replies.
I have some questions:
Naveen:
1) I was trying the aggregator to do the count >1. Where do you enter the >1 criteria in the aggregator?
2) How do you use the stage variables? Please explain this a little more.
Mike:
If this does not work I would like to try the hash file method... Do you create the hash file and bring it back, and that eliminates the duplicates? Wow!!... that would be great.
Arnd:
I want to remove duplicates in all the columns. The order afterwards does not matter.
Thanks
Emilio,
I don't think I quite understand your requirement. Are you saying that all of the columns comprise the key? Or are you saying that you want to remove duplicates from each column independently? In the latter case, what criteria would be used to create rows from all of the unique sets of column values?
A hash file only has one row for each key value. Whenever you insert a row with a key value that already exists, the existing row is overwritten with the new row.
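To illustrate the destructive overwrite with some made-up two-column data (key on the first column):
Code: Select all
Rows written to the hash file        Hash file contents afterwards
(in this order, key = Col1):         (one row per key, last in wins):
A,1
A,2                                  A,3
A,3
B,9                                  B,9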
Can you supply a more simplified example? For instance, an example with 3 columns and 3 rows.
Mike
Hi Emilio,
In the aggregator, what I meant by "You can use the aggregator and the count > 1" is the following:
For each key column you need to define a dummy column, group on the key column, and use the count aggregate function to populate the dummy column. After you do that, write the result to a sequential file, filter the records with the constraint count > 1 in a transformer, store those keys in a hash file, and use them against the original input to eliminate the duplicates.
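As a rough sketch of that flow (the stage, link, and column names here are just placeholders):
Code: Select all
Aggregator (one per key column):
    KeyCol    -> Group By
    DupCount  -> derivation: Count(KeyCol)

Transformer constraint on the link feeding the hash file of duplicated keys:
    AggLink.DupCount > 1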
Achieving the same thing with stage variables is kind of messy, especially since you want to do it on all columns.
The easiest option for you is the hash file option.
Thanks,
Naveen
Hi Emilio,
If you want to remove duplicates in each column, then you need to make all of them key columns before you write them to a hash file. But keep in mind that this method follows a destructive write and only the last row wins.
After you get all the unique keys into a hash file, you can use a transformer in which you read the input and perform the lookup using the hash file generated in the previous step.
Code: Select all
  Input                            Output
----------->  Transformer  ------------->
                   ^
                   !
                   !  Lookup
                   !
               Hash File
Thanks,
Naveen
Emilio,
I can't believe the most interesting topic of the week was posted when I was asleep. Well now you Yanks are going home to bed and it's Australia's turn...
Here's my understanding of your requirement: given an input file with n columns, you want to produce an output file where no single value of any of the n columns is repeated.
What is not clear to me is whether the number of output rows is important. For example, would you accept the following as an alternative output for your example above?
Code: Select all
NONE, 100,
, 200,
,,AUCTION
If so, then I think I have a solution. I have attached a routine below that I used for a different problem; I just wanted a generic method of removing duplicates. Here's what you do:
- Copy the code below and create a DS Routine called IsUnique with 4 parameters: TEMPDIR, HashFile, Initialize, and Val
- In your DS Job, create a stage variable for each output column with expressions of the form:
Code: Select all
If IsUnique("X:\xxx\yyy\temp", DSJOB : "<col_name>", Initialize, In_link.col) Then
In_link.col
Else
@NULL
- Create another stage variable called Initialize BELOW the other stage variables. Set the default value to 1, and the expression to 0. This means that it will be 1 for the first row processed (causing the IsUnique hash file mechanism to be initialized) and then 0 for each subsequent row.
- Map each of the stage variables to their respective columns.
How does it work? For each column, a different hash file is created in a temp directory of your choosing. You pass a value into the routine (and tell it which hash file to look in) and it tells you whether the value is new (success) or has already been seen (failure). If the routine returns success, then you send the column value to the output link; otherwise send @NULL.
A couple of notes:
- I wrote this for Unix only. There is a Unix command I have highlighted with stars that you will have to translate into DOS. It is basically this:
Code: Select all
CD <TEMPDIR>
DEL <TEMPDIR>\<HashFile>
%DSHOME%\bin\mkdbfile <HashFile> 30 1 4 20 50 80 1628
although in Unix I separated the commands in such a way that if one failed the next would not run.
- I'm a UV BASIC beginner. Hopefully some bigger brains than mine will critique this routine (nicely, of course) so that I can improve it - listen to them.
- I originally designed it without the ability to handle NULLs in the Val param. I made a change in this version that should work, but I didn't test it.
- I had to remove some references to other custom routines in our library. It should still compile, but...
Good luck.
Code: Select all
Common /CDMUniq/ Initialized, UniqFileName, UniqFileHandle
EQUATE RoutineName TO 'IsUnique'

Ans = 0
* Routine is written for Unix, so use a "/" path separator
HashPath = TEMPDIR : "/" : HashFile

* If the Initialize parameter is set, remove and rebuild the hash file
If Initialize Then
   Initialized = 0

   * The Unix command does 3 things (and fails immediately if any of them fail):
   *    : Go to the TEMP directory
   *    : Remove the existing hash file if it exists
   *    : Create a new hash file with mkdbfile
   * I had to use this method because it would fail with an "unable to create ..." error
   * if I created it with the full path name from another directory. Don't know why.
   ****************************
   StrCommand = "cd " : TEMPDIR : " && rm -rf " : HashFile : " && $DSHOME/bin/mkdbfile " : HashFile : " 30 1 4 20 50 80 1628"
   ****************************
   Call DSExecute("UNIX", StrCommand, OutPut, RetCode)
   If RetCode <> 0 Then
      Call DSLogWarn("Unix error (" : RetCode : ") in " : StrCommand : " ::::: " : OutPut, RoutineName)
   End

   * Attempt to open the newly created hash file
   Openpath HashPath To UniqFileHandle Then
      Initialized = 1
      UniqFileName = HashPath
   End Else
      Call DSLogWarn("Unable to open hash file " : HashPath, RoutineName)
      Ans = -1
   End
End

If Initialized Then
   * If the currently open file is a different file, then open the correct file
   If UniqFileName <> HashPath Then
      Openpath HashPath To UniqFileHandle Then
         Initialized = 1
         UniqFileName = HashPath
      End Else
         Call DSLogWarn("Unable to open hash file " : HashPath, RoutineName)
         Ans = -1
      End
   End

   If Ans = 0 And Not(IsNull(Val)) Then
      * Check whether Val has already been seen for this column
      Readu ValExists From UniqFileHandle, Val Then
         * Value already exists: return FALSE
         Ans = 0
      End Else
         * Value not seen before: record it and return TRUE
         Writeu Val On UniqFileHandle, Val Then
            Ans = 1
         End Else
            Call DSLogWarn("Unable to write hash file " : HashFile, RoutineName)
            Ans = -1
         End
      End
   End
End
Ross Leishman
hash may not work?
I saw from your example that you need to keep "the most complete combination"??
i.e.:
NONE, 100, AUCTION -- this row is a combination of rows 1, 3, 4, 5
If you use a hash file (destructive write), I think you'll end up with
NONE, ,AUCTION
Here's what I'd do:
(I don't know how to make a drawing in ASCII, so bear with me here)
1. Use your file as input
2. Transformer with a hashfileA lookup
3. Output to the same hashfileA
In the lookup, you need to combine the input file & hashfileA,
i.e.:
input: NONE, 100, ''
+hash: NONE, '', AUCTION
needs to return: NONE, maxOf(100,'')=100, maxOf('',AUCTION)=AUCTION
This way, the row written to the hash file will be the most complete combination once you get to the end of the file.
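As a minimal sketch (column names are illustrative), the derivation for each non-key column on the link writing back to hashfileA could look like this; the IsNull test covers the first time a key is seen, when the lookup fails:
Code: Select all
If IsNull(Hash.Col2) Or In.Col2 > Hash.Col2 Then In.Col2 Else Hash.Col2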