Duplicates/Sorting question

Post questions here relative to DataStage Server Edition for such areas as Server job design, DS Basic, Routines, Job Sequences, etc.

Moderators: chulett, rschirm, roy

Emilio
Participant
Posts: 17
Joined: Wed Jan 28, 2004 2:18 pm
Location: Frederick, MD

Duplicates/Sorting question

Post by Emilio »

Hello,
How can I get rid of duplicates in columns using DataStage? :?:

My file looks like this:

ArmConvOpt,ArmIndexCode,ArmNegAmrFlag,ArmRoundFlag,BorrSelfEmpFlag,BorrSex,CoBorrSex,DocumentCode,DwellTypeCode,LimitDocFlag
NONE,100,,N/A,,,,,AUCTION,WHO KNOWS
NONE,100,,N/A,,,,,AUCTION,0
NONE,0,,N/A,,,,,0,0
NONE,0,,0,,,,,FAM,0
NONE,0,,0,,,,,FAM,NO INCOME/ FULL ASSET
N/A,ERROR,,0,,,,,0,XX
N/A,ERROR,,0,,,,,0,0

Needs to look like this:

ArmConvOpt,ArmIndexCode,ArmNegAmrFlag,ArmRoundFlag,BorrSelfEmpFlag,BorrSex,CoBorrSex,DocumentCode,DwellTypeCode,LimitDocFlag
NONE,100,,N/A,,,,,AUCTION,WHO KNOWS
N/A,ERROR,,,,,,,FAM,NO INCOME/ FULL ASSET
,,,,,,,,,100
,,,,,,,,,XX

The row order does not matter; I just need to get rid of duplicates in each column.
Your help is much appreciated!! :)
Thanks
Mike
Premium Member
Posts: 1021
Joined: Sun Mar 03, 2002 6:01 pm
Location: Tampa, FL

Post by Mike »

Hi Emilio,

I didn't exactly follow your sample data. You can do a search on the forum to find lots of info on this topic.

Couple of options:
1) Pass through a hash file taking advantage of its destructive overwrite behavior (last in wins)
2) Use the aggregator stage

If the data is sorted by key:
3) Use stage variables to test for a key change combined with output constraints (a rough sketch of this follows below)

If the data volume is very large, you will probably want to feed an aggregator stage with sorted data.

I personally like option 1.
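
For illustration only (the link and column names are invented), the key-change test in option 3 might be set up in a Transformer like this, assuming the input arrives sorted on a single key column:

Code: Select all

* Stage variables, evaluated top to bottom on every row
svNewKey  : If In.KeyCol <> svLastKey Then 1 Else 0
svLastKey : In.KeyCol

* Output link constraint: pass only the first row of each key group
svNewKey = 1
Give svLastKey an initial value that can never occur as a real key, so the very first row is treated as a key change.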

Mike
pnchowdary
Participant
Posts: 232
Joined: Sat May 07, 2005 2:49 pm
Location: USA

Post by pnchowdary »

Hi Emilio,

There are several methods to remove the duplicates. Below are some sample approaches

1) You can define all the columns that you want to remove duplicates from as key columns and write them into a hash file.

2) You can use the aggregator and the count > 1 to filter duplicates

3) You can also use stage variables to achieve this

This topic has been discussed several times in this forum. Please use the search facility for more details.
Thanks,
Naveen
ArndW
Participant
Posts: 16318
Joined: Tue Nov 16, 2004 9:08 am
Location: Germany
Contact:

Post by ArndW »

Emilio,

Are you looking to remove duplicates on just the first column? That is what your sample seems to be doing. In that case, if your data is sorted, you can do it with a stage variable. The prerequisite for doing this in a normal stage is that the data is sorted by the columns you want to use as your criteria.
Emilio
Participant
Posts: 17
Joined: Wed Jan 28, 2004 2:18 pm
Location: Frederick, MD

Post by Emilio »

Thank you for your replies:
I have some questions:
Naveen:
1) I was trying to use the aggregator to do the count > 1. Where do you enter the > 1 criteria in the aggregator? :?
2) How do you use the stage variables? Could you explain this a little more?

Mike:
If this does not work, I would like to try the hash file method... Do you create the hash file and bring it back, and that eliminates the duplicates? Wow!! That would be great. :o

Arnd:
I want to remove duplicates in all the columns. The order afterwards does not matter.
Thanks :)
Mike
Premium Member
Posts: 1021
Joined: Sun Mar 03, 2002 6:01 pm
Location: Tampa, FL

Post by Mike »

Emilio,

I don't think I quite understand your requirement. Are you saying that all of the columns comprise the key? Or are you saying that you want to remove duplicates from each column independently? In the latter case, what criteria would be used to create rows from all of the unique sets of column values?

A hash file only has one row for each key value. Whenever you insert a row with a key value that already exists, the existing row is overwritten with the new row.
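
To make that concrete with a tiny made-up example (Col1 as the only key column):

Code: Select all

Rows written to the hash file (in order):    Hash file contents afterwards:
A,1                                          A,3
B,2                                          B,2
A,3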

Can you supply a simpler example? For instance, one with 3 columns and 3 rows.

Mike
Emilio
Participant
Posts: 17
Joined: Wed Jan 28, 2004 2:18 pm
Location: Frederick, MD

Post by Emilio »

Mike, here is a better 3-column sample:
ArmConvOpt ArmIndexCode ArmNegAmrFlag
NONE 100
NONE 200
NONE 100
NONE AUCTION
NONE AUCTION

Needs to look like this:
ArmConvOpt ArmIndexCode ArmNegAmrFlag
NONE 100 AUCTION
200
I hope this makes more sense...
Emilio
Participant
Posts: 17
Joined: Wed Jan 28, 2004 2:18 pm
Location: Frederick, MD

Post by Emilio »

The sample did not come up too well.
Here is a comma-delimited one:
ArmConvOpt, ArmIndexCode, ArmNegAmrFlag
NONE, 100,
NONE, 200,
NONE, 100,
NONE,,AUCTION
NONE,,AUCTION

Needs to look like this:
ArmConvOpt, ArmIndexCode, ArmNegAmrFlag
NONE, 100, AUCTION
,200,
Mike
Premium Member
Posts: 1021
Joined: Sun Mar 03, 2002 6:01 pm
Location: Tampa, FL

Post by Mike »

Emilio,

Thanks. I think I understand your requirement now. I have a couple questions that will influence a job design:

1) How many columns are there?
2) Will the number of columns always be the same?
3) How many rows will typically be processed?

Mike
Emilio
Participant
Posts: 17
Joined: Wed Jan 28, 2004 2:18 pm
Location: Frederick, MD

Post by Emilio »

There are 10 columns and probably about 10 to 100 rows....
Emilio
pnchowdary
Participant
Posts: 232
Joined: Sat May 07, 2005 2:49 pm
Location: USA

Post by pnchowdary »

Hi Emilio,

What I meant in the aggregator suggestion by "You can use the aggregator and the count > 1" was the following:

For each key column, define a dummy column, group on the key column, and use the count aggregate function to populate the dummy column. After that, write the results to a sequential file, filter the records in a transformer with the constraint count > 1, store those keys in a hash file, and use them as a lookup against the original input to eliminate the duplicates (sketched below).
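
One reading of that description, sketched in the style of the diagrams used on this forum (stage, link, and column names are only placeholders; the intermediate sequential file is omitted for brevity):

Code: Select all

 Input --> Aggregator --------> Transformer ----------> DupKeys hash file
           (group on key,       (constraint:
            COUNT into a         DupCount > 1)
            DupCount column)

 Input --> Transformer --------------------------------> De-duplicated output
                ^
                !  Lookup against DupKeys
                !  (constraint: the lookup fails, i.e. the key was not counted more than once)
           DupKeys hash file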

Achieving the same thing with stage variables is kind of messy, especially since you want to do it on all columns.

The easiest option for you is the hash file option.
Thanks,
Naveen
Emilio
Participant
Posts: 17
Joined: Wed Jan 28, 2004 2:18 pm
Location: Frederick, MD

Post by Emilio »

Naveen,
When creating a hash file (I am a little familiar with this process): from the transformer I create the hash file, and from the hash file, can I go back to another transformer or to a file?
Also, do I need to make all my columns keys for the hash file?
Thanks :)
pnchowdary
Participant
Posts: 232
Joined: Sat May 07, 2005 2:49 pm
Location: USA

Post by pnchowdary »

Hi Emilio,

If you want to remove duplicates in each column, then you need to make all of them key columns before you write the rows to a hash file. But keep in mind that this method uses a destructive write, and only the last row wins.

After you get all the unique keys into a hash file, you can use a transformer in which you read the input and perform the lookup using the hash file generated in the previous step.

Code: Select all

 Input                  Output
---------> Transformer ------------>  
                 ^ 
                 !
                 !  Lookup
                 !
             Hash File
Thanks,
Naveen
rleishman
Premium Member
Posts: 252
Joined: Mon Sep 19, 2005 10:28 pm
Location: Melbourne, Australia
Contact:

Post by rleishman »

Emilio,

I can't believe the most interesting topic of the week was posted when I was asleep. Well now you Yanks are going home to bed and it's Australia's turn...

Here's my understanding of your requirement: given an input file with n columns, you want to produce an output file where no single value of any of the n columns is repeated.
What is not clear to me is whether the number of output rows is important. For example, would you accept the following as an alternative output for your example above?

Code: Select all

NONE, 100, 
, 200, 
,,AUCTION 
If so, then I think I have a solution. I have attached a routine below that I used for a different problem; I just wanted a generic method of removing duplicates. Here's what you do:
- Copy the code below and create a DS Routine called IsUnique with 4 parameters: TEMPDIR, HashFile, Initialize, and Val
- In your DS Job, create a stage variable for each output column with expressions of the form:

Code: Select all

If IsUnique("X:\xxx\yyy\temp", DSJOB : "<col_name>", Initialize, In_link.col) Then 
    In_link.col 
Else 
    @NULL
- Create another stage variable called Initialize BELOW the other stage variables. Set the default value to 1, and the expression to 0. This means that it will be 1 for the first row processed (causing the IsUnique hash file mechanism to be initialized) and then 0 for each subsequent row.
- Map each of the stage variables to their respective columns.
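
For instance, put together for the three-column sample from earlier in the thread (the temp path is the placeholder used above; the routine and parameter names are from the steps just given), the stage variable grid might look like this:

Code: Select all

svArmConvOpt    : If IsUnique("X:\xxx\yyy\temp", DSJOB : "ArmConvOpt", Initialize, In_link.ArmConvOpt) Then In_link.ArmConvOpt Else @NULL
svArmIndexCode  : If IsUnique("X:\xxx\yyy\temp", DSJOB : "ArmIndexCode", Initialize, In_link.ArmIndexCode) Then In_link.ArmIndexCode Else @NULL
svArmNegAmrFlag : If IsUnique("X:\xxx\yyy\temp", DSJOB : "ArmNegAmrFlag", Initialize, In_link.ArmNegAmrFlag) Then In_link.ArmNegAmrFlag Else @NULL
Initialize      : 0     (initial/default value 1)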

How does it work? For each column, a different hash file is created in a temp directory of your choosing. You pass a value into the routine (and tell it which hash file to look in) and it tells you whether the value is new (success) or has been seen before (failure). If the routine returns success, you send the column value to the output link; otherwise you send @NULL.

A couple of notes:
- I wrote this for Unix only. There is a Unix command I have highlighted with stars that you will have to translate into DOS. It is basically this:

Code: Select all

CD <TEMPDIR>
DEL <TEMPDIR>\<HashFile>
%DSHOME%\bin\mkdbfile <HashFile> 30 1 4 20 50 80 1628
although in Unix I separated the commands in such a way that if one failed the next would not run.
- I'm a UV BASIC beginner. Hopefully some bigger brains than mine will critique this routine (nicely of course :) ) so that I can improve it - listen to them. :!:
- I originally designed it without the ability to handle NULLs in the Val param. I made a change in this version that should work, but I didn't test it.
- I had to remove some references to other custom routines in our library. It should still compile, but...

Good luck.

Code: Select all


      Common /CDMUniq/ Initialized, UniqFileName, UniqFileHandle


      EQUATE RoutineName TO 'IsUnique'

      Ans = 0
      HashPath = TEMPDIR : "\" : HashFile


      * If the Initialize parameter is set, remove and rebuild the hash file
      If Initialize Then

         Initialized = 0

         * The Unix command does 3 things (and fails immediately if any of them fail):
         * : Go to the TEMP Directory
         * : Remove the existing hash file if it exists
         * : Create a new hash file with mkdbfile
         * I had to use this method, because it would fail with "unable to create ..." error if I created it with
         * full path name from another directory. Don't know why.

****************************
         StrCommand = "cd " : TEMPDIR : " && rm -rf " : HashFile : " && $DSHOME/bin/mkdbfile " : HashFile : " 30 1 4 20 50 80 1628"
****************************

         Call DSExecute("UNIX", StrCommand , OutPut, RetCode)
         if RetCode <> 0 then
            Call DSLogWarn("Unix error (" : RetCode : ") in ": StrCommand : " ::::: " : OutPut, "IsUnique")
         end

         * Attempt to open the file.
         Openpath HashPath TO UniqFileHandle
         Then
            Initialized = 1
            UniqFileName = HashPath
         End
         Else
            Call DSLogWarn("Unable to open hash file " : HashPath, "IsUnique")
            Ans = -1
         End
      End


      If Initialized Then

         * If the currently open file is a different file, then open the correct file

         If UniqFileName <> HashPath Then
            Openpath HashPath TO UniqFileHandle
            Then
               Initialized = 1
               UniqFileName = HashPath
            End
            Else
               Call DSLogWarn("Unable to open hash file " : HashPath, "IsUnique")
               Ans = -1
            End
         End



         If Ans = 0 And Not(IsNull(Val)) Then

            * Check if Val already exists
            Readu ValExists From UniqFileHandle, Val
            Then
               * Return FALSE if exists
               Ans = 0
            End
            Else
               * Return TRUE and add to file if not exists
               Writeu Val On UniqFileHandle, Val
               Then
                  Ans = 1
               End
               Else
                  Call DSLogWarn("Unable to write hash file " : HashFile, "IsUnique")
                  Ans = -1
               End
            End
         End
      End
Ross Leishman
htrisakti3
Charter Member
Posts: 36
Joined: Thu Jun 10, 2004 11:22 pm

hash may not work?

Post by htrisakti3 »

I see from your example that you need to keep "the most complete combination"?

i.e.:
NONE, 100, AUCTION -- this row is a combination of rows 1, 3, 4, 5
If you use a hash file (destructive write), I think you'll end up with
NONE, ,AUCTION

Here's what I'd do (I don't know how to do an ASCII drawing, so bear with me here):

1. use your file as input
2. transformer with hashfileA lookup
3. output to same hashfileA

In the lookup, you need to combine the input row with the hashfileA row,
i.e.:
input: NONE, 100, ''
+hash: NONE, '', AUCTION
needs to return: NONE, maxOf(100,'')=100, maxOf('',AUCTION)=AUCTION

This way, the row written to the hash file will be the most complete combination once you get to the end of the file.
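
As a rough illustration of that idea (the link names In and HashA are just examples), each output column derivation in the Transformer could keep whichever value is non-blank, preferring the incoming row:

Code: Select all

* Derivation for one column; repeat the pattern for each of the other columns
If Trim(In.ArmIndexCode) <> '' Then In.ArmIndexCode Else HashA.ArmIndexCode
Writing the same derivations back to hashfileA means that, by the last input row, the hash file row holds the fullest combination of values seen so far.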