Duplicates/Sorting question
Hello,
How can I get rid of duplicates in columns using DataStage?
My file looks like this:
ArmConvOpt,ArmIndexCode,ArmNegAmrFlag,ArmRoundFlag,BorrSelfEmpFlag,BorrSex,CoBorrSex,DocumentCode,DwellTypeCode,LimitDocFlag
NONE,100,,N/A,,,,,AUCTION,WHO KNOWS
NONE,100,,N/A,,,,,AUCTION,0
NONE,0,,N/A,,,,,0,0
NONE,0,,0,,,,,FAM,0
NONE,0,,0,,,,,FAM,NO INCOME/ FULL ASSET
N/A,ERROR,,0,,,,,0,XX
N/A,ERROR,,0,,,,,0,0
Needs to look like this:
ArmConvOpt,ArmIndexCode,ArmNegAmrFlag,ArmRoundFlag,BorrSelfEmpFlag,BorrSex,CoBorrSex,DocumentCode,DwellTypeCode,LimitDocFlag
NONE,100,,N/A,,,,,AUCTION,WHO KNOWS
N/A,ERROR,,,,,,,FAM,NO INCOME/ FULL ASSET
,,,,,,,,,100
,,,,,,,,,XX
The row order does not matter; I just need to get rid of the duplicates in each column.
Your help is much appreciated!!
Thanks
Hi Emilio,
I didn't exactly follow your sample data. You can do a search on the forum to find lots of info on this topic.
Couple of options:
1) Pass through a hash file taking advantage of its destructive overwrite behavior (last in wins)
2) Use the aggregator stage
If the data is sorted by key:
3) Use stage variables to test for a key change combined with output constraints
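For option 3, here's a minimal sketch of the Transformer stage variables (illustrative only; it assumes the input is sorted on a single key column, and the names svIsNewKey/svPrevKey are mine):
Code: Select all
Stage variables (evaluated top to bottom on every row):
    svIsNewKey : If In.KeyCol <> svPrevKey Then 1 Else 0
    svPrevKey  : In.KeyCol

Output link constraint (passes only the first row of each key group):
    svIsNewKey = 1
Give svPrevKey an initial value that cannot occur in the data so the first row is treated as a new key.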
If the data volume is very large, you will probably want to feed an aggregator stage with sorted data.
I personally like option 1.
Mike
Hi Emilio,
There are several methods to remove the duplicates. Below are some sample approaches:
1) You can define all the columns that you want to remove duplicates from as key columns and write them into a hash file.
2) You can use the aggregator and the count > 1 to filter duplicates
3) You can also use stage variables to achieve this
This topic has been discussed several times in this forum. Please use the search facility for more details.
Thanks,
Naveen
Emilio,
Are you looking to remove duplicates on just the first column? That is what your sample seems to be doing. In that case, if your data is sorted, you can do it with a stage variable. The prerequisite for doing this in a normal stage is that the data is sorted by the columns you want to use as your criteria.
Thank you for your replies.
I have some questions:
Naveen:
1) I was trying the aggregator to do the count >1. Where do you enter the >1 criteria in the aggregator?
2) How do you use the stage variables? Please explain this a little more.
Mike:
If this does not work I would like to try the hash file method... Do you create the hash file and bring it back, and that eliminates the duplicates? Wow!!... that would be great.
Arnd:
I want to remove duplicates in all the columns. The order afterwards does not matter.
Thanks
Emilio,
I don't think I quite understand your requirement. Are you saying that all of the columns comprise the key? Or are you saying that you want to remove duplicates from each column independently? In the latter case, what criteria would be used to create rows from all of the unique sets of column values?
A hash file only has one row for each key value. Whenever you insert a row with a key value that already exists, the existing row is overwritten with the new row.
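To illustrate the destructive overwrite with some made-up two-column data (key on the first column):
Code: Select all
Rows written to the hash file        Hash file contents afterwards
(in this order, key = Col1):         (one row per key, last in wins):
A,1
A,2                                  A,3
A,3
B,9                                  B,9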
Can you supply a more simplified example? For instance, an example with 3 columns and 3 rows.
Mike
Hi Emilio,
In the aggregator, what I meant by "You can use the aggregator and the count > 1" is the following:
For each key column you need to define a dummy column, group on the key column, and use the count aggregate function to populate the dummy column. After you do that, write the result to a sequential file, filter the records with the constraint count > 1 in a transformer, store those keys in a hash file, and use them against the original input to eliminate the duplicates.
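As a rough sketch of that flow (the stage, link, and column names here are just placeholders):
Code: Select all
Aggregator (one per key column):
    KeyCol    -> Group By
    DupCount  -> derivation: Count(KeyCol)

Transformer constraint on the link feeding the hash file of duplicated keys:
    AggLink.DupCount > 1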
Achieving the same thing with stage variables is kind of messy, especially since you want to do it on all columns.
The easiest option for you is the hash file option.
Thanks,
Naveen
Hi Emilio,
If you want to remove duplicates in each column, then you need to make all of them key columns before you write them to a hash file. But keep in mind that this method follows a destructive write and only the last row wins.
After you get all the unique keys into a hash file, you can use a transformer in which you read the input and perform the lookup using the hash file generated in the previous step.
Code: Select all
  Input                            Output
----------->  Transformer  ------------->
                   ^
                   !
                   !  Lookup
                   !
               Hash File
Thanks,
Naveen
Emilio,
I can't believe the most interesting topic of the week was posted when I was asleep. Well now you Yanks are going home to bed and it's Australia's turn...
Here's my understanding of your requirement: given an input file with n columns, you want to produce an output file where no single value of any of the n columns is repeated.
What is not clear to me is whether the number of output rows is important. For example, would you accept the following as an alternative output for your example above?
Code: Select all
NONE, 100,
, 200,
,,AUCTION
If so, then I think I have a solution. I have attached a routine below that I used for a different problem; I just wanted a generic method of removing duplicates. Here's what you do:
- Copy the code below and create a DS Routine called IsUnique with 4 parameters: TEMPDIR, HashFile, Initialize, and Val
- In your DS Job, create a stage variable for each output column with expressions of the form:
Code: Select all
If IsUnique("X:\xxx\yyy\temp", DSJOB : "<col_name>", Initialize, In_link.col) Then
In_link.col
Else
@NULL
- Create another stage variable called Initialize BELOW the other stage variables. Set the default value to 1, and the expression to 0. This means that it will be 1 for the first row processed (causing the IsUnique hash file mechanism to be initialized) and then 0 for each subsequent row.
- Map each of the stage variables to their respective columns.
How does it work? For each column, a different hash file is created in a temp directory of your choosing. You pass a value into the routine (and tell it which hash file to look in) and it tells you whether the value is new (success) or has already been seen (failure). If the routine returns success, then you send the column value to the output link; otherwise send @NULL.
A couple of notes:
- I wrote this for Unix only. There is a Unix command I have highlighted with stars that you will have to translate into DOS. It is basically this:
Code: Select all
CD <TEMPDIR>
DEL <TEMPDIR>\<HashFile>
%DSHOME%\bin\mkdbfile <HashFile> 30 1 4 20 50 80 1628
although in Unix I separated the commands in such a way that if one failed the next would not run.
- I'm a UV BASIC beginner. Hopefully some bigger brains than mine will critique this routine (nicely, of course) so that I can improve it - listen to them.
- I originally designed it without the ability to handle NULLs in the Val param. I made a change in this version that should work, but I didn't test it.
- I had to remove some references to other custom routines in our library. It should still compile, but...
Good luck.
Code: Select all
Common /CDMUniq/ Initialized, UniqFileName, UniqFileHandle
EQUATE RoutineName TO 'IsUnique'

Ans = 0
* Routine is written for Unix, so use a "/" path separator
HashPath = TEMPDIR : "/" : HashFile

* If the Initialize parameter is set, remove and rebuild the hash file
If Initialize Then
   Initialized = 0

   * The Unix command does 3 things (and fails immediately if any of them fail):
   *    : Go to the TEMP directory
   *    : Remove the existing hash file if it exists
   *    : Create a new hash file with mkdbfile
   * I had to use this method because it would fail with an "unable to create ..." error
   * if I created it with the full path name from another directory. Don't know why.
   ****************************
   StrCommand = "cd " : TEMPDIR : " && rm -rf " : HashFile : " && $DSHOME/bin/mkdbfile " : HashFile : " 30 1 4 20 50 80 1628"
   ****************************
   Call DSExecute("UNIX", StrCommand, OutPut, RetCode)
   If RetCode <> 0 Then
      Call DSLogWarn("Unix error (" : RetCode : ") in " : StrCommand : " ::::: " : OutPut, RoutineName)
   End

   * Attempt to open the newly created hash file
   Openpath HashPath To UniqFileHandle Then
      Initialized = 1
      UniqFileName = HashPath
   End Else
      Call DSLogWarn("Unable to open hash file " : HashPath, RoutineName)
      Ans = -1
   End
End

If Initialized Then
   * If the currently open file is a different file, then open the correct file
   If UniqFileName <> HashPath Then
      Openpath HashPath To UniqFileHandle Then
         Initialized = 1
         UniqFileName = HashPath
      End Else
         Call DSLogWarn("Unable to open hash file " : HashPath, RoutineName)
         Ans = -1
      End
   End

   If Ans = 0 And Not(IsNull(Val)) Then
      * Check whether Val has already been seen for this column
      Readu ValExists From UniqFileHandle, Val Then
         * Value already exists: return FALSE
         Ans = 0
      End Else
         * Value not seen before: record it and return TRUE
         Writeu Val On UniqFileHandle, Val Then
            Ans = 1
         End Else
            Call DSLogWarn("Unable to write hash file " : HashFile, RoutineName)
            Ans = -1
         End
      End
   End
End
Ross Leishman
hash may not work?
I saw from your example that you need to keep "the most complete combination"??
i.e.:
NONE, 100, AUCTION -- this row is a combination of rows 1, 3, 4, 5
If you use a hash file (destructive write), I think you'll end up with
NONE, ,AUCTION
Here's what I'd do:
(I don't know how to make a drawing in ASCII, so bear with me here)
1. Use your file as input
2. Transformer with a hashfileA lookup
3. Output to the same hashfileA
In the lookup, you need to combine the input file & hashfileA,
i.e.:
input: NONE, 100, ''
+hash: NONE, '', AUCTION
needs to return: NONE, maxOf(100,'')=100, maxOf('',AUCTION)=AUCTION
This way, the row written to the hash file will be the most complete combination once you get to the end of the file.
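As a minimal sketch (column names are illustrative), the derivation for each non-key column on the link writing back to hashfileA could look like this; the IsNull test covers the first time a key is seen, when the lookup fails:
Code: Select all
If IsNull(Hash.Col2) Or In.Col2 > Hash.Col2 Then In.Col2 Else Hash.Col2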