Eliminating Duplicates in an Excel file
Moderators: chulett, rschirm, roy
Hi ,
I got an Excel file which has duplicates. My target file is also an Excel file, with no duplicates in it. How can I do that? Basically, I am trying to remove duplicates from the source (Excel file) and load the rest into the target (Excel file).
Also, when I click View Data on the target Sequential_File_8, it gives me the following error:
Test..Sequential_File_8.DSLink9: DSD.SEQOpen Failed to open DSLink9
STATUS = -1.
Any help is greatly appreciated,
Bram
DSLink9? Sequential_File_8? You need to start renaming your default link and stage names to something more meaningful.
The name of the sequential file to write to (or read from) defaults to the link name. Right now it seems like yours wants to read from a file called DSLink9 and the error means it doesn't exist. If you haven't run your job at least once then the target file you are planning on creating does not yet exist.
Explain what you mean by 'excel file'. Do you mean literally a spreadsheet with a .xls extension that you pushed onto your UNIX server somehow? Or a .csv file created from Excel? And what are you trying to create? Yet another spreadsheet or a csv that can be automatically opened and processed by Excel? Not enough details to provide much concrete help quite yet.
And there's nothing magical about removing duplicates from this file over any other. You'd need to either sort it and check for group changes based on your sort keys or write to a hash and let the natural destructive overwrite a hash brings to the table do the work for you. There are other ways, I'm sure - those are the top two.
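The second approach above (letting a hashed file's destructive overwrite do the work) can be sketched in plain Python rather than DataStage, using a dict in place of the hashed file; the column names are hypothetical:

```python
def dedupe_hash_style(rows, key_cols):
    """Mimic a DataStage hashed file: later rows with the same key
    destructively overwrite earlier ones, so one row per key survives."""
    seen = {}
    for row in rows:
        key = tuple(row[c] for c in key_cols)
        seen[key] = row  # duplicate key: destructive overwrite
    return list(seen.values())

rows = [
    {"id": "1", "name": "a"},
    {"id": "2", "name": "b"},
    {"id": "1", "name": "c"},  # duplicate of the first row's key
]
print(dedupe_hash_style(rows, ["id"]))
```

Note that, as with a hashed file, it is the last row for each key that wins; if you need the first row instead, check for the key before writing.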
-craig
"You can never have too many knives" -- Logan Nine Fingers
Hi,
There are many ways duplicates can be removed:
1) You can pass the rows through an Aggregator and group on the keys.
2) You can read and write the same hash file and check the constraint IsNull(Key) in the Transformer:

                       Hash File
                        ^    |
                        |    v
   Input File -----> Transformer -----> Output File

3) Sort on the key, then use the transform RowProcCompareWithPreviousValue to compare with the previous key value and decide whether the row is a duplicate or not.
Hope this helps.
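Method 2 above can be sketched in Python (not DataStage BASIC), with a set standing in for the hash-file lookup; the key column names are hypothetical:

```python
def dedupe_first_wins(rows, key_cols):
    """Analog of the hash-file IsNull(Key) constraint: a row passes only
    if its key was not found by the lookup; the key is then written back
    to the 'hash file' so later duplicates are rejected."""
    seen = set()
    out = []
    for row in rows:
        key = tuple(row[c] for c in key_cols)
        if key not in seen:   # IsNull(lookup.Key) would be true
            seen.add(key)     # write the key back for later lookups
            out.append(row)
    return out
```

Unlike the destructive-overwrite variant, this keeps the first row of each key group.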
Warm Regards,
Amruta Bandekar
<b>If A equals success, then the formula is: A = X + Y + Z, X is work. Y is play. Z is keep your mouth shut. </b>
--Albert Einstein
Hi Ram,
If you use sort -u, records are considered duplicates only when the values in all columns are duplicated. You can check the details of this command with man sort.
For example, if your input file looks like this:
1,2,3,4
1,1,1,1
4,4,4,4
1,1,1,1
then
sort -u InputFile
will give (note the output is also sorted):
1,1,1,1
1,2,3,4
4,4,4,4
If this meets your requirement, you can use this command in the Filter Command box of the Sequential File stage.
You can enable the Filter Command box by ticking the 'Stage uses Filter Commands' check box on the General tab of the Sequential File stage.
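For reference, the whole-row behaviour of sort -u can be reproduced in a couple of lines of Python, which makes the "all columns must match" point easy to verify:

```python
def sort_unique(lines):
    """Python equivalent of `sort -u`: sort the lines and keep one
    copy of each fully identical line (duplicate only if ALL columns match)."""
    return sorted(set(lines))

print(sort_unique(["1,2,3,4", "1,1,1,1", "4,4,4,4", "1,1,1,1"]))
# → ['1,1,1,1', '1,2,3,4', '4,4,4,4']
```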
Warm Regards,
Amruta Bandekar
<b>If A equals success, then the formula is: A = X + Y + Z, X is work. Y is play. Z is keep your mouth shut. </b>
--Albert Einstein
Prashantoncyber wrote: I think the approach mentioned by Sainath.Srinivasan is the best from a performance point of view.

Did you mention performance? That's my baby. Well, well, we have more than one option here, but which one has the highest performance? That is what I am looking for.
I believe such performance questions need trial and error according to the record counts and file sizes, so we can share the knowledge here.
I have a similar job (a big daddy job) that costs us 22 hours of CPU processing/utilization per day (22 minutes per one million records); then a friend of mine used some magic and it came down to 7 hours. This topic was concerning us a bit; 4 hours would be the target record for us.
Can anybody help me with this?
Note: we are using hashed file stages to get rid of duplicated records, and these records should be logged somewhere; the duplication ratio is 2% of the XYZ million records.
The transformer stage method with stage variables is the fastest method if you can get your input data sorted quickly. It will also let you output discarded duplicates quite easily. See the remove duplicates FAQ.
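The stage-variable technique amounts to: sort on the key, remember the previous row's key, and route each row by comparing the two. A minimal Python sketch (not DataStage BASIC; column names are hypothetical) that also captures the discarded duplicates:

```python
def split_sorted(rows, key_cols):
    """Sketch of the Transformer stage-variable method: input must be
    sorted on the key. The first row of each key group goes to the main
    output; subsequent rows with the same key go to a duplicates output."""
    uniques, duplicates = [], []
    prev_key = object()  # sentinel that matches no real key
    for row in rows:
        key = tuple(row[c] for c in key_cols)
        if key == prev_key:
            duplicates.append(row)  # same key as previous row: discard
        else:
            uniques.append(row)     # key changed: first of a new group
        prev_key = key
    return uniques, duplicates
```

The duplicates list corresponds to a second output link from the Transformer, which is where you would log the rejected rows.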
Certus Solutions
Blog: Tooling Around in the InfoSphere
Twitter: @vmcburney
LinkedIn:Vincent McBurney LinkedIn