Eliminating Duplicates in an Excel file
Moderators: chulett, rschirm, roy
Hi ,
I got an Excel file which has duplicates. My target file is also an Excel file, with no duplicates in it. How can I do that? Basically, I am trying to remove duplicates from the source (Excel file) and load the rest into the target (Excel file).
Also, when I click View Data on the target Sequential_File_8, it gives me the following error:
Test..Sequential_File_8.DSLink9: DSD.SEQOpen Failed to open DSLink9
STATUS = -1.
Any help is greatly appreciated,
Bram
DSLink9? Sequential_File_8? You need to start renaming your default link and stage names to something more meaningful.
The name of the sequential file to write to (or read from) defaults to the link name. Right now it seems like yours wants to read from a file called DSLink9 and the error means it doesn't exist. If you haven't run your job at least once then the target file you are planning on creating does not yet exist.
Explain what you mean by 'excel file'. Do you mean literally a spreadsheet with a .xls extension that you pushed onto your UNIX server somehow? Or a .csv file created from Excel? And what are you trying to create? Yet another spreadsheet or a csv that can be automatically opened and processed by Excel? Not enough details to provide much concrete help quite yet.
And there's nothing magical about removing duplicates from this file over any other. You'd need to either sort it and check for group changes based on your sort keys or write to a hash and let the natural destructive overwrite a hash brings to the table do the work for you. There are other ways, I'm sure - those are the top two.
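The second approach above (letting a hashed file's destructive overwrite do the work) can be sketched in plain Python rather than DataStage, using a dict in place of the hashed file; the column names are hypothetical:

```python
def dedupe_hash_style(rows, key_cols):
    """Mimic a DataStage hashed file: later rows with the same key
    destructively overwrite earlier ones, so one row per key survives."""
    seen = {}
    for row in rows:
        key = tuple(row[c] for c in key_cols)
        seen[key] = row  # duplicate key: destructive overwrite
    return list(seen.values())

rows = [
    {"id": "1", "name": "a"},
    {"id": "2", "name": "b"},
    {"id": "1", "name": "c"},  # duplicate of the first row's key
]
print(dedupe_hash_style(rows, ["id"]))
```

Note that, as with a hashed file, it is the last row for each key that wins; if you need the first row instead, check for the key before writing.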
-craig
"You can never have too many knives" -- Logan Nine Fingers
Hi,
There are many ways duplicates can be removed:
1) You can pass the rows through an Aggregator and group on the keys.
2) You can read and write the same hash file and check the constraint IsNull(Key) in the Transformer:

                       Hash File
                        ^    |
                        |    v
   Input File -----> Transformer -----> Output File

3) Sort on the key, then use the transform RowProcCompareWithPreviousValue to compare with the previous key value and decide whether the row is a duplicate or not.
Hope this helps.
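Method 2 above can be sketched in Python (not DataStage BASIC), with a set standing in for the hash-file lookup; the key column names are hypothetical:

```python
def dedupe_first_wins(rows, key_cols):
    """Analog of the hash-file IsNull(Key) constraint: a row passes only
    if its key was not found by the lookup; the key is then written back
    to the 'hash file' so later duplicates are rejected."""
    seen = set()
    out = []
    for row in rows:
        key = tuple(row[c] for c in key_cols)
        if key not in seen:   # IsNull(lookup.Key) would be true
            seen.add(key)     # write the key back for later lookups
            out.append(row)
    return out
```

Unlike the destructive-overwrite variant, this keeps the first row of each key group.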
Warm Regards,
Amruta Bandekar
<b>If A equals success, then the formula is: A = X + Y + Z, X is work. Y is play. Z is keep your mouth shut. </b>
--Albert Einstein
Hi Ram,
If you use sort -u, records are considered duplicates only when the values in all columns are duplicated. You can check the details of this command with man sort.
For example, if your input file looks like this:
1,2,3,4
1,1,1,1
4,4,4,4
1,1,1,1
then
sort -u InputFile
will give (note the output is also sorted):
1,1,1,1
1,2,3,4
4,4,4,4
If this meets your requirement, you can use this command in the Filter Command box of the Sequential File stage.
You can enable the Filter Command box by ticking the 'Stage uses Filter Commands' check box on the General tab of the Sequential File stage.
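For reference, the whole-row behaviour of sort -u can be reproduced in a couple of lines of Python, which makes the "all columns must match" point easy to verify:

```python
def sort_unique(lines):
    """Python equivalent of `sort -u`: sort the lines and keep one
    copy of each fully identical line (duplicate only if ALL columns match)."""
    return sorted(set(lines))

print(sort_unique(["1,2,3,4", "1,1,1,1", "4,4,4,4", "1,1,1,1"]))
# → ['1,1,1,1', '1,2,3,4', '4,4,4,4']
```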
Warm Regards,
Amruta Bandekar
<b>If A equals success, then the formula is: A = X + Y + Z, X is work. Y is play. Z is keep your mouth shut. </b>
--Albert Einstein
Prashantoncyber wrote: I think the approach mentioned by Sainath.Srinivasan is the best from a performance point of view.

Did you mention performance? That's my baby. Well, well, we have more than one option here, but which one has the highest performance? That is what I am looking for.
I believe such performance questions need trial and error according to the record counts and file sizes, so we can share the knowledge here.
I have a similar job (a big daddy job) that costs us 22 hours of CPU processing/utilization per day (22 minutes per one million records); then a friend of mine used some magic and it came down to 7 hours. This topic was concerning us a bit; 4 hours would be the target record for us.
Can anybody help me with this?
Note: we are using hashed file stages to get rid of duplicated records, and these records should be logged somewhere; the duplication ratio is 2% of the XYZ million records.
The transformer stage method with stage variables is the fastest method if you can get your input data sorted quickly. It will also let you output discarded duplicates quite easily. See the remove duplicates FAQ.
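The stage-variable technique amounts to: sort on the key, remember the previous row's key, and route each row by comparing the two. A minimal Python sketch (not DataStage BASIC; column names are hypothetical) that also captures the discarded duplicates:

```python
def split_sorted(rows, key_cols):
    """Sketch of the Transformer stage-variable method: input must be
    sorted on the key. The first row of each key group goes to the main
    output; subsequent rows with the same key go to a duplicates output."""
    uniques, duplicates = [], []
    prev_key = object()  # sentinel that matches no real key
    for row in rows:
        key = tuple(row[c] for c in key_cols)
        if key == prev_key:
            duplicates.append(row)  # same key as previous row: discard
        else:
            uniques.append(row)     # key changed: first of a new group
        prev_key = key
    return uniques, duplicates
```

The duplicates list corresponds to a second output link from the Transformer, which is where you would log the rejected rows.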
Certus Solutions
Blog: Tooling Around in the InfoSphere
Twitter: @vmcburney
LinkedIn:Vincent McBurney LinkedIn