Eliminating Duplicates in an Excel file

Post questions here relative to DataStage Server Edition for such areas as Server job design, DS Basic, Routines, Job Sequences, etc.

Moderators: chulett, rschirm, roy

Bram
Participant
Posts: 35
Joined: Mon Dec 27, 2004 11:40 am

Eliminating Duplicates in an Excel file

Post by Bram »

Hi,

I've got an Excel file which has duplicates. My target file is also an Excel file, with no duplicates in it. How can I do that? Basically, I am trying to remove duplicates from the source (Excel file) and load the result into the target (Excel file).

Also, when I click View Data on the target Sequential_File_8 stage, it gives me the following error:

Test..Sequential_File_8.DSLink9: DSD.SEQOpen Failed to open DSLink9
STATUS = -1.

Any help is greatly appreciated,
Bram
chulett
Charter Member
Posts: 43085
Joined: Tue Nov 12, 2002 4:34 pm
Location: Denver, CO

Post by chulett »

DSLink9? Sequential_File_8? You need to start renaming your default link and stage names to something more meaningful. :wink:

The name of the sequential file to write to (or read from) defaults to the link name. Right now it seems like yours wants to read from a file called DSLink9 and the error means it doesn't exist. If you haven't run your job at least once then the target file you are planning on creating does not yet exist.

Explain what you mean by 'excel file'. Do you mean literally a spreadsheet with a .xls extension that you pushed onto your UNIX server somehow? Or a .csv file created from Excel? And what are you trying to create? Yet another spreadsheet or a csv that can be automatically opened and processed by Excel? Not enough details to provide much concrete help quite yet.

And there's nothing magical about removing duplicates from this file over any other. You'd need to either sort it and check for group changes based on your sort keys or write to a hash and let the natural destructive overwrite a hash brings to the table do the work for you. There are other ways, I'm sure - those are the top two.
-craig

"You can never have too many knives" -- Logan Nine Fingers
Sunshine2323
Charter Member
Posts: 130
Joined: Mon Sep 06, 2004 3:05 am
Location: Dubai,UAE

Post by Sunshine2323 »

Hi,

There are many ways duplicates can be removed:

1) You can pass the rows through an Aggregator and group on the keys.

2) You can read from and write to the same hash file and check a constraint of IsNull(Key) in the transformer:

                  Hash File (reference lookup)
                       |
Input File ------ Transformer ------ Output File
                       |
                  Hash File (write back)

3) Sort on the key and then use the RowProcCompareWithPreviousValue transform to compare with the previous key value and decide whether each row is a duplicate or not.

Hope this helps.
Warm Regards,
Amruta Bandekar

If A equals success, then the formula is: A = X + Y + Z, X is work. Y is play. Z is keep your mouth shut.
--Albert Einstein
Sainath.Srinivasan
Participant
Posts: 3337
Joined: Mon Jan 17, 2005 4:49 am
Location: United Kingdom

Post by Sainath.Srinivasan »

As you are using DS on UNIX, how do you intend to bring your Excel file over? If you bring it as a csv, you can use sort -u (with other options as necessary) to make a unique list.
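For example, assuming the spreadsheet has been saved as a .csv and copied to the server (the file names here are only placeholders):

sort -u input.csv > unique.csv               # entire rows must match to count as duplicates
sort -t, -k1,1 -u input.csv > unique.csv     # keep one row per value of the first column

Note that the output comes back sorted, which may or may not matter for the target file.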
Prashantoncyber
Participant
Posts: 108
Joined: Wed Jul 28, 2004 7:15 am

Post by Prashantoncyber »

I think the approach mentioned by Sainath.Srinivasan is the best from a performance point of view.
Sunshine2323
Charter Member
Posts: 130
Joined: Mon Sep 06, 2004 3:05 am
Location: Dubai,UAE

Post by Sunshine2323 »

Nice way of earning points without any strong inputs. :P
Warm Regards,
Amruta Bandekar

If A equals success, then the formula is: A = X + Y + Z, X is work. Y is play. Z is keep your mouth shut.
--Albert Einstein
Prashantoncyber
Participant
Posts: 108
Joined: Wed Jul 28, 2004 7:15 am

Post by Prashantoncyber »

I leave this matter to the senior participants of this group for comments..... :shock:
Bram
Participant
Posts: 35
Joined: Mon Dec 27, 2004 11:40 am

Post by Bram »

Hi Sainath,

I am trying to remove duplicates using the sort -u option, but I don't have any idea how to do that at the UNIX level?

Thanks,
Ram
chulett
Charter Member
Posts: 43085
Joined: Tue Nov 12, 2002 4:34 pm
Location: Denver, CO

Post by chulett »

In the Filter command of the sourcing Sequential file stage. Note that doing it this way will only remove 'duplicates' where the entire row is duplicated. If that's the case here, then that is pretty simple to set up.
-craig

"You can never have too many knives" -- Logan Nine Fingers
Bram
Participant
Posts: 35
Joined: Mon Dec 27, 2004 11:40 am

Post by Bram »

Hi Chulett,

Can you please elaborate on it a bit more?

Thanks,
Ram
Sunshine2323
Charter Member
Posts: 130
Joined: Mon Sep 06, 2004 3:05 am
Location: Dubai,UAE

Post by Sunshine2323 »

Hi Ram,

If you use sort -u, then only those records in which the values for all the columns are duplicated will be considered duplicates. You can check the details of this command with man sort.

For example, if your input file looks like this:

1,2,3,4
1,1,1,1
4,4,4,4
1,1,1,1

then

sort -u InputFile

will give (note that sort also orders the output)

1,1,1,1
1,2,3,4
4,4,4,4

If this is your requirement then you can use this command in the filter command box on the Sequential File Stage.

You can enable the Filter Command box by checking the Stage uses Filter Commands check box on the General tab of the Sequential File Stage.
Warm Regards,
Amruta Bandekar

If A equals success, then the formula is: A = X + Y + Z, X is work. Y is play. Z is keep your mouth shut.
--Albert Einstein
Sainath.Srinivasan
Participant
Posts: 3337
Joined: Mon Jan 17, 2005 4:49 am
Location: United Kingdom

Post by Sainath.Srinivasan »

You can also use uniq to remove duplicates based on specified cols.
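For instance (bearing in mind that uniq only collapses adjacent lines, so the input must be sorted first, and that its -f option skips whitespace-separated fields rather than comma-separated columns; the file names are placeholders):

sort input.csv | uniq > unique.csv           # drop fully duplicated rows
sort input.csv | uniq -d > duplicates.csv    # one copy of each row that occurred more than once

For key-based de-duplication of comma-separated data, sort -t, -kN,N -u is usually the simpler route.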
alraaayeq
Participant
Posts: 35
Joined: Sun Apr 04, 2004 5:57 am
Location: Riyadh,Saudi Arabia

Post by alraaayeq »

Prashantoncyber wrote: I think the approach mentioned by Sainath.Srinivasan is the best from a performance point of view.
Did you mention performance? <--- this is my baby

Well well, we've got more than one option here, but which one gives the highest performance? <-- this is what I am looking for

I believe such performance questions need trial and error according to the record counts and file sizes, so we could share the knowledge here.

I've got a similar job (a big daddy job 8) ) that costs us 22 hours of CPU processing/utilization per day (22 minutes per one million records); then a friend of mine used some magic and it came down to 7 hours. This topic was concerning us a bit, and 4 hours is the target for us.

Can anybody help me with this?
Note: we are using hash stages to get rid of duplicated records, and these records should be logged somewhere; the duplication ratio is 2% of the XYZ million records.

:?:
vmcburney
Participant
Posts: 3593
Joined: Thu Jan 23, 2003 5:25 pm
Location: Australia, Melbourne

Post by vmcburney »

The transformer stage method with stage variables is the fastest method if you can get your input data sorted quickly. It will also let you output discarded duplicates quite easily. See the remove duplicates FAQ.
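Where the duplicates also have to be logged, as in the job above, the same sorted compare-with-previous-key pattern can be sketched in shell as a rough illustration (the file names and the first-column key are placeholders, not the actual DataStage implementation):

sort -t, -k1,1 input.csv | awk -F, '
    $1 != prev { print > "kept.csv"; prev = $1; next }   # first row seen for each key
               { print > "duplicates.csv" }              # later rows with the same key are logged
'

In the job itself that corresponds roughly to a sorted input link, a stage variable holding the previous key, and a second output link carrying the rejected duplicates.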