Removing duplicates from 20 million records

Post questions here relative to DataStage Server Edition for such areas as Server job design, DS Basic, Routines, Job Sequences, etc.

Moderators: chulett, rschirm, roy

pavan31081980
Participant
Posts: 17
Joined: Sun Mar 19, 2006 5:46 am
Location: vja

Post by pavan31081980 »

Sort the data in unix and then try loading it in DataStage. It should work, as 20 million records are a minimal volume once the data is sorted in unix.
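On a unix-like system (or with unix tools installed on Windows), the dedup can be done before DataStage ever sees the file. A minimal sketch, assuming pipe-delimited data whose first field is the duplicate key (the sample data and file names are hypothetical):

```shell
# Hypothetical sample: three pipe-delimited records, one duplicate key.
printf 'k1|a\nk2|b\nk1|a\n' > input.txt

# Sort on the key field and drop duplicate keys in one pass:
# -t sets the field delimiter, -k1,1 restricts the comparison
# to field 1 only, and -u keeps one record per key.
sort -t '|' -k1,1 -u input.txt > deduped.txt
```

With real 10-million-row files you would point `sort` at a filesystem with enough temporary space (the `-T` option) rather than at toy data.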
balajisr
Charter Member
Posts: 785
Joined: Thu Jul 28, 2005 8:58 am

Post by balajisr »

pavan31081980 wrote:Sort the data in unix and then try loading it in DataStage. It should work, as 20 million records are a minimal volume once the data is sorted in unix.
keerthi is in a Windows environment, not unix.

It is important to specify the environment while posting.
m_keerthi2005
Participant
Posts: 22
Joined: Thu Jun 02, 2005 5:12 am

Post by m_keerthi2005 »

I have sorted the data before doing the aggregation. The sorting completes successfully, but the group by on the key columns in the Aggregator is failing at the 2 GB memory limit.

Can anybody help with this?
ArndW
Participant
Posts: 16318
Joined: Tue Nov 16, 2004 9:08 am
Location: Germany
Contact:

Post by ArndW »

You have neglected to tell the aggregator stage that your incoming data is already sorted. Once you do so you will see the speed pick up and the stage will use almost no memory at all.
balajisr
Charter Member
Posts: 785
Joined: Thu Jul 28, 2005 8:58 am

Post by balajisr »

Go to the Input -> Columns tab of the Aggregator stage. Locate the Sort column and specify the sort order of the already sorted incoming data, e.g. 1, 2 or 3.
m_keerthi2005
Participant
Posts: 22
Joined: Thu Jun 02, 2005 5:12 am

Post by m_keerthi2005 »

Thanks folks for helping on this issue.

One thing I observed is that the performance of the Sort stage is slow. Is there any way to improve it?
balajisr
Charter Member
Posts: 785
Joined: Thu Jul 28, 2005 8:58 am

Post by balajisr »

m_keerthi2005 wrote:Thanks folks for helping on this issue.

One thing I observed is that the performance of the Sort stage is slow. Is there any way to improve it?
Read about "Max Rows in Virtual Memory" and "Max Open Files" in the manual and experiment with those settings.
ramdev_srh
Participant
Posts: 16
Joined: Mon Jul 24, 2006 9:27 am

Re: Removing duplicates from 20 million records

Post by ramdev_srh »

m_keerthi2005 wrote:Hi all,

We are facing a problem removing duplicates. We have 2 files, each with 10 million records. When we remove duplicates using the Aggregator stage on 3 key columns, we hit the Aggregator memory limitation: the job aborts once memory reaches 2 GB, i.e. after 15 lakh (1.5 million) records the job is rejected.

Could you please suggest an approach to resolve this issue?

Thanks in advance.
Hi,
If the Aggregator is necessary, sort the data first and then do the processing. Otherwise, use unix scripts to remove the duplicates, or better still go for a hashed file: if you pass the rows through a Hashed File stage you can remove duplicates as well as increase performance.
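The hashed-file idea can also be mimicked outside DataStage: awk keyed on the duplicate columns keeps one row per key with no pre-sort, much as writes to a hashed file collapse onto the key. A sketch, assuming pipe-delimited data with a two-field key (sample data and file names are hypothetical; note that a hashed file keeps the last row written per key, whereas this keeps the first seen):

```shell
# Hypothetical sample: key is fields 1 and 2, one duplicate key.
printf 'k1|x|r1\nk2|y|r2\nk1|x|r3\n' > input.txt

# Print a row only the first time its key is seen.
# seen[] is an in-memory associative array, so no sort is needed,
# but memory grows with the number of distinct keys.
awk -F'|' '!seen[$1 FS $2]++' input.txt > deduped.txt
```

For 20 million rows with few distinct keys this is fast; if most rows are unique, the sort-based approaches use less memory.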
chulett
Charter Member
Posts: 43085
Joined: Tue Nov 12, 2002 4:34 pm
Location: Denver, CO

Post by chulett »

:? Why come here to a post with two pages of discussion, only read the first post and then basically say the same thing that's already been discussed in those two pages? What value does that add?
-craig

"You can never have too many knives" -- Logan Nine Fingers
aartlett
Charter Member
Posts: 152
Joined: Fri Apr 23, 2004 6:44 pm
Location: Australia

Post by aartlett »

m_keerthi2005 wrote:Thanks folks for helping on this issue.

One thing I observed is that the performance of the Sort stage is slow. Is there any way to improve it?
External sort. On Windows, get hold of the unix sort command from SourceForge, where the unixutils are. These are great for augmenting what Windows can do.

Here you can run a sort -u, a uniq, or any of the other methods described in this post.
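One caveat worth spelling out: uniq only collapses adjacent duplicate lines, so the input must be sorted first; sort -u does both steps in one command. A small sketch with hypothetical file names:

```shell
# Three lines; the duplicates are not adjacent.
printf 'b\na\nb\n' > input.txt

# uniq alone would leave both 'b' lines; sorting first makes
# duplicates adjacent so uniq can remove them.
sort input.txt | uniq > deduped.txt

# Equivalent single command:
# sort -u input.txt > deduped.txt
```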

A note: next time, indicate you are on Windows at the beginning, not halfway through the first page.
Andrew

Think outside the Datastage you work in.

There is no True Way, but there are true ways.