Removing duplicates from 20 million records

Post questions here relative to DataStage Server Edition for such areas as Server job design, DS Basic, Routines, Job Sequences, etc.

Moderators: chulett, rschirm, roy

pavan31081980
Participant
Posts: 17
Joined: Sun Mar 19, 2006 5:46 am
Location: vja

Post by pavan31081980 »

Sort the data in unix and then try loading it in DataStage. It should work, as 20 million records are a minimal volume once the data is sorted in unix.
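On a unix-like system (or with unix tools installed on Windows), the dedup can be done before DataStage ever sees the file. A minimal sketch, assuming pipe-delimited data whose first field is the duplicate key (the sample data and file names are hypothetical):

```shell
# Hypothetical sample: three pipe-delimited records, one duplicate key.
printf 'k1|a\nk2|b\nk1|a\n' > input.txt

# Sort on the key field and drop duplicate keys in one pass:
# -t sets the field delimiter, -k1,1 restricts the comparison
# to field 1 only, and -u keeps one record per key.
sort -t '|' -k1,1 -u input.txt > deduped.txt
```

With real 10-million-row files you would point `sort` at a filesystem with enough temporary space (the `-T` option) rather than at toy data.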
balajisr
Charter Member
Posts: 785
Joined: Thu Jul 28, 2005 8:58 am

Post by balajisr »

pavan31081980 wrote:Sort the data in unix and then try loading it in DataStage. It should work, as 20 million records are a minimal volume once the data is sorted in unix.
keerthi is in a Windows environment, not unix.

It is important to specify the environment while posting.
m_keerthi2005
Participant
Posts: 22
Joined: Thu Jun 02, 2005 5:12 am

Post by m_keerthi2005 »

I have sorted the data before doing the aggregation. The sorting completes successfully, but the group by on the key columns in the Aggregator is failing at the 2 GB memory limit.

Can anybody help with this?
ArndW
Participant
Posts: 16318
Joined: Tue Nov 16, 2004 9:08 am
Location: Germany
Contact:

Post by ArndW »

You have neglected to tell the aggregator stage that your incoming data is already sorted. Once you do so you will see the speed pick up and the stage will use almost no memory at all.
balajisr
Charter Member
Posts: 785
Joined: Thu Jul 28, 2005 8:58 am

Post by balajisr »

Go to the Input -> Columns tab of the Aggregator stage. Locate the Sort column and specify the sort order of the already sorted incoming data, e.g. 1, 2 or 3.
m_keerthi2005
Participant
Posts: 22
Joined: Thu Jun 02, 2005 5:12 am

Post by m_keerthi2005 »

Thanks folks for helping on this issue.

One thing I observed is that the performance of the Sort stage is slow. Is there any way to improve it?
balajisr
Charter Member
Posts: 785
Joined: Thu Jul 28, 2005 8:58 am

Post by balajisr »

m_keerthi2005 wrote:Thanks folks for helping on this issue.

One thing I observed is that the performance of the Sort stage is slow. Is there any way to improve it?
Read about "Max Rows in Virtual Memory" and "Max Open Files" in the manual and experiment with those settings.
ramdev_srh
Participant
Posts: 16
Joined: Mon Jul 24, 2006 9:27 am

Re: Removing duplicates from 20 million records

Post by ramdev_srh »

m_keerthi2005 wrote:Hi all,

We are facing a problem removing duplicates. We have 2 files, each with 10 million records. When we remove duplicates using the Aggregator stage on 3 key columns, we hit the Aggregator memory limitation: the job aborts once memory reaches 2 GB, i.e. after 15 lakh (1.5 million) records the job is rejected.

Could you please suggest an approach to resolve this issue?

Thanks in advance.
Hi,
If the Aggregator is necessary, sort the data first and then do the processing. Otherwise, use unix scripts to remove the duplicates, or better still go for a hashed file: if you pass the rows through a Hashed File stage you can remove duplicates as well as increase performance.
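The hashed-file idea can also be mimicked outside DataStage: awk keyed on the duplicate columns keeps one row per key with no pre-sort, much as writes to a hashed file collapse onto the key. A sketch, assuming pipe-delimited data with a two-field key (sample data and file names are hypothetical; note that a hashed file keeps the last row written per key, whereas this keeps the first seen):

```shell
# Hypothetical sample: key is fields 1 and 2, one duplicate key.
printf 'k1|x|r1\nk2|y|r2\nk1|x|r3\n' > input.txt

# Print a row only the first time its key is seen.
# seen[] is an in-memory associative array, so no sort is needed,
# but memory grows with the number of distinct keys.
awk -F'|' '!seen[$1 FS $2]++' input.txt > deduped.txt
```

For 20 million rows with few distinct keys this is fast; if most rows are unique, the sort-based approaches use less memory.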
chulett
Charter Member
Posts: 43085
Joined: Tue Nov 12, 2002 4:34 pm
Location: Denver, CO

Post by chulett »

:? Why come here to a post with two pages of discussion, only read the first post and then basically say the same thing that's already been discussed in those two pages? What value does that add?
-craig

"You can never have too many knives" -- Logan Nine Fingers
aartlett
Charter Member
Posts: 152
Joined: Fri Apr 23, 2004 6:44 pm
Location: Australia

Post by aartlett »

m_keerthi2005 wrote:Thanks folks for helping on this issue.

One thing I observed is that the performance of the Sort stage is slow. Is there any way to improve it?
External sort. On Windows, get hold of the unix sort command from SourceForge, where the unixutils are. These are great for augmenting what Windows can do.

Here you can run a sort -u, a uniq, or any of the other methods described in this post.
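One caveat worth spelling out: uniq only collapses adjacent duplicate lines, so the input must be sorted first; sort -u does both steps in one command. A small sketch with hypothetical file names:

```shell
# Three lines; the duplicates are not adjacent.
printf 'b\na\nb\n' > input.txt

# uniq alone would leave both 'b' lines; sorting first makes
# duplicates adjacent so uniq can remove them.
sort input.txt | uniq > deduped.txt

# Equivalent single command:
# sort -u input.txt > deduped.txt
```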

A note: next time, indicate you are on Windows at the beginning, not halfway through the first page.
Andrew

Think outside the Datastage you work in.

There is no True Way, but there are true ways.