Optimizing the performance of an Unduplicate Job

DSFreddie
Participant
Posts: 130
Joined: Wed Nov 25, 2009 2:16 pm

Post by DSFreddie »

Hi All,

I have a QS job with the following flow:

Input flows
------------
Frequency-generated file (auto partitioning) - 0.4 million records
Actual data file (already partitioned on key field A in the previous job, so SAME partitioning) - 1 million records

These 2 inputs go to an Unduplicate Stage (with 11 match passes).

The Unduplicate generates 3 files (Match/Clerical/Residual).

This job takes roughly one hour to complete.

I am trying to fine-tune this job to minimise the runtime, as there will be around 100 million records in production. Any thoughts will be much appreciated.
Also, we are running these jobs in a GRID environment.
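As a rough back-of-envelope (assuming runtime scales roughly linearly with volume at a fixed node count, which is probably optimistic for matching): 100 million records is 100x the current 1 million, so at about an hour per million we would be looking at something like 100 hours on the current configuration. That is why I am keen to cut the per-pass work as well as add nodes.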

Thanks Much,
Freddie
ray.wurlod
Participant
Posts: 54607
Joined: Wed Oct 23, 2002 10:52 pm
Location: Sydney, Australia

Post by ray.wurlod »

11 match passes is a lot of work. About the only thing I could suggest would be throwing more nodes at it. Hopefully key field A has plenty of distinct values to make that a feasible suggestion.
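One quick thing worth checking before you add nodes (a hypothetical sketch in plain Python, nothing QualityStage-specific; the column name "key_a" and the file name are my assumptions): count the distinct values and the skew of key field A. If a handful of values dominate, the busiest partition caps your speed-up no matter how many nodes you throw at it.

# Hypothetical sketch: check cardinality and skew of key field A.
# "key_a" and "actual_data.csv" are made-up names.
import csv
from collections import Counter

counts = Counter()
with open("actual_data.csv", newline="") as f:
    for row in csv.DictReader(f):
        counts[row["key_a"]] += 1

total = sum(counts.values())
top_value, top_count = counts.most_common(1)[0]
print(len(counts), "distinct keys over", total, "rows")
print("heaviest key", repr(top_value), "holds", f"{top_count / total:.1%}", "of rows")
# If one key holds, say, 10% of all rows, the node that receives it
# does that 10% of the comparisons serially, whatever the node count.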
IBM Software Services Group
Any contribution to this forum is my own opinion and does not necessarily reflect any position that IBM may hold.
stuartjvnorton
Participant
Posts: 527
Joined: Thu Apr 19, 2007 1:25 am
Location: Melbourne

Post by stuartjvnorton »

That's a bit of time, even for 11 passes.
I've had similar-complexity undups with around 15 million records take a similar time (2 nodes, non-grid).

Presumably this sort of performance is very slow for your environment?

- 400k match frequency records seems like an awful lot. When you create the match frequency file, do you list the match spec? If not, you're including a lot more fields than you need to (and getting a much bigger file).
- If you are including the spec, have you set some of the highly unique fields (TFN/SSN, etc.) as NOFREQ in the Variable Special Handling?

Between those two things, the engine might be doing a lot more work than it needs to during scoring.
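To make that concrete, here's a toy illustration in plain Python (not QS syntax; the field names are made up) of why scoping frequency generation to the match spec, and marking near-unique identifiers NOFREQ, keeps the frequency file small:

# Toy illustration, not QualityStage: build frequencies only for the fields
# the match spec actually uses, and skip NOFREQ-style unique identifiers.
import csv
from collections import Counter

MATCH_SPEC_FIELDS = ["surname", "postcode", "dob", "ssn"]  # made-up names
NOFREQ_FIELDS = {"ssn"}  # near-unique, so its frequencies add nothing

freqs = {f: Counter() for f in MATCH_SPEC_FIELDS if f not in NOFREQ_FIELDS}
with open("actual_data.csv", newline="") as f:
    for row in csv.DictReader(f):
        for field, counter in freqs.items():
            counter[row[field]] += 1

# Every (field, value) pair becomes one frequency record, so restricting
# the field list is exactly what keeps the file small.
print(sum(len(c) for c in freqs.values()), "frequency records")

Feed it every column, or a near-unique one like SSN, and the record count explodes, which is roughly what a 400k-record frequency file suggests is happening.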

I'd also be curious how much good pre-partitioning the data actually does.
If you have multiple nodes at work, and each of the 11 passes uses different blocking fields, the data is going to have to be consolidated and re-partitioned after each pass anyway.
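Here's a toy sketch of that, too (plain Python, not the parallel engine; the field names and pass list are made up). Each pass hashes on its own blocking fields, so the same row lands on a different node from pass to pass, and only the passes blocked on key A get any benefit from your input layout:

# Toy sketch, not the parallel engine: hash partitioning per match pass.
PASS_BLOCKING = [("key_a",), ("postcode", "surname"), ("dob",)]  # 3 of the 11

def node_for(row, blocking_fields, n_nodes=4):
    # Rows sharing the blocking values must land on the same node.
    return hash(tuple(row[f] for f in blocking_fields)) % n_nodes

row = {"key_a": "42", "postcode": "3000", "surname": "SMITH", "dob": "1970-01-01"}
for fields in PASS_BLOCKING:
    print(fields, "-> node", node_for(row, fields))
# Different blocking fields -> different node assignments, so the engine
# has to repartition between passes regardless of how the input arrived.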
DSFreddie
Participant
Posts: 130
Joined: Wed Nov 25, 2009 2:16 pm

Post by DSFreddie »

Thanks Ray/Stuart for your input.

To answer the questions,
Yes, we are using the Match Spec while generating the frequency file.

I haven't tried setting the unique fields as NOFREQ in the Variable Special Handling. I will try it and let you know whether it improves the performance.

This job is running on 12,4 compute nodes, and the APT_GRID_PARTITION value is set to 1.

In the Unduplicate stage, the match type is Dependent.

Thanks,
Freddie