Optimizing the performance of an Unduplicate Job

DSFreddie
Participant
Posts: 130
Joined: Wed Nov 25, 2009 2:16 pm

Post by DSFreddie »

Hi All,

I have a QS job with the following flow:

Input flows
------------
Frequency-generated file (auto partitioning) - 0.4 million records
Actual data file (already partitioned on key field A in the previous job, so SAME partitioning) - 1 million records

These 2 inputs go to an Unduplicate Stage (with 11 match passes).

The Unduplicate generates 3 files (Match/Clerical/Residual).

This job takes roughly one hour to complete.

I am trying to fine-tune this job to minimise the runtime, as there will be around 100 million records in production. Any thoughts will be much appreciated.
Also, we are running these jobs in a GRID environment.
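As a rough back-of-envelope (assuming runtime scales roughly linearly with volume at a fixed node count, which is probably optimistic for matching): 100 million records is 100x the current 1 million, so at about an hour per million we would be looking at something like 100 hours on the current configuration. That is why I am keen to cut the per-pass work as well as add nodes.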

Thanks Much,
Freddie
ray.wurlod
Participant
Posts: 54607
Joined: Wed Oct 23, 2002 10:52 pm
Location: Sydney, Australia

Post by ray.wurlod »

11 match passes is a lot of work. About the only thing I could suggest would be throwing more nodes at it. Hopefully key field A has plenty of distinct values to make that a feasible suggestion.
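One quick thing worth checking before you add nodes (a hypothetical sketch in plain Python, nothing QualityStage-specific; the column name "key_a" and the file name are my assumptions): count the distinct values and the skew of key field A. If a handful of values dominate, the busiest partition caps your speed-up no matter how many nodes you throw at it.

# Hypothetical sketch: check cardinality and skew of key field A.
# "key_a" and "actual_data.csv" are made-up names.
import csv
from collections import Counter

counts = Counter()
with open("actual_data.csv", newline="") as f:
    for row in csv.DictReader(f):
        counts[row["key_a"]] += 1

total = sum(counts.values())
top_value, top_count = counts.most_common(1)[0]
print(len(counts), "distinct keys over", total, "rows")
print("heaviest key", repr(top_value), "holds", f"{top_count / total:.1%}", "of rows")
# If one key holds, say, 10% of all rows, the node that receives it
# does that 10% of the comparisons serially, whatever the node count.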
IBM Software Services Group
Any contribution to this forum is my own opinion and does not necessarily reflect any position that IBM may hold.
stuartjvnorton
Participant
Posts: 527
Joined: Thu Apr 19, 2007 1:25 am
Location: Melbourne

Post by stuartjvnorton »

That's a bit of time, even for 11 passes.
I've had similar-complexity undups with around 15 million records take a similar time (2 nodes, non-grid).

Presumably this sort of performance is very slow for your environment?

- 400k match frequency records seems like an awful lot. When you create the match frequency file, do you list the match spec? If not, you're including a lot more fields than you need to (and getting a much bigger file).
- If you are including the spec, have you set some of the highly unique fields (TFN/SSN, etc.) as NOFREQ in the Variable Special Handling?

Between those two things, the engine might be doing a lot more work than it needs to during scoring.
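To make that concrete, here's a toy illustration in plain Python (not QS syntax; the field names are made up) of why scoping frequency generation to the match spec, and marking near-unique identifiers NOFREQ, keeps the frequency file small:

# Toy illustration, not QualityStage: build frequencies only for the fields
# the match spec actually uses, and skip NOFREQ-style unique identifiers.
import csv
from collections import Counter

MATCH_SPEC_FIELDS = ["surname", "postcode", "dob", "ssn"]  # made-up names
NOFREQ_FIELDS = {"ssn"}  # near-unique, so its frequencies add nothing

freqs = {f: Counter() for f in MATCH_SPEC_FIELDS if f not in NOFREQ_FIELDS}
with open("actual_data.csv", newline="") as f:
    for row in csv.DictReader(f):
        for field, counter in freqs.items():
            counter[row[field]] += 1

# Every (field, value) pair becomes one frequency record, so restricting
# the field list is exactly what keeps the file small.
print(sum(len(c) for c in freqs.values()), "frequency records")

Feed it every column, or a near-unique one like SSN, and the record count explodes, which is roughly what a 400k-record frequency file suggests is happening.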

I'd also be curious how much good pre-partitioning the data actually does.
If you have multiple nodes at work, and each of the 11 passes uses different blocking fields, the data is going to have to be consolidated and re-partitioned after each pass anyway.
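Here's a toy sketch of that, too (plain Python, not the parallel engine; the field names and pass list are made up). Each pass hashes on its own blocking fields, so the same row lands on a different node from pass to pass, and only the passes blocked on key A get any benefit from your input layout:

# Toy sketch, not the parallel engine: hash partitioning per match pass.
PASS_BLOCKING = [("key_a",), ("postcode", "surname"), ("dob",)]  # 3 of the 11

def node_for(row, blocking_fields, n_nodes=4):
    # Rows sharing the blocking values must land on the same node.
    return hash(tuple(row[f] for f in blocking_fields)) % n_nodes

row = {"key_a": "42", "postcode": "3000", "surname": "SMITH", "dob": "1970-01-01"}
for fields in PASS_BLOCKING:
    print(fields, "-> node", node_for(row, fields))
# Different blocking fields -> different node assignments, so the engine
# has to repartition between passes regardless of how the input arrived.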
DSFreddie
Participant
Posts: 130
Joined: Wed Nov 25, 2009 2:16 pm

Post by DSFreddie »

Thanks Ray/Stuart for your input.

To answer the questions,
Yes, we are using the Match Spec while generating the frequency file.

I haven't tried setting the unique fields as NOFREQ in the Variable Special Handling. I will try it and let you know whether it improves the performance.

This job is running on 12,4 compute nodes, and the APT_GRID_PARTITION value is set to 1.

In the Unduplicate stage, the match type is Dependent.

Thanks,
Freddie