Page 1 of 1

How to handle ties in QualityStage Survive Stage

Posted: Wed Jul 25, 2012 3:02 pm
by gclimer
Product: Information Server 8.5 Fix Pack 1
OS: AIX 6.1.7.15 TL07

I'm trying to use the QualityStage survive stage to survive the most frequent (non-blank) gender for a person. It seems to be having trouble when there is a tie.

Here are four example records

Code: Select all

ID   ROW   DATE        NAME       GENDER
55   4     01/01/2006  JOHN DOE   M        
55   3     01/01/2008  JOHN DOE   F        
55   2     01/01/2010  JOHN DOE   F        
55   1     01/01/2012  JOHN DOE   M        
Since there is a tie, in this scenario I want it to select "M", since the most recent record is "M", but instead it chooses "F".

The records are sorted by DATE in descending order, and assigned a ROW number starting with 1, and incrementing by 1 in order of newest to oldest.

The data is hashed on ID and sorted on ID and ROW, so it's going into the Survive stage so the most recent record is the first into the Survive stage.

I've tried it with multiple runs using many different variations of these rules (in addition to others):

1. All columns - analyze ROW Equals 1
2. GENDER - Most Frequent (non-blank) survives

1. GENDER - Most Frequent (non-blank) survives
2. All columns - analyze ROW Equals 1

The weird thing is that it works for SOME of the records, but not ALL of the records. Fortunately it seems to be consistently wrong for the same people.

I'm not sure if I'm overlooking something extremely simple, or if this is an actual problem with the Survive stage.

Any tips or suggestions would be greatly appreciated.

Thanks!

Posted: Wed Jul 25, 2012 5:38 pm
by ray.wurlod
Welcome aboard. I would guess (without testing) that the tie with the record with the highest match weight would survive in this case.

Posted: Thu Jul 26, 2012 6:12 am
by gclimer
Perhaps that's where the problem lies, as I'm not using the QualityStage Match specification weights / match type in the survive stage at all.

Posted: Thu Jul 26, 2012 1:39 pm
by gclimer
I fear the Survive stage is not really designed to handle the situation I'm working with.

Due to an impending deadline, I scrapped the Survive stage and went with the more traditional DataStage stages to accomplish what I needed to do.

All in all it took 11 stages to do what I was hoping to accomplish with a single Survive stage.

I might be able to trim the fat a little bit, but since the job runs in ~7 minutes for ~7 million records, I'm happy with what I got.