How to handle ties in QualityStage Survive Stage
Posted: Wed Jul 25, 2012 3:02 pm
Product: Information Server 8.5 Fix Pack 1
OS: AIX 6.1.7.15 TL07
I'm trying to use the QualityStage survive stage to survive the most frequent (non-blank) gender for a person. It seems to be having trouble when there is a tie.
Here are four example records
Since there is a tie, in this scenario I want it to select "M", since the most recent record is "M", but instead it chooses "F".
The records are sorted by DATE in descending order, and assigned a ROW number starting with 1, and incrementing by 1 in order of newest to oldest.
The data is hashed on ID and sorted on ID and ROW, so it's going into the Survive stage so the most recent record is the first into the Survive stage.
I've tried it with multiple runs using many different variations of these rules (in addition to others):
1. All columns - analyze ROW Equals 1
2. GENDER - Most Frequent (non-blank) survives
1. GENDER - Most Frequent (non-blank) survives
2. All columns - analyze ROW Equals 1
The weird thing is that it works for SOME of the records, but not ALL of the records. Fortunately it seems to be consistently wrong for the same people.
I'm not sure if I'm overlooking something extremely simple, or if this is an actual problem with the Survive stage.
Any tips or suggestions would be greatly appreciated.
Thanks!
OS: AIX 6.1.7.15 TL07
I'm trying to use the QualityStage survive stage to survive the most frequent (non-blank) gender for a person. It seems to be having trouble when there is a tie.
Here are four example records
Code: Select all
ID ROW DATE NAME GENDER
55 4 01/01/2006 JOHN DOE M
55 3 01/01/2008 JOHN DOE F
55 2 01/01/2010 JOHN DOE F
55 1 01/01/2012 JOHN DOE M The records are sorted by DATE in descending order, and assigned a ROW number starting with 1, and incrementing by 1 in order of newest to oldest.
The data is hashed on ID and sorted on ID and ROW, so it's going into the Survive stage so the most recent record is the first into the Survive stage.
I've tried it with multiple runs using many different variations of these rules (in addition to others):
1. All columns - analyze ROW Equals 1
2. GENDER - Most Frequent (non-blank) survives
1. GENDER - Most Frequent (non-blank) survives
2. All columns - analyze ROW Equals 1
The weird thing is that it works for SOME of the records, but not ALL of the records. Fortunately it seems to be consistently wrong for the same people.
I'm not sure if I'm overlooking something extremely simple, or if this is an actual problem with the Survive stage.
Any tips or suggestions would be greatly appreciated.
Thanks!