Data Matching: QualityStage vs Oracle’s Soundex

This forum is in support of all issues about Data Quality regarding DataStage and other strategies.

Moderators: chulett, rschirm

Post Reply
ray.wurlod
Participant
Posts: 54607
Joined: Wed Oct 23, 2002 10:52 pm
Location: Sydney, Australia
Contact:

Post by ray.wurlod »

QualityStage does have Soundex, but you are not using it as your blocking columns. Did you intend to? Soundex is Soundex, no matter which vendor it comes from; the algorithm is in the public domain.

NYSIIS is a more sophisticated algorithm than Soundex (primarily because it looks at more characters and does not lose the vowels), and could not reasonably be expected to generate the same results as Soundex - otherwise why bother having two algorithms at all?
IBM Software Services Group
Any contribution to this forum is my own opinion and does not necessarily reflect any position that IBM may hold.
marionpt
Participant
Posts: 4
Joined: Mon Sep 29, 2008 12:41 am
Location: Manila, Philippines
Contact:

Post by marionpt »

Hi,

In our version (8.0.1) there is no Soundex, we only have Reverse Soundex (RVSNDX) and NYSIIS. (I will ask the IBM support regarding this).

I didn't use that because there are Last Name ends with different letter e.g. "Fernando Gonzales" should match with "Fernando Gonzalez"

I use QS to get the most accurate results. And I use Soundex for testing purposes only. Soundex result sometimes not that accurate e.g. "John Garcia" match with "Jane Garso".

In my understanding, the "Residual" are the unique records. But when I tested the result (using the Soundex), I found several duplicates e.g.
Customer# FirstName LastName State
752 Michelle Cortel Texas
254 Michelle Cortel Texas

Even though I dont use the soundex, I could find duplicates in Residual.

Thanks!
ray.wurlod
Participant
Posts: 54607
Joined: Wed Oct 23, 2002 10:52 pm
Location: Sydney, Australia
Contact:

Post by ray.wurlod »

What aggregate weights did you get for these "duplicates"? How common or rare is "Michelle" in your data?
IBM Software Services Group
Any contribution to this forum is my own opinion and does not necessarily reflect any position that IBM may hold.
cristty
Participant
Posts: 17
Joined: Tue Jun 22, 2010 5:19 am

Post by cristty »

Are you sure it's a good idea to have all 3 columns both present for blocking and for matching?
ray.wurlod
Participant
Posts: 54607
Joined: Wed Oct 23, 2002 10:52 pm
Location: Sydney, Australia
Contact:

Post by ray.wurlod »

marionpt wrote:In our version (8.0.1) there is no Soundex, we only have Reverse Soundex (RVSNDX) and NYSIIS.
You DO have Soundex - it's just that the rule set you're using does not use it to generate any matching field. You could create a custom rule set that does use it, if that's what you want.
IBM Software Services Group
Any contribution to this forum is my own opinion and does not necessarily reflect any position that IBM may hold.
stuartjvnorton
Participant
Posts: 527
Joined: Thu Apr 19, 2007 1:25 am
Location: Melbourne

Post by stuartjvnorton »

cristty wrote:Are you sure it's a good idea to have all 3 columns both present for blocking and for matching?
Yeah, seems kind of pointless.

Blocking fields should be there to weed out the junk, and matching fields are to differentiate the ones that pass the block by giving them a score.
If you match on exactly the same fields as the block, then everything that makes it through the match pass will be an exact match (on the loose key) and the only score differentiation will be because of less common sounding terms as per the frequency analysis.

Even if the blocking fields and matching fields aren't the same, you want to check against actual values (or tight keys) in the match pass, so it can score the ones that are exact (or almost) higher than the ones that just sneak through via the NYSIIS.

Exact checking on loose keys = blocking
Fuzzy checks on loose keys = minimal matching value
Fuzzy checks on exact values = better matching value
Post Reply