How to calculate match stage scores

Infosphere's Quality Product


anand_chal
Premium Member
Posts: 8
Joined: Sat May 28, 2011 7:51 am

How to calculate match stage scores

Post by anand_chal »

I am new to the Match stage and am trying to get hands-on with it.
I started working with the Match stage examples that IBM provides on its site.

I am trying to work out how the agreement and disagreement weights are calculated for individual fields. The documentation gives the formulas as follows:
Agreement score: log2(m probability / u probability)
Disagreement score: log2((1 - m probability)/(1 - u probability))
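
For example, here is a quick sketch of those two formulas in Python (my own illustration, not QualityStage code), using the probabilities from my table below:

    from math import log2

    def agreement_weight(m, u):
        # agreement weight: log2(m-probability / u-probability)
        return log2(m / u)

    def disagreement_weight(m, u):
        # disagreement weight: log2((1 - m-probability) / (1 - u-probability))
        return log2((1 - m) / (1 - u))

    print(agreement_weight(0.99, 0.01))      # ~6.6294
    print(disagreement_weight(0.99, 0.01))   # ~-6.6294
    print(agreement_weight(0.90, 0.01))      # ~6.4919
    print(disagreement_weight(0.90, 0.01))   # ~-3.3074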

I started calculating the individual field scores manually for a master record from the output of a match pass on the sample data.
The values are calculated for the master record itself, so the record is compared with itself rather than with any other record in the file.

Below is a table comparing the scores from my manual calculation with the scores reported by the Match stage.

Matching field name             Comparison  m-prob  u-prob  Param1  Agreement wt  Disagreement wt  Match/No match/BLANK  Manual score  Match stage score
MatchPrimaryName_USNAME         UNCERT      0.99    0.01    700     6.62935662    -6.62935662      Match                 6.62935662    12.69
HouseNumber_USADDR              CHAR        0.99    0.01    NA      6.62935662    -6.62935662      Match                 6.62935662    12.3
HouseNumberSuffix_USADDR        CHAR        0.9     0.01    NA      6.491853096   -3.307428525     Match                 6.491853096   4.26
StreetPrefixDirectional_USADDR  CHAR        0.9     0.01    NA      6.491853096   -3.307428525     BLANK                 0             13.1
StreetName_USADDR               UNCERT      0.99    0.01    800     6.62935662    -6.62935662      Match                 6.62935662    0
StreetSuffixDirectional_USADDR  CHAR        0.9     0.01    NA      6.491853096   -3.307428525     BLANK                 0             0
RuralRouteValue_USADDR          CHAR        0.9     0.01    NA      6.491853096   -3.307428525     BLANK                 0             0
BoxValue_USADDR                 CHAR        0.9     0.01    NA      6.491853096   -3.307428525     BLANK                 0             0
FloorValue_USADDR               CHAR        0.9     0.01    NA      6.491853096   -3.307428525     BLANK                 0             0
UnitValue_USADDR                CHAR        0.9     0.01    NA      6.491853096   -3.307428525     BLANK                 0             0
BuildingName_USADDR             UNCERT      0.9     0.01    800     6.491853096   -3.307428525     BLANK                 0             0
ZipCode_USAREA                  CHAR        0.99    0.01    NA      6.62935662    -6.62935662      Match                 6.62935662    10.17

Composite weights:
Expected score (manual)   Actual score from Match stage
33.00927958               52.52


In the table above, the last two columns show my manual scores against the scores reported by the Match stage.
After adding up the individual field scores, there is a considerable difference between the two composite weights: my calculation gives 33.009 while the Match stage reports 52.52.
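The composite weight appears to be simply the sum of the individual field scores; here is a quick Python check of the two columns from my table, which gives 33.009 and 52.52 respectively:

    # last two columns of the table above, in row order
    manual_scores = [6.62935662, 6.62935662, 6.491853096, 0, 6.62935662,
                     0, 0, 0, 0, 0, 0, 6.62935662]
    match_stage_scores = [12.69, 12.3, 4.26, 13.1, 0,
                          0, 0, 0, 0, 0, 0, 10.17]

    print(sum(manual_scores))       # ~33.009
    print(sum(match_stage_scores))  # 52.52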

Am I missing anything in my calculation that would give me the correct results?
What is the significance of the Param1 value in the calculation?
There is not much documentation available on the web, so I am turning to DSXchange to find my answers.

Any help is much appreciated. Thank you!
ray.wurlod
Participant
Posts: 54607
Joined: Wed Oct 23, 2002 10:52 pm
Location: Sydney, Australia

Post by ray.wurlod »

Param1 has different interpretations for different matching algorithms. For UNCERT, for example, Param1 is the uncertainty threshold (somewhere between 700 and 900, which you CAN find in the documentation).

The m-probability is essentially how tight you want the comparisons to be. You can think of it as (1 - error rate): for example, if you set the m-probability to 0.9 then you're prepared to accept an error rate of 10%.

u-probability seeks to put a figure on the possibility that, where a match is found, it is due to random factors. For example, if you set u-prob to 0.001, then you are accepting that one in a thousand matches will be due to random factors.

Agreement and disagreement weights also take into account the "information content", or "rarity value" of particular values within their domains. This information used to be in the Advanced QualityStage class (code 1M413G/2M413G/KM413G for version 11.5) but I'm not certain that it's still part of the course.
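
To put rough numbers on that (a back-of-the-envelope sketch only, not QualityStage's internal code): if you accept a 10% error rate (m = 0.9) and a one-in-a-thousand chance agreement (u = 0.001), the weights from the formulas quoted above come out as follows.

    from math import log2

    def weights(error_rate, chance_agreement):
        # m-probability = 1 - the error rate you are prepared to accept
        # u-probability = the probability two records agree purely by chance
        m = 1 - error_rate
        u = chance_agreement
        agreement = log2(m / u)
        disagreement = log2((1 - m) / (1 - u))
        return agreement, disagreement

    print(weights(0.10, 0.001))   # approx (9.81, -3.32)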
IBM Software Services Group
Any contribution to this forum is my own opinion and does not necessarily reflect any position that IBM may hold.
anand_chal
Premium Member
Posts: 8
Joined: Sat May 28, 2011 7:51 am

Post by anand_chal »

Thanks for your reply, Ray.

I searched the documentation, but I still cannot understand why there is such a large difference. If some other calculation is required in addition to the agreement and disagreement scores, why is it not documented anywhere in the IBM documentation?

If you could provide me links I will go through them.

Thanks again!
ray.wurlod
Participant
Posts: 54607
Joined: Wed Oct 23, 2002 10:52 pm
Location: Sydney, Australia

Post by ray.wurlod »

I can't recall that it's anywhere in any of the documentation other than in the training course to which I alluded.

Harald Smith has a developerWorks article on how match weights are calculated, which in turn references the QualityStage Redbook. Robert Dickson has added to the developerWorks article more recently.

If you open the Knowledge Center and query "QualityStage computation of weights" you will get "the documentation", most of which you have probably read.
IBM Software Services Group
Any contribution to this forum is my own opinion and does not necessarily reflect any position that IBM may hold.
anand_chal
Premium Member
Posts: 8
Joined: Sat May 28, 2011 7:51 am

Post by anand_chal »

ray.wurlod wrote:I can't recall that it's anywhere in any of the documentation other than in the training course to which I alluded.

Harald Smith has a [url=https://www.ibm.com/developerworks/comm ... rums/html/ ...
The link didn't help me much. I am still investigating the formulas required to calculate the m and u probabilities accurately. Is there a way to get the source code that runs behind the Match stage? That way I could work out how the scores are evaluated much more easily. Thanks in advance!
stuartjvnorton
Participant
Posts: 527
Joined: Thu Apr 19, 2007 1:25 am
Location: Melbourne

Post by stuartjvnorton »

Firstly, I had to laugh at the idea of IBM giving you the source code. There's no way they're going to give you their secret sauce. Just saying.


On the question at hand though, it's been a while since I dug properly into this stuff, so here goes, for what it's worth...

The match frequency files can skew the scores. IIRC, uncommon values tend to produce higher scores than the norm because the chance of random agreement is much lower. For example, if you're matching FirstName, the chance of two people both being named Ezekiel is much lower than two people both being named Michael, so matching on Ezekiel would give you a higher score.
On the flip side, House Number Suffix matching on an A is not worthless, but I'd imagine three quarters of the non-blank values would be an A, so it's worth less than a B, etc.
So if the sample record has unusual values, that would make a difference.
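
Very roughly (just a sketch of the idea, not the actual frequency adjustment QS performs), you can think of the value-specific u-probability as the relative frequency of that value, so a rare value pushes the agreement weight up:

    from math import log2

    m = 0.95
    total = 10000                    # made-up FirstName frequency counts
    u_michael = 500 / total          # common value -> higher chance of random agreement
    u_ezekiel = 5 / total            # rare value   -> lower chance of random agreement

    print(log2(m / u_michael))       # ~4.25
    print(log2(m / u_ezekiel))       # ~10.89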

You had a score for StreetPrefixDirectional when it was blank? Or should that be on the StreetName match line?
If it's not a typo, do you have score overrides? Seems very strange. Unless you're telling it to count blank as a value and give it a score?

Anyway, being able to predict the values is probably not terribly useful outside the academic exercise, and matching in QS is just as much art as science.

There is a certain amount of effort devoted to reviewing matches, non-matches, and the grey area known as "clerical", to understand what is in each bucket, what you or the client want to be in each, and to tweak the scores until you get what you need.