How to calculate match stage scores
Posted: Mon Jul 23, 2018 9:03 am
I am new to the Match Stage and am trying to get some hands-on experience with it.
I started working with the Match Stage examples that IBM provides on its site.
I am trying to find out how the agreement and disagreement values are calculated for individual fields. The documentation gives the following formulas:
Agreement score: log2(m probability / u probability)
Disagreement score: log2((1 - m probability)/(1 - u probability))
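As a quick check of the two formulas above, here is a minimal Python sketch that evaluates them for the m and u probabilities used in the table below. The function names are my own and not part of any IBM tooling; this only reproduces the documented formulas and does not claim to model what the Match Stage does internally.

```python
import math

def agreement_weight(m: float, u: float) -> float:
    """Agreement score from the documented formula: log2(m / u)."""
    return math.log2(m / u)

def disagreement_weight(m: float, u: float) -> float:
    """Disagreement score from the documented formula: log2((1 - m) / (1 - u))."""
    return math.log2((1 - m) / (1 - u))

# The two (m, u) pairs that appear in the table.
for m, u in [(0.99, 0.01), (0.9, 0.01)]:
    print(m, u, round(agreement_weight(m, u), 6), round(disagreement_weight(m, u), 6))
```

These values agree with the Agreement_weight and Disagreement_weight columns in the table, so the per-field weights themselves appear to follow the formulas.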
I then calculated the individual field scores manually for a master record taken from the match pass output for the sample data.
The values are calculated for the master record, so the record is compared with itself and not with any other record in the file.
Below is a table showing the scores from my calculation and from the Match Stage.
Note: For readability, please copy this table into Excel and split the columns using space as the delimiter.
Matching_field_names Comparison m-Probability u-Probability Param1 Agreement_weight Disagreement_weight Match/Notmatch/BLANK Manual_Scores Match_stage_scores
MatchPrimaryName_USNAME UNCERT 0.99 0.01 700 6.62935662 -6.62935662 Match 6.62935662 12.69
HouseNumber_USADDR CHAR 0.99 0.01 NA 6.62935662 -6.62935662 Match 6.62935662 12.3
HouseNumberSuffix_USADDR CHAR 0.9 0.01 NA 6.491853096 -3.307428525 Match 6.491853096 4.26
StreetPrefixDirectional_USADDR CHAR 0.9 0.01 NA 6.491853096 -3.307428525 BLANK 0 13.1
StreetName_USADDR UNCERT 0.99 0.01 800 6.62935662 -6.62935662 Match 6.62935662 0
StreetSuffixDirectional_USADDR CHAR 0.9 0.01 NA 6.491853096 -3.307428525 BLANK 0 0
RuralRouteValue_USADDR CHAR 0.9 0.01 NA 6.491853096 -3.307428525 BLANK 0 0
BoxValue_USADDR CHAR 0.9 0.01 NA 6.491853096 -3.307428525 BLANK 0 0
FloorValue_USADDR CHAR 0.9 0.01 NA 6.491853096 -3.307428525 BLANK 0 0
UnitValue_USADDR CHAR 0.9 0.01 NA 6.491853096 -3.307428525 BLANK 0 0
BuildingName_USADDR UNCERT 0.9 0.01 800 6.491853096 -3.307428525 BLANK 0 0
ZipCode_USAREA CHAR 0.99 0.01 NA 6.62935662 -6.62935662 Match 6.62935662 10.17
Composite_Weights:
Expected_score Actual_Score_from_Match_stage
33.00927958 52.52
In the table above, the last two columns show my manual scores and the scores from the Match Stage.
There is a considerable difference between the composite weight I get by adding up the individual scores and the one reported by the Match Stage (33.009 vs 52.52).
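For reference, my manual composite weight can be reproduced with the short Python sketch below. It rests on my own assumptions, which may not match how the Match Stage scores: a BLANK field contributes 0, a matching field contributes its full agreement weight, and Param1 is ignored. The `fields` list and `field_score` helper are names I made up for this illustration.

```python
import math

# (field name, m probability, u probability, state) copied from the table.
fields = [
    ("MatchPrimaryName_USNAME", 0.99, 0.01, "Match"),
    ("HouseNumber_USADDR", 0.99, 0.01, "Match"),
    ("HouseNumberSuffix_USADDR", 0.9, 0.01, "Match"),
    ("StreetPrefixDirectional_USADDR", 0.9, 0.01, "BLANK"),
    ("StreetName_USADDR", 0.99, 0.01, "Match"),
    ("StreetSuffixDirectional_USADDR", 0.9, 0.01, "BLANK"),
    ("RuralRouteValue_USADDR", 0.9, 0.01, "BLANK"),
    ("BoxValue_USADDR", 0.9, 0.01, "BLANK"),
    ("FloorValue_USADDR", 0.9, 0.01, "BLANK"),
    ("UnitValue_USADDR", 0.9, 0.01, "BLANK"),
    ("BuildingName_USADDR", 0.9, 0.01, "BLANK"),
    ("ZipCode_USAREA", 0.99, 0.01, "Match"),
]

def field_score(m: float, u: float, state: str) -> float:
    """Manual score per field. Assumption: BLANK scores 0, Param1 is not used."""
    if state == "BLANK":
        return 0.0
    if state == "Match":
        return math.log2(m / u)            # agreement weight
    return math.log2((1 - m) / (1 - u))    # disagreement weight

composite = sum(field_score(m, u, state) for _, m, u, state in fields)
print(round(composite, 6))
```

This gives the expected score in the Composite_Weights table, so the gap to 52.52 does not come from an arithmetic slip in the sum.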
Am I missing anything in my calculation that would explain the difference?
What is the significance of the Param1 value in the calculation?
There is not much documentation available on the web, so I am turning to DSXchange for answers.
Any help is much appreciated. Thank you!