Page 1 of 1

Multi-valued domain matching

Posted: Wed Apr 10, 2013 11:57 am
by ymadden@bmi.com
I'm not sure if the post title is accurate. Maybe that is why I'm not getting any search hits.

We have a requirement to match on attributes that can have multiple values, e.g., the multiple contacts below
Reference Set -
Company A, Contacts {Person A, Person B, Person C}
Company B, Contacts {Person D, Person E}
Data Set -
Company A, Contacts {Person B}

The desired result is that Company A should match based on the Person B contact. I'm not exactly sure how to accomplish this. Exploding the reference set so that there is a record per Person seems unreasonable. Would using a vector or subrecord column type be the correct approach? The documentation seems a little confusing.

Any input is appreciated.

Posted: Wed Apr 10, 2013 2:30 pm
by ray.wurlod
How are your data organised? In particular,does the Contacts table have one row per Person/Company combination? If so, you can pretty much use an exact match (or UNCERT if there are likely to be spelling errors) based on Company.

Posted: Sun Apr 14, 2013 9:38 am
by ymadden@bmi.com
thanks Ray. Our data is in relational tables, so there's a Company table and a Contacts table. In reality, they're not company and contacts, those are just simple examples. We also have other multi-valued attributes of the business objects that we need to match on. Some of these need to be used in blocking strategies too. We've never used the subrecord or vector structures, so I'm wondering if they would be a better alternative to transforming the data into all possible combinations of every multi-valued attribute. Producing these combinations could create billions of records in the datasets for some business objects.

Posted: Sun Apr 14, 2013 2:16 pm
by ray.wurlod
You have to ask yourself whether these billions of rows are the correct answer. If not, tighten your blocking or matching criteria. As a general rule, blocking in the first pass should eliminate as much as possible from each block of "matchable" records.

Posted: Wed Apr 17, 2013 7:15 pm
by ymadden@bmi.com
That billion row question is the exact one I am asking myself now. I've read back over the documentation and can see how we can restructure the billion rows by combining records to have vector columns. Has anyone ever actually used a vector column in a match command, and what were the benefits (performance, accuracy, etc.) or drawbacks of doing so?

Posted: Wed Apr 17, 2013 8:05 pm
by ray.wurlod
I think you would be better off parsing the multi-valued fields into separate words (for example by converting the internal delimiter character to space) and using the MULT_EXACT or MULT_UNCERT comparisons in QualityStage match.