Probabilistic Matching Service function

Replies: 1 - Last Post: November 28, 2011 20:09
by: Csaba Toth
 
Posted: October 20, 2011 15:26 by ldcamachoc
----------------------------------------------------
[b]First question[/b]

Can you explain how to use the Probabilistic Matching Service?

I have used this service, but I don't understand all of it. I'll describe the steps I took and what I have learned so far.

1.- First of all, I noticed that when the Probabilistic Matching Service starts, the method ProbabilisticMatchingService.init() throws an exception if the file FellegiSunterConfiguration.ser does not exist.

2.- After that, I execute ProbabilisticMatchingService.linkRecords(). If there are no records in the database at that moment, the call

List<RecordPair> pairs = getRecordPairs();

returns an empty list, and an exception is then thrown when the following methods run:


scoreRecordPairs(pairs);
fellegiSunterParams = new FellegiSunterParameters(matchConfiguration.getMatchFields().size());

fellegiSunterParams.setMu(matchConfiguration.getFalsePositiveProbability());
fellegiSunterParams.setLambda(matchConfiguration.getFalseNegativeProbability());

calculateVectorFrequencies(pairs, fellegiSunterParams);
estimateMarginalProbabilities(fellegiSunterParams); // this step does not execute correctly
calculateRecordPairWeights(pairs, fellegiSunterParams);
orderRecordPairsByWeight(pairs);
calculateMarginalProbabilities(pairs, fellegiSunterParams);
calculateBounds(pairs, fellegiSunterParams);

saveParameters(matchConfiguration.getConfigFileDirectory(), fellegiSunterParams);
initialized = true;

This does not work correctly.

3.- I added a check for the case where List<RecordPair> pairs is empty, but the calculation is still incorrect: fellegiSunterParams ends up with wrong values.

4.- For that reason, I think the algorithm is not executing properly.
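
The kind of check described in step 3 can be sketched as follows. This is a minimal illustration, not taken from the Open-EMPI source: EmptyPairGuard, agreementRate and canEstimate are hypothetical names. It assumes that the estimation divides by the number of pairs somewhere, which in Java floating-point arithmetic yields NaN rather than an exception when the list is empty.

```java
import java.util.Arrays;
import java.util.Collections;
import java.util.List;

public class EmptyPairGuard {
    // Fraction of pairs that agree on one field (hypothetical helper).
    // With an empty list this computes 0.0 / 0, which is NaN in Java.
    public static double agreementRate(List<Boolean> agreements) {
        double agree = 0.0;
        for (boolean a : agreements) {
            if (a) {
                agree++;
            }
        }
        return agree / agreements.size();
    }

    // Hypothetical guard: estimation only makes sense with at least one pair.
    public static boolean canEstimate(List<?> pairs) {
        return pairs != null && !pairs.isEmpty();
    }

    public static void main(String[] args) {
        List<Boolean> none = Collections.<Boolean>emptyList();
        List<Boolean> some = Arrays.asList(true, false, true, true);

        System.out.println("unguarded, empty list: " + agreementRate(none));
        if (!canEstimate(none)) {
            System.out.println("no record pairs, skipping parameter estimation");
        }
        if (canEstimate(some)) {
            System.out.println("agreement rate: " + agreementRate(some));
        }
    }
}
```

With a guard like this the parameters are simply left untouched when there is nothing to estimate from, instead of being saved with undefined values.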


-----------------------------------------------------------------------------------------
[b]Second question[/b]

5.- Next, with records in the database, I am not sure how records should be inserted, because I have seen that I can add a person with different identifier domains and I can also add a person without any identifier domain. For example:

//TEST 1

// First person, with an identifier domain

Person person = new Person();
person.setGivenName("Juan");
person.setFamilyName("Escutia");
person.setAddress1("2930 Oak Shadow Drive");
person.setCity("Oak Hill");
person.setState("Virginia");

PersonIdentifier pi = new PersonIdentifier();
pi.setIdentifier("LDCC113XX01");

IdentifierDomain id = new IdentifierDomain();
id.setNamespaceIdentifier("3.1417");
pi.setIdentifierDomain(id);

person.addPersonIdentifier(pi);

//ANOTHER TEST

// Second person, without an identifier

Person person = new Person();
person.setGivenName("Juan");
person.setFamilyName("Esckutia");
person.setAddress1("2930 Oak Shadow Drive");
person.setCity("Oak Hill");
person.setState("Virginia");

The method I used is PersonServiceTest.testAddPerson().

I can see that the table person_link shows whether a person is related to another person. What I don't understand is this: if you can add many people with the same attributes, only without an identifier domain, you could end up with many very similar entries in the table person.

My question is: is the table person_link the only way to know whether two entries are the same person?

Could you explain this point?

I'm working with Open-EMPI 2.1.2.
 
Posted: November 28, 2011 20:09 by Csaba Toth
I don't know if this will answer your questions, but I'll try to give some brief pointers about the Probabilistic Matching Service. It uses a Fellegi-Sunter type of matching technique. That means the fields of each record pair are compared to each other by a comparison algorithm that you can configure (for example, JaroWinkler). The result is then quantized to a binary match/non-match value according to thresholds that you can also configure. Expectation Maximization is then run on these results (see estimateMarginalProbabilities). After that we can calculate the probabilities and finally the weights, and rank the record pairs. The ranked pairs can then be classified using the match and non-match thresholds.
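
The Expectation Maximization step can be sketched roughly as follows. This is a textbook-style illustration under the usual conditional-independence assumption, not the code behind estimateMarginalProbabilities: the class EmSketch, its method names, the starting values and the sample agreement vectors are all assumptions. Here m[i] is the probability that field i agrees for a true match, u[i] the same for a non-match, and p the proportion of matching pairs.

```java
public class EmSketch {
    // Probability of an agreement vector given per-field agreement probabilities.
    public static double likelihood(boolean[] gamma, double[] prob) {
        double l = 1.0;
        for (int i = 0; i < gamma.length; i++) {
            l *= gamma[i] ? prob[i] : 1.0 - prob[i];
        }
        return l;
    }

    // Runs a fixed number of EM iterations. m and u are updated in place;
    // the return value is the estimated proportion of matching pairs.
    public static double estimate(boolean[][] vectors, double[] m, double[] u,
                                  double p, int iterations) {
        int n = vectors.length;
        if (n == 0) {
            throw new IllegalArgumentException("no agreement vectors to estimate from");
        }
        for (int it = 0; it < iterations; it++) {
            // E-step: posterior probability that each pair is a match.
            double[] g = new double[n];
            double sumG = 0.0;
            for (int j = 0; j < n; j++) {
                double lm = p * likelihood(vectors[j], m);
                double lu = (1.0 - p) * likelihood(vectors[j], u);
                g[j] = lm / (lm + lu);
                sumG += g[j];
            }
            // M-step: re-estimate m, u and p from the posteriors.
            for (int i = 0; i < m.length; i++) {
                double mNum = 0.0;
                double uNum = 0.0;
                for (int j = 0; j < n; j++) {
                    if (vectors[j][i]) {
                        mNum += g[j];
                        uNum += 1.0 - g[j];
                    }
                }
                m[i] = mNum / sumG;
                u[i] = uNum / (n - sumG);
            }
            p = sumG / n;
        }
        return p;
    }

    public static void main(String[] args) {
        // Made-up binary agreement vectors for two fields.
        boolean[][] vectors = {
            {true, true}, {true, true}, {true, true}, {true, false},
            {false, false}, {false, false}, {false, false},
            {false, false}, {false, false}, {false, false}
        };
        double[] m = {0.9, 0.9};
        double[] u = {0.1, 0.1};
        double p = estimate(vectors, m, u, 0.3, 10);
        System.out.println("p=" + p + " m0=" + m[0] + " u0=" + u[0]
                + " m1=" + m[1] + " u1=" + u[1]);
    }
}
```

Note that this sketch refuses to run on an empty input, since every division in the M-step would otherwise be 0/0.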

Something very similar happens when you just want to check whether a given record already exists in the dataset.
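
For a single incoming record, that check could look roughly like the sketch below: compare the record with each existing one, quantize the comparison to a binary agreement vector, sum the Fellegi-Sunter weights, and classify by two thresholds. Everything here is assumed for illustration: the class LookupSketch, the m and u values, and the thresholds are made up, and a plain case-insensitive equality test stands in for a configurable comparator such as JaroWinkler.

```java
import java.util.Arrays;

public class LookupSketch {
    public enum Decision { MATCH, POSSIBLE_MATCH, NON_MATCH }

    // Stand-in comparator (assumption): 1.0 for a case-insensitive exact match, else 0.0.
    public static double similarity(String a, String b) {
        return a != null && a.equalsIgnoreCase(b) ? 1.0 : 0.0;
    }

    // Quantize the per-field similarity scores into a binary agreement vector.
    public static boolean[] agreementVector(String[] left, String[] right, double agreeThreshold) {
        boolean[] gamma = new boolean[left.length];
        for (int i = 0; i < left.length; i++) {
            gamma[i] = similarity(left[i], right[i]) >= agreeThreshold;
        }
        return gamma;
    }

    // Sum of log2 likelihood ratios: log2(m/u) for agreement,
    // log2((1-m)/(1-u)) for disagreement.
    public static double weight(boolean[] gamma, double[] m, double[] u) {
        double w = 0.0;
        for (int i = 0; i < gamma.length; i++) {
            double ratio = gamma[i] ? m[i] / u[i] : (1.0 - m[i]) / (1.0 - u[i]);
            w += Math.log(ratio) / Math.log(2.0);
        }
        return w;
    }

    // Classify a weight using a lower (non-match) and an upper (match) threshold.
    public static Decision classify(double weight, double lower, double upper) {
        if (weight >= upper) {
            return Decision.MATCH;
        }
        if (weight <= lower) {
            return Decision.NON_MATCH;
        }
        return Decision.POSSIBLE_MATCH;
    }

    public static void main(String[] args) {
        // Fields: given name, family name, city.
        String[] probe = {"Juan", "Esckutia", "Oak Hill"};
        String[][] existing = {
            {"Juan", "Escutia", "Oak Hill"},
            {"Maria", "Lopez", "Reston"}
        };
        double[] m = {0.9, 0.9, 0.8};   // assumed, would come from the estimation step
        double[] u = {0.1, 0.05, 0.3};  // assumed, would come from the estimation step

        for (String[] candidate : existing) {
            boolean[] gamma = agreementVector(probe, candidate, 0.85);
            double w = weight(gamma, m, u);
            System.out.println(Arrays.toString(candidate) + " " + Arrays.toString(gamma)
                    + " weight=" + w + " -> " + classify(w, 0.0, 5.0));
        }
    }
}
```

Because the stand-in comparator only recognizes exact matches, "Esckutia" and "Escutia" count as a disagreement here; a string-distance comparator with a suitable threshold would treat them as agreeing.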

The part of the software that deals with persisting person links still needs further development.