OpenEMPI connected to SQL Server 2008

30 posts
Replies: 29 - Last Post: February 15, 2012 07:09
by: murphyric
 
Posted: November 23, 2011 22:39 by murphyric

I connected OpenEMPI to SQL Server 2008 successfully by porting the PostgreSQL schema; I also needed a couple of updated JARs in the process. The only remaining issue is that I'm not able to make the ProbablisticMatchingService match 2 similar persons. I am waiting for the experts' advice, which I badly need.

Adding persons with exact match links the records, but with probabilistic match it does not. I would ask the experts to please give me their suggestions.

Thanks.
 
Posted: November 28, 2011 19:52 by Csaba Toth
That's very interesting, going to SQL Server 2008!
What happens when you try to match with ProbablisticMatchingService? What kind of exception or what is the error?
We need more information in order to be able to help.
 
Posted: December 08, 2011 10:04 by techguy
Hello murphyric

I am also trying to configure the EMPI to run on SQL Server 2008.
Can you please share the SQL scripts and other modifications that you made to get it working?

I appreciate your help in advance.

Thanks
 
Posted: November 29, 2011 17:29 by murphyric
Hi,

Thank you for your reply.

What happens when you try to match with ProbablisticMatchingService?
It does not show any links between 2 similar records, which were actually linked by the deterministic matching service in another test.

Below are the main configuration files I am considering for the probabilistic match:
applicationContext-resources.xml
applicationContext-dao.xml
applicationContext-service-probabilisticTest.xml
applicationContext-module-probabilistic-matching.xml


My test:

I added 2 persons, 'Martin Javier' and 'Marteen Xavier', with deterministic matching. Both persons were added to the 'Person' table, and I found a record linking them in the 'Person_link' table.

I then updated the configuration to probabilistic matching and added 2 persons, 'Elle Lee' and 'Elle Lei'. Both persons were added to the 'Person' table, but I did not find any link or record related to them in the 'Person_link' table.
While debugging the addition of records with probabilistic matching, I found that 'match()' returns 0 RecordPairs.

public Set<RecordPair> match(Record record) throws ApplicationException {
    log.debug("Looking for matches on record " + record);
    if (!initialized) {
        throw new ApplicationException("Matching service has not been initialized yet.");
    }
    List<RecordPair> pairs = Context.getBlockingService().findCandidates(record);
    Set<RecordPair> matches = new java.util.HashSet<RecordPair>();
    scoreRecordPairs(pairs);
    calculateRecordPairWeights(pairs, fellegiSunterParams);

    // Apply Fellegi-Sunter decision rule
    for (RecordPair pair : pairs) {
        if (pair.getWeight() >= fellegiSunterParams.getUpperBound()) {
            matches.add(pair);
        } else if (pair.getWeight() > fellegiSunterParams.getLowerBound()) {
            // This is a possible match; need to add it to the list for review
        }
    }
    return matches;
}




What kind of exception or what is the error?

I did not see any exceptions during the test except the FileNotFoundException in the 'loadParameters()' method, which is expected; the process continued past it.


We need more information in order to be able to help.
Please let me know if I need to follow any suggested path in order to obtain any further information that will be helpful to investigate or move forward. Just asking if there is any other way than the one I followed in my test.
 
Posted: November 29, 2011 18:01 by Csaba Toth
The whole record linkage procedure can be broken down into a few steps:
1. Blocking: this step significantly decreases the number of record pairs that will be compared later. Without blocking you'd have to examine every possible record pair between the datasets to be linked. With large datasets this cross-product would be infeasible to handle, so with blocking you rule out lots of record pairs which very likely would not match anyway. Simple example: you compare only those record pairs where the first letters of the last names match.
The basic blocking procedure works like this: it only pairs up records for which certain attributes match. The above example is too harsh; what OpenEMPI has in its configuration by default is to transform some of the fields (givenname, familyname, etc.) with the DoubleMetaphone transformation into new (custom) columns. Basic blocking is then configured to run on these columns.

(If you'd like to implement my simple example, you'd write a simple Transformation service which takes the first letter of a given field, then use that transformation to generate custom fields and configure basic blocking to use these. You can also implement your own blocking strategies besides basic blocking.)
2. Then we take the selected record pairs and perform the actual comparisons on their field pairs individually (as configured by the matching configuration). In this step we don't care about the transformed fields; we compare the real given name or family name of the records to each other with some string comparison function.
3. Then comes the actual Fellegi-Sunter procedure: first the Expectation-Maximization, then calculating weights, and finally ranking the pairs. The (configurable) thresholds then decide the match/non-match statuses.
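
The first-letter blocking example from step 1 can be sketched in a few lines of Java. This is only an illustration under stated assumptions: the class and method names (FirstLetterBlocking, blockingKey, candidatePairs) are hypothetical and are not part of OpenEMPI, and records are reduced to a bare list of family names.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Locale;
import java.util.Map;

// Illustrative only: hypothetical names, not OpenEMPI code.
final class FirstLetterBlocking {

    // Blocking key: upper-cased first letter of the family name ("" if empty).
    static String blockingKey(String familyName) {
        if (familyName == null || familyName.isEmpty()) {
            return "";
        }
        return familyName.substring(0, 1).toUpperCase(Locale.ROOT);
    }

    // Returns index pairs (i, j) with i < j for names that share a blocking key.
    static List<int[]> candidatePairs(List<String> familyNames) {
        Map<String, List<Integer>> blocks = new LinkedHashMap<>();
        for (int i = 0; i < familyNames.size(); i++) {
            String key = blockingKey(familyNames.get(i));
            blocks.computeIfAbsent(key, k -> new ArrayList<>()).add(i);
        }
        List<int[]> pairs = new ArrayList<>();
        for (List<Integer> block : blocks.values()) {
            // Only records inside the same block are paired with each other.
            for (int a = 0; a < block.size() - 1; a++) {
                for (int b = a + 1; b < block.size(); b++) {
                    pairs.add(new int[] {block.get(a), block.get(b)});
                }
            }
        }
        return pairs;
    }

    public static void main(String[] args) {
        List<String> names = Arrays.asList("Lee", "Lei", "Xavier", "Llee");
        System.out.println(candidatePairs(names).size() + " candidate pairs");
    }
}
```

With the four names in main, the full cross-product would contain 6 pairs, while blocking on the first letter leaves 3 candidate pairs to compare.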

I wrote this down to give you some background about what's going on.

What I think is that you have a problem with the first step: blocking doesn't generate any record pairs for you. Please check your blocking configuration and the blocking run. Tell us what method you use for blocking, and in what configuration.

More importantly, I also wonder whether your experiments work fine with PostgreSQL. If they do, and the only difference is really your migration to SQL Server 2008, then there can be two problems:
- Either your schema is somehow not right for OpenEMPI (can you share your SQL Server OpenEMPI schema?). Perhaps some fields cannot be found?
- Or there is some strange behavior in the Hibernate implementation of the basic blocking: it doesn't work with SQL Server the way it is intended to with PostgreSQL.
Try to debug the getDistinctValues calls and blockRecords in PersonDaoHibernate.java (it's in dao/hibernate/PersonDaoHibernate.java). But that advice holds only if the blocking is the culprit.

Let's debug it!
 
Posted: November 29, 2011 23:56 by murphyric



Thank you again for your quick reply.

That really helped me a lot, and I will also work on the simple example you suggested.

Currently ....
Tell us what method you use for blocking, in what configuration.
I am using the basic blocking (<bb:basic-blocking>) in 'mpi-config-probablistic-matching.xml' configuration as-is from the source repository.

The blocking code worked with both the deterministic and probabilistic matching configs on SQL Server.
On PostgreSQL, deterministic matching linked the records. I think probabilistic matching did not link the records on PostgreSQL, possibly for the same reasons as explained in the Observations section below.

- I did get some 'value' pairs from the line below:
List<List<NameValuePair>> values = blockingDao.getDistinctValues(getBlockingFieldList(fields));

- I also got 2 records from blockRecords(), and the code went on to form a pair in the for loop below.

private void loadRecordPairList(List<NameValuePair> pairs) {
    Criteria criteria = getCriteria(pairs);
    List<Record> records = recordPairSource.getBlockingDao().blockRecords(criteria);
    recordPairs = new ArrayList<RecordPair>();
    // If we don't find at least two records then there are no pairs to construct
    if (records.size() < 2) {
        return;
    }
    for (int i = 0; i < records.size() - 1; i++) {
        for (int j = i + 1; j < records.size(); j++) {
            log.debug("Building record pairs using indices " + i + " and " + j);
            RecordPair recordPair = new RecordPair(records.get(i), records.get(j));
            recordPairs.add(recordPair);
        }
    }
}



Observations
While adding a similar record, although a pair was found:
- In the FellegiSunterParams object, 'mValue' and 'uValue' are NaN (not a number), so the weight was also NaN in CalculateRecordPairWeight().
- In calculateBounds(), 'sum' was NaN because VectorProbGivenM and VectorProbGivenU are also NaN.
- The fellegiSunterParams.LowerBound and fellegiSunterParams.UpperBound properties are also NaN.
- Finally, the 'matches' set has 0 record pairs in it because of all of the above.
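
Why NaN weights and bounds leave the 'matches' set empty can be shown with a small standalone sketch. It mirrors only the shape of the decision rule in match(); the class NanDecisionRule and its classify method are hypothetical names, not OpenEMPI code. In Java, every ordered comparison (>=, >) involving NaN evaluates to false, so a NaN weight or a NaN bound silently falls through both branches without any exception.

```java
// Illustrative only: hypothetical names, not OpenEMPI code.
final class NanDecisionRule {

    // Same branch structure as the Fellegi-Sunter decision rule in match().
    static String classify(double weight, double lowerBound, double upperBound) {
        if (weight >= upperBound) {
            return "match";
        } else if (weight > lowerBound) {
            return "possible";
        }
        return "non-match";
    }

    public static void main(String[] args) {
        // NaN weight and NaN bounds: both comparisons are false.
        System.out.println(classify(Double.NaN, Double.NaN, Double.NaN));
        // A finite weight against NaN bounds behaves the same way.
        System.out.println(classify(5.0, Double.NaN, Double.NaN));
        // Double.isNaN makes the condition visible instead of silent.
        System.out.println(Double.isNaN(0.0 / 0.0));
    }
}
```

Checking the bounds with Double.isNaN before applying the rule would turn this silent "no matches" result into an explicit error.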


public Set<RecordPair> match(Record record) throws ApplicationException {
    log.debug("Looking for matches on record " + record);
    if (!initialized) {
        throw new ApplicationException("Matching service has not been initialized yet.");
    }
    List<RecordPair> pairs = Context.getBlockingService().findCandidates(record);
    Set<RecordPair> matches = new java.util.HashSet<RecordPair>();
    scoreRecordPairs(pairs);
    calculateRecordPairWeights(pairs, fellegiSunterParams);

    // Apply Fellegi-Sunter decision rule
    for (RecordPair pair : pairs) {
        if (pair.getWeight() >= fellegiSunterParams.getUpperBound()) {
            matches.add(pair);
        } else if (pair.getWeight() > fellegiSunterParams.getLowerBound()) {
            // This is a possible match; need to add it to the list for review
        }
    }
    return matches;
}


I took those screen shots during debugging but could not upload to a file sharing site. Please let me know if you want them to be sent to your email.
 
Posted: November 30, 2011 01:18 by Csaba Toth
So the blocking seems to work. The source of the problem is the NaNs. The mValue and uValue parameters in the FellegiSunterParams are the output of the Expectation-Maximization run. I have had problems with the EM previously. In your case it ends up at NaN, probably because it doesn't have sufficient information to come up with any meaningful statistical result.

Debug the estimateMarginalProbabilities function in org.openhie.openempi.matching.fellegisunter.ExpectationMaximizationEstimator.java. EM is an iterative process, and I'd bet $100 that it goes to NaN after the first iteration. This can be caused by some configuration issue (selecting strange fields for matching).
- Try to tune your matching. For example, match on only 1 or 2 fields.
- Try to add more records to the experiment.

Matching individual records could work well if the mValues and uValues, which are specific to your dataset, have already been calculated by a previous, information-rich EM run.
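
To illustrate why NaN m and u values make every pair weight NaN, here is a minimal sketch of the textbook Fellegi-Sunter pair weight: each field contributes log2(m/u) when it agrees and log2((1-m)/(1-u)) when it disagrees. The class name, the method names, and the choice of a base-2 logarithm are assumptions for illustration; OpenEMPI's own weight calculation may differ in detail. The values 0.9 and 0.1 are just sample m and u values.

```java
// Illustrative only: hypothetical names, not OpenEMPI code.
final class FellegiSunterWeights {

    static double log2(double x) {
        return Math.log(x) / Math.log(2.0);
    }

    // Sum of per-field weights for a binary agreement vector.
    // agrees[i] says whether field i agreed; m[i] and u[i] are the
    // per-field probabilities that EM is supposed to estimate.
    static double pairWeight(boolean[] agrees, double[] m, double[] u) {
        double weight = 0.0;
        for (int i = 0; i < agrees.length; i++) {
            weight += agrees[i]
                    ? log2(m[i] / u[i])
                    : log2((1.0 - m[i]) / (1.0 - u[i]));
        }
        return weight;
    }

    public static void main(String[] args) {
        double[] m = {0.9, 0.9};
        double[] u = {0.1, 0.1};
        // Both fields agree: a positive weight.
        System.out.println(pairWeight(new boolean[] {true, true}, m, u));
        // NaN parameters poison the whole sum.
        double[] nan = {Double.NaN, Double.NaN};
        System.out.println(pairWeight(new boolean[] {true, true}, nan, nan));
    }
}
```

Because a single NaN term makes the whole sum NaN, one bad m or u value is enough to make every pair unclassifiable.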

I'd like to hear Odysseas's advice too.
 
Posted: November 30, 2011 01:31 by Csaba Toth
Additional note to my previous comment: the EM issue doesn't seem to be RDBMS-specific, so it should pop up with the PostgreSQL OpenEMPI version too. Or, in that case, might you have accidentally had an existing FellegiSunter.ser in place?
 
Posted: November 30, 2011 23:14 by murphyric
Hi
During the test today, I found that the FellegiSunter.ser file was in the path (I think it was created in my previous runs), so I deleted it to create a new one (as per your suggestion) and ran the test once again.

Below are the updates from my test:

- Test setup: I have 3 records in the MS SQL DB, and I wanted to add a new person and link it with them using probabilistic matching (mpi-config-probablistic.xml). The GivenName, FamilyName, and DOB of the persons already in the DB and of the test data are as follows:

1)elle lee 1918-01-29
2)elle lee 1916-05-29
3)ellee lee 1917-05-29
4)elle llee 1917-05-27 (My Test data)

Observation:
- The FellegiSunter.ser file was created in Linkrecords(), at the line 'saveParameters(matchConfiguration.getConfigFileDirectory(), fellegiSunterParams)'. But 'mValue' and 'uValue' are still NaN.
 
Posted: December 01, 2011 21:26 by Csaba Toth
So you end up with 3 record pairs during this experiment? If you match on more than 2 or 3 fields, the EM won't have enough information to converge to anything; there will be too many unknown variables. EM works best if you have lots of record pairs (at least hundreds of them).
Maybe it's worth taking a look at the Probabilistic Match test. I don't know what it does off the top of my head, but it probably works with a small dataset too.
 
Posted: December 21, 2011 23:19 by murphyric
Hi,
I'm still stuck with the NaNs even though I managed to create a database with 200+ records in it. As suggested, I modified the mpi-config-probablistic.xml file to use only the given and family names for the blocking and for the probabilistic matching. The m and u values end up as [NaN,NaN]. I need your support to resolve this issue.
All is fine with the deterministic configuration, but I need help with the probabilistic one.

Thanks in Advance
 
Posted: December 21, 2011 23:23 by Csaba Toth
Hi murphyric,

Please post the matching configuration portion of your mpi-config and the iterations of the EM. Does the EM conclude to NaN after the first iteration? We have to debug the EM and understand why it concludes to NaN.

Csaba
 
Posted: December 21, 2011 23:34 by murphyric
Thanks for the quick reply.
Does the EM conclude to NaN after the first iteration?
Yes.


Below is the probabilistic matching part of my config file.
mpi-config-probablistic

<pm:probabilistic-matching>
    <pm:false-negative-probability>0.2</pm:false-negative-probability>
    <pm:false-positive-probability>0.2</pm:false-positive-probability>
    <pm:match-fields>
        <pm:match-field>
            <pm:field-name>givenName</pm:field-name>
            <pm:agreement-probability>0.9</pm:agreement-probability>
            <pm:disagreement-probability>0.1</pm:disagreement-probability>
            <pm:comparator-function>
                <function-name>JaroWinkler</function-name>
            </pm:comparator-function>
            <pm:match-threshold>0.85</pm:match-threshold>
        </pm:match-field>
        <pm:match-field>
            <pm:field-name>familyName</pm:field-name>
            <pm:agreement-probability>0.9</pm:agreement-probability>
            <pm:disagreement-probability>0.1</pm:disagreement-probability>
            <pm:comparator-function>
                <function-name>JaroWinkler</function-name>
            </pm:comparator-function>
            <pm:match-threshold>0.85</pm:match-threshold>
        </pm:match-field>
    </pm:match-fields>
    <pm:config-file-directory>C:\OpenEMPI_HOME\resources</pm:config-file-directory>
</pm:probabilistic-matching>



 
Posted: December 22, 2011 21:46 by murphyric

I deleted the 'FellegiSunterConfiguration.ser' initially and started the test.

estimateMarginalProbabilities code, with SOPs (System.out.println calls) added:

public synchronized void estimateMarginalProbabilities(
        FellegiSunterParameters params, double mInitial, double uInitial,
        double pInitial, int maxIterations) {
    initializeAlgorithm(params, mInitial, uInitial, pInitial);

    double error = 1.0;
    int iteration = 1;
    do {
        // Expectation Step
        estimateGammaOfM(params.getVectorCount());
        estimateGammaOfU(params.getVectorCount());

        // Maximization Step
        estimateMOfI(params.getFieldCount(), params.getVectorCount(),
                params.getVectorFrequencies());
        estimateUOfI(params.getFieldCount(), params.getVectorCount(),
                params.getVectorFrequencies());
        double pPrevious = p;
        estimateProbability(params.getVectorCount(), params
                .getVectorFrequencies());
        error = Math.abs(pPrevious - p);
        log.trace("Error at iteration " + iteration + " is " + error);
        iteration++;
    } while (error > CONVERGENCE_ERROR && iteration < maxIterations);

    params.setMValues(mOfI);
    params.setUValues(uOfI);
    System.out.println("EM: fellegiSunterParams-" + params);

    // Debug output: dump the final E-step and M-step arrays
    if (gOfM.length > 0) {
        for (int i = 0; i < gOfM.length; i++) {
            System.out.println("EM: gOfM-" + i + "-" + gOfM[i]);
        }
    }
    if (gOfU.length > 0) {
        for (int i = 0; i < gOfU.length; i++) {
            System.out.println("EM: gOfU-" + i + "-" + gOfU[i]);
        }
    }
    if (mOfI.length > 0) {
        for (int i = 0; i < mOfI.length; i++) {
            System.out.println("EM: mOfI-" + i + "-" + mOfI[i]);
        }
    }
    if (uOfI.length > 0) {
        for (int i = 0; i < uOfI.length; i++) {
            System.out.println("EM: uOfI-" + i + "-" + uOfI[i]);
        }
    }
}



Records inserted for testing
person_id  given_name  middle_name  family_name  family_name2  prefix  suffix  name_type_cd  date_of_birth
1          elle        NULL         lee          NULL          NULL    NULL    NULL          1917-12-29
2          elle        NULL         lee          NULL          NULL    NULL    NULL          1917-12-29
3          elle        NULL         lee          NULL          NULL    NULL    NULL          1917-12-29

While inserting the 1st record, below is the log:

java.io.FileNotFoundException: C:\OpenEMPI_HOME\resources\FellegiSunterConfiguration.ser


21:53:22,375 INFO [STDOUT] EM-fellegiSunterParams::org.openhie.openempi.matching.fellegisunter.FellegiSunterParameters@172003d[fieldCount=2,vectorFrequencies={0,0,0,0},vectorCount=4,mValues=
{0.0,0.0},uValues={0.0,0.0},lowerBound=0.0,upperBound=0.0,lambda=0.20000000298023224,mu=0.20000000298023224]
21:53:22,375 INFO [STDOUT] EM-fellegiSunterParams::org.openhie.openempi.matching.fellegisunter.FellegiSunterParameters@172003d[fieldCount=2,vectorFrequencies={0,0,0,0},vectorCount=4,mValues=
{0.0,0.0},uValues={0.0,0.0},lowerBound=0.0,upperBound=0.0,lambda=0.20000000298023224,mu=0.20000000298023224]
21:53:22,375 INFO [STDOUT] EM-fellegiSunterParams::org.openhie.openempi.matching.fellegisunter.FellegiSunterParameters@172003d[fieldCount=2,vectorFrequencies={0,0,0,0},vectorCount=4,mValues=
{0.0,0.0},uValues={0.0,0.0},lowerBound=0.0,upperBound=0.0,lambda=0.20000000298023224,mu=0.20000000298023224]
21:53:22,375 INFO [STDOUT] EM-fellegiSunterParams::org.openhie.openempi.matching.fellegisunter.FellegiSunterParameters@172003d[fieldCount=2,vectorFrequencies={0,0,0,0},vectorCount=4,mValues=
{0.0,0.0},uValues={0.0,0.0},lowerBound=0.0,upperBound=0.0,lambda=0.20000000298023224,mu=0.20000000298023224]
21:53:22,375 INFO [STDOUT] EM-fellegiSunterParams::org.openhie.openempi.matching.fellegisunter.FellegiSunterParameters@172003d[fieldCount=2,vectorFrequencies={0,0,0,0},vectorCount=4,mValues=
{0.0,0.0},uValues={0.0,0.0},lowerBound=0.0,upperBound=0.0,lambda=0.20000000298023224,mu=0.20000000298023224]
21:53:22,375 INFO [STDOUT] EM-fellegiSunterParams::org.openhie.openempi.matching.fellegisunter.FellegiSunterParameters@172003d[fieldCount=2,vectorFrequencies={0,0,0,0},vectorCount=4,mValues=
{0.0,0.0},uValues={0.0,0.0},lowerBound=0.0,upperBound=0.0,lambda=0.20000000298023224,mu=0.20000000298023224]
21:53:22,425 INFO [STDOUT] EM-fellegiSunterParams::org.openhie.openempi.matching.fellegisunter.FellegiSunterParameters@172003d[fieldCount=2,vectorFrequencies={0,0,0,0},vectorCount=4,mValues=
{0.0,0.0},uValues={0.0,0.0},lowerBound=0.0,upperBound=0.0,lambda=0.20000000298023224,mu=0.20000000298023224]
21:53:22,425 INFO [STDOUT] EM-fellegiSunterParams::org.openhie.openempi.matching.fellegisunter.FellegiSunterParameters@172003d[fieldCount=2,vectorFrequencies={0,0,0,0},vectorCount=4,mValues=
{0.0,0.0},uValues={0.0,0.0},lowerBound=0.0,upperBound=0.0,lambda=0.20000000298023224,mu=0.20000000298023224]
21:53:22,425 INFO [STDOUT] EM-fellegiSunterParams::org.openhie.openempi.matching.fellegisunter.FellegiSunterParameters@172003d[fieldCount=2,vectorFrequencies={0,0,0,0},vectorCount=4,mValues=
{0.0,0.0},uValues={0.0,0.0},lowerBound=0.0,upperBound=0.0,lambda=0.20000000298023224,mu=0.20000000298023224]
21:53:22,425 INFO [STDOUT] EM-fellegiSunterParams::org.openhie.openempi.matching.fellegisunter.FellegiSunterParameters@172003d[fieldCount=2,vectorFrequencies={0,0,0,0},vectorCount=4,mValues=
{0.0,0.0},uValues={0.0,0.0},lowerBound=0.0,upperBound=0.0,lambda=0.20000000298023224,mu=0.20000000298023224]
21:53:22,425 INFO [STDOUT] EM-fellegiSunterParams::org.openhie.openempi.matching.fellegisunter.FellegiSunterParameters@172003d[fieldCount=2,vectorFrequencies={0,0,0,0},vectorCount=4,mValues=
{0.0,0.0},uValues={0.0,0.0},lowerBound=0.0,upperBound=0.0,lambda=0.20000000298023224,mu=0.20000000298023224]
21:53:22,425 INFO [STDOUT] EM-fellegiSunterParams::org.openhie.openempi.matching.fellegisunter.FellegiSunterParameters@172003d[fieldCount=2,vectorFrequencies={0,0,0,0},vectorCount=4,mValues=
{0.0,0.0},uValues={0.0,0.0},lowerBound=0.0,upperBound=0.0,lambda=0.20000000298023224,mu=0.20000000298023224]
21:53:31,488 INFO [STDOUT] EM: fellegiSunterParams-org.openhie.openempi.matching.fellegisunter.FellegiSunterParameters@172003d[fieldCount=2,vectorFrequencies={0,0,0,0},vectorCount=4,mValues=
{NaN,NaN},uValues={NaN,NaN},lowerBound=0.0,upperBound=0.0,lambda=0.20000000298023224,mu=0.20000000298023224]
21:53:31,498 INFO [STDOUT] EM: gOfM-0-1.2468827930174557E-4
21:53:31,498 INFO [STDOUT] EM: gOfM-1-0.009999999999999997
21:53:31,498 INFO [STDOUT] EM: gOfM-2-0.009999999999999997
21:53:31,508 INFO [STDOUT] EM: gOfM-3-0.45
21:53:31,508 INFO [STDOUT] EM: gOfU-0-0.9998753117206983
21:53:31,508 INFO [STDOUT] EM: gOfU-1-0.99
21:53:31,548 INFO [STDOUT] EM: gOfU-2-0.99
21:53:31,548 INFO [STDOUT] EM: gOfU-3-0.55
21:53:31,548 INFO [STDOUT] EM: mOfI-0-NaN
21:53:31,548 INFO [STDOUT] EM: mOfI-1-NaN
21:53:31,548 INFO [STDOUT] EM: uOfI-0-NaN
21:53:31,548 INFO [STDOUT] EM: uOfI-1-NaN


While inserting the 2nd record, below is the log

21:54:05,487 INFO [STDOUT] EM-fellegiSunterParams::org.openhie.openempi.matching.fellegisunter.FellegiSunterParameters@ba3f75[fieldCount=2,vectorFrequencies={0,0,0,1},vectorCount=4,mValues={
0.0,0.0},uValues={0.0,0.0},lowerBound=0.0,upperBound=0.0,lambda=0.20000000298023224,mu=0.20000000298023224]
21:54:06,168 INFO [STDOUT] EM-fellegiSunterParams::org.openhie.openempi.matching.fellegisunter.FellegiSunterParameters@ba3f75[fieldCount=2,vectorFrequencies={0,0,0,1},vectorCount=4,mValues={
0.0,0.0},uValues={0.0,0.0},lowerBound=0.0,upperBound=0.0,lambda=0.20000000298023224,mu=0.20000000298023224]
21:54:06,499 INFO [STDOUT] EM-fellegiSunterParams::org.openhie.openempi.matching.fellegisunter.FellegiSunterParameters@ba3f75[fieldCount=2,vectorFrequencies={0,0,0,1},vectorCount=4,mValues={
0.0,0.0},uValues={0.0,0.0},lowerBound=0.0,upperBound=0.0,lambda=0.20000000298023224,mu=0.20000000298023224]
21:54:07,150 INFO [STDOUT] EM-fellegiSunterParams::org.openhie.openempi.matching.fellegisunter.FellegiSunterParameters@ba3f75[fieldCount=2,vectorFrequencies={0,0,0,1},vectorCount=4,mValues={
0.0,0.0},uValues={0.0,0.0},lowerBound=0.0,upperBound=0.0,lambda=0.20000000298023224,mu=0.20000000298023224]
21:54:07,590 INFO [STDOUT] EM-fellegiSunterParams::org.openhie.openempi.matching.fellegisunter.FellegiSunterParameters@ba3f75[fieldCount=2,vectorFrequencies={0,0,0,1},vectorCount=4,mValues={
0.0,0.0},uValues={0.0,0.0},lowerBound=0.0,upperBound=0.0,lambda=0.20000000298023224,mu=0.20000000298023224]
21:54:07,891 INFO [STDOUT] EM-fellegiSunterParams::org.openhie.openempi.matching.fellegisunter.FellegiSunterParameters@ba3f75[fieldCount=2,vectorFrequencies={0,0,0,1},vectorCount=4,mValues={
0.0,0.0},uValues={0.0,0.0},lowerBound=0.0,upperBound=0.0,lambda=0.20000000298023224,mu=0.20000000298023224]
21:54:08,461 INFO [STDOUT] EM-fellegiSunterParams::org.openhie.openempi.matching.fellegisunter.FellegiSunterParameters@ba3f75[fieldCount=2,vectorFrequencies={0,0,0,1},vectorCount=4,mValues={
0.0,0.0},uValues={0.0,0.0},lowerBound=0.0,upperBound=0.0,lambda=0.20000000298023224,mu=0.20000000298023224]
21:54:08,872 INFO [STDOUT] EM-fellegiSunterParams::org.openhie.openempi.matching.fellegisunter.FellegiSunterParameters@ba3f75[fieldCount=2,vectorFrequencies={0,0,0,1},vectorCount=4,mValues={
0.0,0.0},uValues={0.0,0.0},lowerBound=0.0,upperBound=0.0,lambda=0.20000000298023224,mu=0.20000000298023224]
21:54:09,122 INFO [STDOUT] EM-fellegiSunterParams::org.openhie.openempi.matching.fellegisunter.FellegiSunterParameters@ba3f75[fieldCount=2,vectorFrequencies={0,0,0,1},vectorCount=4,mValues={
0.0,0.0},uValues={0.0,0.0},lowerBound=0.0,upperBound=0.0,lambda=0.20000000298023224,mu=0.20000000298023224]
21:54:09,863 INFO [STDOUT] EM-fellegiSunterParams::org.openhie.openempi.matching.fellegisunter.FellegiSunterParameters@ba3f75[fieldCount=2,vectorFrequencies={0,0,0,1},vectorCount=4,mValues={
0.0,0.0},uValues={0.0,0.0},lowerBound=0.0,upperBound=0.0,lambda=0.20000000298023224,mu=0.20000000298023224]
21:54:10,104 INFO [STDOUT] EM-fellegiSunterParams::org.openhie.openempi.matching.fellegisunter.FellegiSunterParameters@ba3f75[fieldCount=2,vectorFrequencies={0,0,0,1},vectorCount=4,mValues={
0.0,0.0},uValues={0.0,0.0},lowerBound=0.0,upperBound=0.0,lambda=0.20000000298023224,mu=0.20000000298023224]
21:54:10,264 INFO [STDOUT] EM-fellegiSunterParams::org.openhie.openempi.matching.fellegisunter.FellegiSunterParameters@ba3f75[fieldCount=2,vectorFrequencies={0,0,0,1},vectorCount=4,mValues={
0.0,0.0},uValues={0.0,0.0},lowerBound=0.0,upperBound=0.0,lambda=0.20000000298023224,mu=0.20000000298023224]
21:55:13,415 INFO [STDOUT] EM: fellegiSunterParams-org.openhie.openempi.matching.fellegisunter.FellegiSunterParameters@ba3f75[fieldCount=2,vectorFrequencies={0,0,0,1},vectorCount=4,mValues={
NaN,NaN},uValues={NaN,NaN},lowerBound=0.0,upperBound=0.0,lambda=0.20000000298023224,mu=0.20000000298023224]
21:55:13,685 INFO [STDOUT] EM: gOfM-0-NaN
21:55:13,825 INFO [STDOUT] EM: gOfM-1-NaN
21:55:14,006 INFO [STDOUT] EM: gOfM-2-NaN
21:55:14,146 INFO [STDOUT] EM: gOfM-3-0.45
21:55:14,386 INFO [STDOUT] EM: gOfU-0-NaN
21:55:14,566 INFO [STDOUT] EM: gOfU-1-NaN
21:55:14,747 INFO [STDOUT] EM: gOfU-2-NaN
21:55:14,867 INFO [STDOUT] EM: gOfU-3-0.55
21:55:15,458 INFO [STDOUT] EM: mOfI-0-NaN
21:55:15,648 INFO [STDOUT] EM: mOfI-1-NaN
21:55:16,079 INFO [STDOUT] EM: uOfI-0-NaN
21:55:16,830 INFO [STDOUT] EM: uOfI-1-NaN



While inserting the 3rd record for matching, below are the record pairs from blocking:

[org.openhie.openempi.model.RecordPair@136d42c[leftRecord=org.openhie.openempi.model.Record@1889c9f[recordId=3,dynaBean=org.openhie.openempi.util.ConvertingWrapDynaBean@1e29df9,recordTypeDefinition=<null>],rightRecord=org.openhie.openempi.model.Record@a39c81[recordId=1,dynaBean=org.openhie.openempi.util.ConvertingWrapDynaBean@178069d,recordTypeDefinition=<null>],weight=<null>], org.openhie.openempi.model.RecordPair@fa5d5b[leftRecord=org.openhie.openempi.model.Record@1889c9f[recordId=3,dynaBean=org.openhie.openempi.util.ConvertingWrapDynaBean@1e29df9,recordTypeDefinition=<null>],rightRecord=org.openhie.openempi.model.Record@1073c56[recordId=2,dynaBean=org.openhie.openempi.util.ConvertingWrapDynaBean@a3bdc,recordTypeDefinition=<null>],weight=<null>], org.openhie.openempi.model.RecordPair@1e0095c[leftRecord=org.openhie.openempi.model.Record@1889c9f[recordId=3,dynaBean=org.openhie.openempi.util.ConvertingWrapDynaBean@1e29df9,recordTypeDefinition=<null>],rightRecord=org.openhie.openempi.model.Record@186a830[recordId=3,dynaBean=org.openhie.openempi.util.ConvertingWrapDynaBean@18c1f7d,recordTypeDefinition=<null>],weight=<null>]]
 
Posted: December 22, 2011 21:51 by murphyric
You may not be able to see the log completely. I think if you copy the whole text and paste it into Notepad, you can see the entire log.

21:53:22,375 INFO [STDOUT] EM-fellegiSunterParams::org.openhie.openempi.matching.fellegisunter.FellegiSunterParameters@172003d[fieldCount=2,
vectorFrequencies={0,0,0,0},vectorCount=4,mValues={0.0,0.0},uValues={0.0,0.0},lowerBound=0.0,upperBound=0.0,lambda=0.20000000298023224,mu=0.20000000298023224]
 
Posted: December 24, 2011 16:47 by Csaba Toth
Don't worry about the log's visual truncation.
Let's focus on the first problem I see: vectorFrequencies={0,0,0,0}

Let's say you want to link m>0 records. After the blocking you generate n>0 record pairs from those records.
The sum of the four numbers in the vector frequencies should be n. Recall what I said previously (I realized that I wasn't totally precise, so I modified it a little; read point no. 1 again):

0.) How many record pairs do you have at the end of the blocking? Do those record pairs look good/reasonable?
1.) What are the vector frequencies? (They are within the FellegiSunterParameters.) With a field count of 2 you have to have 2^2=4 combinations for the two fields: nonmatch-nonmatch, nonmatch-match, match-nonmatch, match-match. The vector frequencies tell how many of your _record_pairs_ fall into each category (given the match threshold: for every record pair the JaroWinkler is calculated for the two fields in the pair, and the threshold (0.85) tells whether it is a match or a non-match from the viewpoint of the EM).

Every record pair should fall into one of the 4 categories. Maybe you even have a problem with my 0th point: blocking doesn't give you back any record pairs, and that's why the frequencies are all 0?
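
As a rough sketch of how record pairs end up in the 2^2=4 categories, the following Java counts agreement patterns for a handful of pairs. Several things here are assumptions made for illustration: the similarity function is a plain case-insensitive equality check standing in for JaroWinkler, the bit ordering of the pattern index (bit i set when field i agrees, so the last index means match-match) is a guess, and all class and method names are hypothetical rather than OpenEMPI's.

```java
import java.util.Arrays;
import java.util.function.BiFunction;

// Illustrative only: hypothetical names, not OpenEMPI code.
final class VectorFrequencies {

    // Stand-in comparator: 1.0 for case-insensitive equality, else 0.0.
    static final BiFunction<String, String, Double> EXACT =
            (a, b) -> a.equalsIgnoreCase(b) ? 1.0 : 0.0;

    // Bit i of the result is set when field i of the pair agrees.
    static int patternIndex(String[] left, String[] right,
                            BiFunction<String, String, Double> similarity,
                            double threshold) {
        int index = 0;
        for (int i = 0; i < left.length; i++) {
            if (similarity.apply(left[i], right[i]) >= threshold) {
                index |= 1 << i;
            }
        }
        return index;
    }

    // pairs[p][0] and pairs[p][1] hold the field values of the two records.
    static int[] count(String[][][] pairs, int fieldCount,
                       BiFunction<String, String, Double> similarity,
                       double threshold) {
        int[] frequencies = new int[1 << fieldCount];
        for (String[][] pair : pairs) {
            frequencies[patternIndex(pair[0], pair[1], similarity, threshold)]++;
        }
        return frequencies;
    }

    public static void main(String[] args) {
        String[][][] pairs = {
            {{"elle", "lee"}, {"elle", "lee"}},
            {{"elle", "lee"}, {"elle", "lei"}},
        };
        System.out.println(Arrays.toString(count(pairs, 2, EXACT, 0.85)));
    }
}
```

Whatever comparator is used, every pair increments exactly one slot, so the frequencies always sum to the number of record pairs; an all-zero vector therefore means no pairs were counted at all.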

NaN appears because the EM divides by 0:
estimateGammaOfM and estimateGammaOfU (the Expectation Step) don't use the vector frequencies, so this step finishes. Then comes the Maximization Step, where both estimateMOfI and estimateUOfI use them: they are the fOfJ input parameter, which multiplies both the numerator and the denominator. Because all of its elements are 0, both operands of the final division, numeratorSumOfI/denominatorSumOfI, are 0, and 0/0 leads to NaN.
Because this is the input of the next Expectation Step, NaN propagates everywhere, and so on.
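
The division by zero described above can be reproduced in isolation. The sketch below implements a generic EM-style update of the form m_i = sum_j f_j * g_j * agree(j, i) / sum_j f_j * g_j; it is an assumption that OpenEMPI's estimateMOfI has this shape, and the class name, method name, and bit-based agree test are hypothetical. The gOfM values are copied from the first log in this thread.

```java
// Illustrative only: hypothetical names, not OpenEMPI code.
final class MStepNaN {

    // Weighted share of the agreement patterns in which the given field agrees.
    // frequencies[j] is the count of pattern j; gOfM[j] is its E-step weight.
    static double estimateM(int field, int[] frequencies, double[] gOfM) {
        double numerator = 0.0;
        double denominator = 0.0;
        for (int j = 0; j < frequencies.length; j++) {
            double weighted = frequencies[j] * gOfM[j];
            boolean agrees = ((j >> field) & 1) == 1;
            if (agrees) {
                numerator += weighted;
            }
            denominator += weighted;
        }
        // In floating point, 0.0 / 0.0 is NaN rather than an exception.
        return numerator / denominator;
    }

    public static void main(String[] args) {
        double[] gOfM = {1.2468827930174557E-4, 0.01, 0.01, 0.45};
        // All-zero frequencies: numerator and denominator are both 0.
        System.out.println(estimateM(0, new int[] {0, 0, 0, 0}, gOfM));
        // A single counted pair gives a finite value.
        System.out.println(estimateM(0, new int[] {0, 0, 0, 1}, gOfM));
    }
}
```

Checking that the frequencies sum to more than 0 before running the EM would surface the problem earlier than a NaN buried in the saved parameters.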

All of what I say is based on the assumption that your vectorFrequencies={0,0,0,0}.
I'm very curious what the cause is!
 
Posted: December 22, 2011 02:40 by Csaba Toth
That configuration looks good. The question is then: what are the variable values in the first iteration of the EM?
0.) How many record pairs do you have at the end of the blocking? Do those record pairs look good/reasonable?
1.) What are the vector frequencies? (They are within the FellegiSunterParameters.) With 2 matching fields you have to have 2^2=4 combinations for the two fields: nonmatch-nonmatch, nonmatch-match, match-nonmatch, match-match. The vector frequencies tell how many of your record pairs fall into each category (given the match threshold: for every record pair the JaroWinkler is calculated for the two field pairs, and the threshold (0.85) tells whether it is a match or a non-match from the viewpoint of the EM).
2.) Next, in estimateMarginalProbabilities: what is the end result of the first Expectation Step (the gOfM and gOfU values)?
3.) What is the end result of the first Maximization Step (the mOfI and uOfI values)?
 
Posted: January 01, 2012 18:38 by Csaba Toth
Hi murphyric,

Note that I replied to your last request on the 24th of December, but because I also corrected my other post (from the 22nd of December), the posts are not in chronological order by default. So look further up for my answer!

Just like 'techguy', I'm also interested in your SQL Server 2008 scripts.

Thanks,
Csaba
 
Posted: January 04, 2012 05:39 by murphyric
Hi Csaba & techguy,
Happy new year. My apologies for the delay.
I don't see any option here to upload a file, so I copied in the contents of the schema, but because of the large amount of text there were some issues during posting and it failed. Are there any alternatives? Please suggest.

Thanks




 
Posted: January 05, 2012 15:46 by odysseas
If you email the sql script to me, I can incorporate it into the next release of the source code and make it available immediately to the community on the project site.
Odysseas
odysseas@sysnetint.com