lucene-java-user mailing list archives

From Mobius ReX <aoi...@gmail.com>
Subject Fwd: any project for record linkage, fuzzy grouping, and deduplication based on Solr/Lucene?
Date Mon, 17 Mar 2014 18:02:19 GMT
---------- Forwarded message ----------

For example, suppose a large new department is formed by merging three
departments. A few employees worked for two or three of those departments
before the merger, so one person's attributes might be listed in several
departments' databases. An additional problem is that one person can appear
under different first names or nicknames.

The attributes of a person include
first name, last name, email, home phone, cell phone, SSN, address, etc.

Because any of these values can be empty, there is no unique primary
key.
Hence, we need an intelligent solution for classifying records, one that
can assign weights to different matching rules.
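
To make the idea of weighted matching rules concrete, here is a minimal
Python sketch, not tied to Solr/Lucene or any existing project. The field
names, the weight values, and the helpers normalize and match_score are all
hypothetical choices made for illustration. Fields that are empty in either
record are skipped, so missing values neither help nor hurt the score.

```python
from typing import Dict, Optional

Record = Dict[str, Optional[str]]

# Hypothetical weights: strong identifiers count more than weak ones.
WEIGHTS = {
    "ssn": 5.0,
    "email": 4.0,
    "cell_phone": 3.0,
    "home_phone": 2.0,
    "last_name": 1.5,
    "address": 1.0,
    "first_name": 0.5,  # low weight because nicknames make it unreliable
}


def normalize(value: Optional[str]) -> str:
    """Lowercase and collapse whitespace; None becomes an empty string."""
    if value is None:
        return ""
    return " ".join(value.lower().split())


def match_score(a: Record, b: Record, weights: Dict[str, float] = WEIGHTS) -> float:
    """Weighted share of agreeing fields, over fields present in both records.

    Returns a value in [0, 1], or 0.0 when no field can be compared.
    """
    agreed = 0.0
    compared = 0.0
    for field, weight in weights.items():
        va = normalize(a.get(field))
        vb = normalize(b.get(field))
        if not va or not vb:
            continue  # an empty value on either side says nothing
        compared += weight
        if va == vb:
            agreed += weight
    return agreed / compared if compared else 0.0
```

A pair would then be treated as the same person when match_score exceeds
some threshold tuned on known duplicates. Exact equality per field is the
simplest possible rule; a fuzzy string comparison could be substituted for
names and addresses.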

Any tips for handling such fast, runtime deduplication tasks on big data
(about 100 million records)?
Is there any open-source project working on this?
