lucene-java-user mailing list archives

From Rene Wiermer <>
Subject Re: Usage of Lucene/Hibernate Search for Contacts Merging operation
Date Mon, 12 Oct 2009 12:53:07 GMT

nitingupta183 wrote:
> Hi all,
> I am supposed to add a feature in which my app will detect the duplicate
> contacts of a user on the basis of their name, email, mobile number
> etc.(i.e. Contacts Duplicate Killer kind of feature). The simplest algo i
> can think of is find all the contacts on the basis of their name, email and
> mobile and then run the loop to determine which all contacts have similar
> entries. But i think this algo will have worst performance.

Try to prune your search space. It is reasonable to assume that there
are not too many duplicates overall.
You can use IndexReader.terms() to enumerate the indexed terms and
docFreq() to check how many documents contain each term.

E.g. go through all email terms and process only those whose docFreq
is > 1. Add the documents matching each such email term to a "possible
identical contacts" container.
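
A minimal sketch of this pruning step, in plain Java so that it runs
without an index. The map from document id to email value is a
hypothetical stand-in for what you would read from the index via
terms() and docFreq(); the class and method names are made up for
illustration.

```java
import java.util.ArrayList;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

// Hypothetical stand-in for the index: groups document ids by the value
// of one field and keeps only values found in more than one document.
class DuplicateCandidates {
    static Map<String, List<Integer>> candidates(Map<Integer, String> fieldByDocId) {
        Map<String, List<Integer>> postings = new LinkedHashMap<>();
        for (Map.Entry<Integer, String> e : fieldByDocId.entrySet()) {
            // Collect the ids of all documents that share this field value.
            postings.computeIfAbsent(e.getValue(), k -> new ArrayList<>()).add(e.getKey());
        }
        // Prune: a value seen in a single document (docFreq == 1)
        // cannot point to a duplicate.
        postings.values().removeIf(docs -> docs.size() < 2);
        return postings;
    }
}
```

Each surviving entry is one "possible identical contacts" container.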

Repeat the same with birth dates, phone numbers and names, preferably
with some normalization.
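
What "some normalization" means depends on your data. The following is
only a sketch under simple assumptions: emails compared
case-insensitively, phone numbers reduced to their digits (country
prefixes are not handled), names lower-cased with whitespace collapsed.
The class and method names are hypothetical.

```java
import java.util.Locale;

// Hypothetical helpers that bring field values into a comparable form
// before they are indexed or compared.
class ContactNormalizer {
    // Trim and lower-case so "Rene@Example.ORG" equals "rene@example.org".
    static String email(String raw) {
        return raw.trim().toLowerCase(Locale.ROOT);
    }

    // Keep digits only; separators such as spaces, "/", "-" and "()" go.
    // Assumption: no country-code handling ("+49" vs. leading "0").
    static String phone(String raw) {
        return raw.replaceAll("[^0-9]", "");
    }

    // Trim, collapse runs of whitespace, lower-case.
    static String name(String raw) {
        return raw.trim().replaceAll("\\s+", " ").toLowerCase(Locale.ROOT);
    }
}
```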

Then merge those "possible identical contacts" containers that share a
common document.

Container 1    Container 2    Merged Container
A, B           B, C           A, B, C

(Implementation note: keep track of which containers each document is
in with a look-up table:  A -> 1; B -> 1,2,3,6; C -> 2  etc.)
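
One possible way to do the merge, sketched in plain Java. Note that this
uses a union-find structure instead of the look-up table from the note
above; it should produce the same merged containers. Documents are
represented by string ids, and all names here are hypothetical.

```java
import java.util.ArrayList;
import java.util.Collection;
import java.util.HashMap;
import java.util.List;
import java.util.Map;
import java.util.Set;
import java.util.TreeMap;
import java.util.TreeSet;

// Merges containers that share at least one document (union-find).
class ContainerMerger {
    private final Map<String, String> parent = new HashMap<>();

    // Returns the representative of the group that doc belongs to.
    private String find(String doc) {
        parent.putIfAbsent(doc, doc);
        String root = doc;
        while (!parent.get(root).equals(root)) root = parent.get(root);
        // Path compression: point every visited node directly at the root.
        while (!parent.get(doc).equals(root)) {
            String next = parent.get(doc);
            parent.put(doc, root);
            doc = next;
        }
        return root;
    }

    private void union(String a, String b) {
        String ra = find(a), rb = find(b);
        if (!ra.equals(rb)) parent.put(ra, rb);
    }

    static List<Set<String>> merge(Collection<? extends Collection<String>> containers) {
        ContainerMerger m = new ContainerMerger();
        for (Collection<String> container : containers) {
            String first = null;
            for (String doc : container) {
                if (first == null) { first = doc; m.find(doc); }
                else m.union(first, doc);   // same container => same group
            }
        }
        // Collect the documents of each group under its representative.
        Map<String, Set<String>> groups = new TreeMap<>();
        for (String doc : new ArrayList<>(m.parent.keySet())) {
            groups.computeIfAbsent(m.find(doc), k -> new TreeSet<>()).add(doc);
        }
        return new ArrayList<>(groups.values());
    }
}
```

Calling merge with the containers [A, B] and [B, C] returns a single
group containing A, B and C, as in the table above.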

Then compare the documents inside each container with each other and
decide which contacts you want to merge and which not.

> I am currently using Hibernate. I got to know about Hibernate Search/Lucene.
> Can I use these solutions for this task. I am asking this on the basis that
> Lucene already implements algos such as Levenshtein_distance. May be I can
> harness the Lucene power to make this task efficient.

Try using a Soundex or Metaphone analyzer for similarity. They map
similar-sounding strings to a single value and are much easier to handle
in the Lucene framework than numeric measures like Levenshtein.

There are examples in Lucene contrib.
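
To illustrate what "map similar-sounding strings to a single value"
means, here is a small self-contained sketch of American Soundex in
plain Java. It is not the Lucene contrib code, the class name is
hypothetical, and it only handles the letters A-Z (umlauts and other
characters are skipped).

```java
// Minimal American Soundex: first letter plus three digits.
class SimpleSoundex {
    // Digit codes for A..Z; '0' marks uncoded letters (vowels, H, W, Y).
    private static final String CODES = "01230120022455012623010202";

    static String encode(String name) {
        StringBuilder out = new StringBuilder();
        char lastCode = '0';
        for (char raw : name.toCharArray()) {
            char c = Character.toUpperCase(raw);
            if (c < 'A' || c > 'Z') continue;        // assumption: A-Z only
            char code = CODES.charAt(c - 'A');
            if (out.length() == 0) {
                out.append(c);                       // keep the first letter
                lastCode = code;
            } else {
                // Append a digit unless it repeats the previous code.
                if (code != '0' && code != lastCode) out.append(code);
                // H and W do not separate equal codes; vowels do.
                if (c != 'H' && c != 'W') lastCode = code;
            }
            if (out.length() == 4) break;
        }
        if (out.length() == 0) return "";
        while (out.length() < 4) out.append('0');    // pad short codes
        return out.toString();
    }
}
```

With this, "Robert" and "Rupert" both encode to R163, and "Meyer" and
"Meier" both encode to M600, so indexing the encoded value lets an
exact term match stand in for a similarity search.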

--
Rene Wiermer
