lucene-java-user mailing list archives

From "Marco Dissel" <>
Subject Re: finding potential duplicate documents
Date Sun, 29 May 2005 11:06:53 GMT
Any tips on this issue?


  ----- Original Message ----- 
  From: Marco Dissel 
  Sent: Friday, May 13, 2005 9:05 AM
  Subject: finding potential duplicate documents


  I've got many documents that are potential duplicates (the result of merging several external systems).
Any tips on how to find documents that are potential duplicates, using a variable ranking threshold
(for example, a match score > 0.5)?

  I can use the similarity (MoreLikeThis) method from the Sandbox, but that always compares
one document with the index. Is there a way to get back all the potential duplicate documents
in the index without iterating over every document in the index and comparing it with the other
documents in the index?
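
  For illustration only: one common way to avoid comparing every document with every
other document is MinHash signatures with locality-sensitive hashing (LSH) banding. This is
not something the thread proposes and it is not a Lucene API. It is a plain-Python sketch
that assumes the document texts are available as strings, uses Jaccard similarity over word
shingles as a stand-in for the "match" score, and takes 0.5 as the threshold from the
question. All function names and parameters below are hypothetical.

```python
import hashlib
from collections import defaultdict
from itertools import combinations


def shingles(text, k=3):
    """Return the set of k-word shingles of a text (lowercased, whitespace split)."""
    words = text.lower().split()
    if not words:
        return set()
    if len(words) < k:
        return {" ".join(words)}
    return {" ".join(words[i:i + k]) for i in range(len(words) - k + 1)}


def _hash(seed, token):
    """Deterministic 64-bit hash of a token, varied by an integer seed."""
    data = f"{seed}:{token}".encode("utf-8")
    return int.from_bytes(hashlib.blake2b(data, digest_size=8).digest(), "big")


def minhash_signature(shingle_set, num_hashes=64):
    """For each seeded hash function, keep the minimum hash over the set."""
    return tuple(min(_hash(seed, s) for s in shingle_set)
                 for seed in range(num_hashes))


def jaccard(a, b):
    """Exact Jaccard similarity of two sets."""
    return len(a & b) / len(a | b) if (a or b) else 0.0


def candidate_pairs(signatures, bands=32):
    """Pair up documents whose signatures agree on at least one whole band."""
    if not signatures:
        return set()
    rows = len(next(iter(signatures.values()))) // bands
    buckets = defaultdict(set)
    for doc_id, sig in signatures.items():
        for b in range(bands):
            buckets[(b, sig[b * rows:(b + 1) * rows])].add(doc_id)
    pairs = set()
    for ids in buckets.values():
        pairs.update(combinations(sorted(ids), 2))
    return pairs


def find_duplicates(docs, threshold=0.5, k=3, num_hashes=64, bands=32):
    """Return sorted (id_a, id_b, score) for candidate pairs scoring above threshold."""
    sets = {d: shingles(t, k) for d, t in docs.items()}
    sets = {d: s for d, s in sets.items() if s}  # skip empty documents
    sigs = {d: minhash_signature(s, num_hashes) for d, s in sets.items()}
    result = []
    for a, b in candidate_pairs(sigs, bands):
        score = jaccard(sets[a], sets[b])  # verify each candidate exactly
        if score > threshold:
            result.append((a, b, score))
    return sorted(result)
```

  Only documents that share a bucket are compared exactly, so the number of full comparisons
depends on how many candidates collide rather than on the square of the index size. The
candidate step is probabilistic: a pair just above the threshold can occasionally be missed,
and the number of bands and rows trades recall against extra comparisons.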

