lucene-java-user mailing list archives

From "Marco Dissel" <mdis...@home.nl>
Subject Re: finding potential duplicate documents
Date Sun, 29 May 2005 11:06:53 GMT
Any tips on this issue?

Thanks

Marco
  ----- Original Message ----- 
  From: Marco Dissel 
  To: java-user@lucene.apache.org 
  Sent: Friday, May 13, 2005 9:05 AM
  Subject: finding potential duplicate documents


  Hello

  I have many documents that are potentially duplicates (they come from merging several
external systems). Any tips on how to find the documents that are likely duplicates, using
a variable ranking threshold (for example, a match score > 0.5)?

  I can use the similarity (MoreLikeThis) method from the Sandbox, but that always compares
one document against the index. Is there a way to get back all the potential duplicate documents
in the index without iterating over every document and comparing it with all the other
documents in the index?
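
  To make the cost concrete, here is a minimal sketch of the brute-force scan described
above: compare every document with every other one and keep the pairs scoring above 0.5.
It does not use Lucene at all. The class name NaiveDuplicateScan, the word-set tokenizer
and the Jaccard overlap score are assumptions made up for illustration, standing in for
whatever score MoreLikeThis would produce.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.HashSet;
import java.util.List;
import java.util.Set;

// Hypothetical illustration only: not part of Lucene or the Sandbox.
public class NaiveDuplicateScan {

    // Lower-cased set of words in a document; a crude stand-in for a real analyzer.
    static Set<String> tokens(String text) {
        Set<String> result = new HashSet<>();
        for (String t : text.toLowerCase().split("\\W+")) {
            if (!t.isEmpty()) {
                result.add(t);
            }
        }
        return result;
    }

    // Jaccard similarity: size of intersection / size of union, between 0 and 1.
    static double similarity(Set<String> a, Set<String> b) {
        if (a.isEmpty() && b.isEmpty()) {
            return 0.0;
        }
        Set<String> inter = new HashSet<>(a);
        inter.retainAll(b);
        int union = a.size() + b.size() - inter.size();
        return (double) inter.size() / union;
    }

    // Compares each pair of documents once, so n documents cost n*(n-1)/2 comparisons.
    static List<int[]> findPairs(List<String> docs, double threshold) {
        List<Set<String>> sets = new ArrayList<>();
        for (String d : docs) {
            sets.add(tokens(d));
        }
        List<int[]> pairs = new ArrayList<>();
        for (int i = 0; i < sets.size(); i++) {
            for (int j = i + 1; j < sets.size(); j++) {
                if (similarity(sets.get(i), sets.get(j)) > threshold) {
                    pairs.add(new int[] {i, j});
                }
            }
        }
        return pairs;
    }

    public static void main(String[] args) {
        List<String> docs = Arrays.asList(
            "Acme Corp, 12 Main Street, Amsterdam",
            "ACME Corp 12 Main Street Amsterdam NL",
            "Completely unrelated record about something else");
        for (int[] p : findPairs(docs, 0.5)) {
            System.out.println(p[0] + " ~ " + p[1]);
        }
    }
}
```

  The nested loop is the part that does not scale: the number of comparisons grows
quadratically with the number of documents, which is what the question is trying to avoid.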

  Thanks
  Marco

