lucene-java-user mailing list archives

From Grant Ingersoll <gsing...@apache.org>
Subject Re: Detecting duplicates
Date Sat, 05 Mar 2011 12:43:39 GMT
See http://wiki.apache.org/solr/Deduplication. It should be fairly easy to pull the relevant
code out if you are using just Lucene.
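
To make the idea in the question below concrete, here is a minimal sketch in plain Java. It is not the Solr implementation; it assumes that a "duplicate" means an exact match on the hashed field values, and the names DuplicateMarker, signature and markDuplicates are hypothetical. Each document gets a signature, and any document whose signature has already been seen is flagged rather than dropped.

```java
import java.nio.charset.StandardCharsets;
import java.security.MessageDigest;
import java.security.NoSuchAlgorithmException;
import java.util.ArrayList;
import java.util.Arrays;
import java.util.HashSet;
import java.util.List;
import java.util.Set;

// Hypothetical helper for illustration only; not part of Lucene or Solr.
final class DuplicateMarker {

    /** Hex-encoded MD5 over the given field values; used as the document signature. */
    public static String signature(String... fieldValues) {
        try {
            MessageDigest md5 = MessageDigest.getInstance("MD5");
            for (String value : fieldValues) {
                md5.update(value.getBytes(StandardCharsets.UTF_8));
                md5.update((byte) 0); // separator so ("ab","c") differs from ("a","bc")
            }
            StringBuilder hex = new StringBuilder();
            for (byte b : md5.digest()) {
                hex.append(String.format("%02x", b));
            }
            return hex.toString();
        } catch (NoSuchAlgorithmException e) {
            throw new IllegalStateException("MD5 not available", e);
        }
    }

    /** Returns one flag per document: true if an earlier document had the same signature. */
    public static List<Boolean> markDuplicates(List<String> docs) {
        Set<String> seen = new HashSet<>();
        List<Boolean> duplicateFlags = new ArrayList<>();
        for (String text : docs) {
            // Set.add returns false when the signature was already present.
            boolean firstOccurrence = seen.add(signature(text));
            duplicateFlags.add(!firstOccurrence);
        }
        return duplicateFlags;
    }

    public static void main(String[] args) {
        List<String> docs = Arrays.asList(
                "This is my test", "This is my test", "Another test", "This is my test");
        System.out.println(markDuplicates(docs)); // prints [false, true, false, true]
    }
}
```

In an index, the flag (and possibly the signature itself) would presumably be stored as an extra field on each document, so that duplicates stay searchable but can be filtered out at query time.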

On Mar 5, 2011, at 1:49 AM, Mark wrote:

> Is there a way to detect duplicates (say, by using a unique hash of certain fields) and
> mark a document as a duplicate without removing it?
> 
> Here is an example:
> 
> Doc 1) This is my test
> Doc 2) This is my test
> Doc 3) Another test
> Doc 4) This is my test
> 
> Docs 1 and 3 should be considered unique, whereas 2 and 4 should be marked as duplicates
> (of doc 1).
> 
> Can this be easily accomplished?
> 
> ---------------------------------------------------------------------
> To unsubscribe, e-mail: java-user-unsubscribe@lucene.apache.org
> For additional commands, e-mail: java-user-help@lucene.apache.org
> 

--------------------------
Grant Ingersoll
http://www.lucidimagination.com/

Search the Lucene ecosystem docs using Solr/Lucene:
http://www.lucidimagination.com/search

