lucene-java-user mailing list archives

From "Devon H. O'Dell" <devon.od...@gmail.com>
Subject Re: Detecting duplicates
Date Sat, 05 Mar 2011 16:44:57 GMT
There is a DuplicateFilter class in contrib that works pretty well.
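
For the "mark but don't remove" part, a rough sketch of the idea in plain Java follows. It is neither the DuplicateFilter nor the Solr deduplication code: the DuplicateMarker class, its method names, and the choice of SHA-1 are assumptions made only for illustration. It hashes the chosen fields of each document and records, for every document, the position of the first document seen with the same hash.

```java
import java.nio.charset.StandardCharsets;
import java.security.MessageDigest;
import java.security.NoSuchAlgorithmException;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

// Hypothetical helper, not part of Lucene or Solr.
class DuplicateMarker {

    // Builds a hex SHA-1 signature over the fields that define "sameness".
    static String signature(String... fields) {
        try {
            MessageDigest md = MessageDigest.getInstance("SHA-1");
            for (String field : fields) {
                md.update(field.getBytes(StandardCharsets.UTF_8));
                md.update((byte) 0); // separator, so ("ab","c") differs from ("a","bc")
            }
            StringBuilder hex = new StringBuilder();
            for (byte b : md.digest()) {
                hex.append(String.format("%02x", b));
            }
            return hex.toString();
        } catch (NoSuchAlgorithmException e) {
            // SHA-1 must be supported by every Java platform implementation.
            throw new IllegalStateException(e);
        }
    }

    // For each document (here a single text field), returns -1 if it is the
    // first with its signature, otherwise the 0-based index of that first one.
    static int[] markDuplicates(List<String> docs) {
        Map<String, Integer> firstSeen = new HashMap<>();
        int[] duplicateOf = new int[docs.size()];
        for (int i = 0; i < docs.size(); i++) {
            Integer first = firstSeen.putIfAbsent(signature(docs.get(i)), i);
            duplicateOf[i] = (first == null) ? -1 : first;
        }
        return duplicateOf;
    }
}
```

Run over the four documents in the example quoted below, markDuplicates returns [-1, 0, -1, 0] (0-based indexes), i.e. docs 2 and 4 are marked as duplicates of doc 1 while docs 1 and 3 stay unique. The flag could then be stored in a field of its own at index time instead of dropping the document.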

2011/3/5 Grant Ingersoll <gsingers@apache.org>:
> See http://wiki.apache.org/solr/Deduplication.  Should be fairly easy to pull out if you are doing just Lucene.
>
> On Mar 5, 2011, at 1:49 AM, Mark wrote:
>
>> Is there a way one could detect duplicates (say, by using some unique hash of certain fields) and mark a document as a duplicate without removing it?
>>
>> Here is an example:
>>
>> Doc 1) This is my test
>> Doc 2) This is my test
>> Doc 3) Another test
>> Doc 4) This is my test
>>
>> Doc 1 and 3 should be considered unique whereas 2 and 4 should be marked as duplicates (of doc 1).
>>
>> Can this be easily accomplished?
>>
>> ---------------------------------------------------------------------
>> To unsubscribe, e-mail: java-user-unsubscribe@lucene.apache.org
>> For additional commands, e-mail: java-user-help@lucene.apache.org
>>
>
> --------------------------
> Grant Ingersoll
> http://www.lucidimagination.com/
>
> Search the Lucene ecosystem docs using Solr/Lucene:
> http://www.lucidimagination.com/search
>