lucene-java-user mailing list archives

From Grant Ingersoll <gsing...@apache.org>
Subject Re: Detecting duplicates
Date Thu, 10 Mar 2011 14:25:44 GMT

On Mar 5, 2011, at 8:35 PM, Mark wrote:

> I'm familiar with Deduplication; however, I do not wish to remove my duplicates, and my
> needs are slightly different. I would like to mark the first document with signature 'xyz'
> as unique but the next one as a duplicate. This way I can filter out "duplicates" during searching
> using a filter query but still return the original document.

My understanding is that you can configure Solr's deduplication to mark duplicates instead of removing them.
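
For the plain-Lucene case, the marking idea from the original question can be sketched roughly as follows. This is only a minimal illustration, not Solr's deduplication code: the class name `DuplicateMarker`, the choice of MD5, and the in-memory set of seen signatures are all assumptions made for the example. A real indexer would need to persist the signatures or look them up in the index.

```java
import java.nio.charset.StandardCharsets;
import java.security.MessageDigest;
import java.security.NoSuchAlgorithmException;
import java.util.HashSet;
import java.util.Set;

// Hypothetical helper for illustration; not part of Lucene or Solr.
class DuplicateMarker {
    // Signatures seen so far. Kept in memory for this sketch only.
    private final Set<String> seen = new HashSet<>();

    // Hex-encoded MD5 over the chosen field values. A zero byte is fed in
    // after each value so that ("ab", "c") and ("a", "bc") hash differently.
    static String signature(String... fieldValues) {
        try {
            MessageDigest md = MessageDigest.getInstance("MD5");
            for (String value : fieldValues) {
                md.update(value.getBytes(StandardCharsets.UTF_8));
                md.update((byte) 0);
            }
            StringBuilder hex = new StringBuilder();
            for (byte b : md.digest()) {
                hex.append(String.format("%02x", b));
            }
            return hex.toString();
        } catch (NoSuchAlgorithmException e) {
            // MD5 is required on every Java platform, so this is not expected.
            throw new IllegalStateException(e);
        }
    }

    // Returns false for the first document with a given signature and
    // true for every later document that has the same signature.
    boolean isDuplicate(String... fieldValues) {
        return !seen.add(signature(fieldValues));
    }
}
```

With the four documents from the example in the original question, `isDuplicate` would return false for Doc 1 and Doc 3 and true for Doc 2 and Doc 4. The result could then be indexed as an extra field on each document (the field name is up to you) and excluded at search time with a filter, which keeps the original document searchable.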

> 
> The only thing I know of at the moment is to use field collapsing, but I tried the patch
> on 1.4.1 and it was terribly slow.
> 
> On 3/5/11 4:43 AM, Grant Ingersoll wrote:
>> See http://wiki.apache.org/solr/Deduplication.  Should be fairly easy to pull out
>> if you are doing just Lucene.
>> 
>> On Mar 5, 2011, at 1:49 AM, Mark wrote:
>> 
>>> Is there a way one could detect duplicates (say, by using some unique hash of
>>> certain fields) and mark a document as a duplicate but not remove it?
>>> 
>>> Here is an example:
>>> 
>>> Doc 1) This is my test
>>> Doc 2) This is my test
>>> Doc 3) Another test
>>> Doc 4) This is my test
>>> 
>>> Doc 1 and 3 should be considered unique whereas 2 and 4 should be marked as duplicates
>>> (of doc 1).
>>> 
>>> Can this be easily accomplished?
>>> 
>>> ---------------------------------------------------------------------
>>> To unsubscribe, e-mail: java-user-unsubscribe@lucene.apache.org
>>> For additional commands, e-mail: java-user-help@lucene.apache.org
>>> 
>> --------------------------
>> Grant Ingersoll
>> http://www.lucidimagination.com/
>> 
>> Search the Lucene ecosystem docs using Solr/Lucene:
>> http://www.lucidimagination.com/search

--------------------------
Grant Ingersoll
http://www.lucidimagination.com/


