lucene-java-user mailing list archives

Site index · List index
Message view « Date » · « Thread »
Top « Date » · « Thread »
From Mark <static.void....@gmail.com>
Subject Re: Detecting duplicates
Date Thu, 10 Mar 2011 15:35:22 GMT
My understanding is It can mark documents with the same signature 
indicating that they are similar however there is no way at query time 
to return only 1 "unique" document per signature. Am I missing something?

Doc 1) This is my test
Doc 2) This is my test
Doc 3) Another test
Doc 4) This is my test

If I run a query for "test" it should return

Doc 1) This is my test
Doc 3) Another test


On 3/10/11 6:25 AM, Grant Ingersoll wrote:
> On Mar 5, 2011, at 8:35 PM, Mark wrote:
>
>> I'm familiar with Deduplication however I do not wish to remove my duplicates and
my needs are slightly different. I would like to mark the first document with signature 'xyz'
as unique but the next one as a duplicate. This way I can filter out "duplicates" during searching
using a filter query but still return the original document.
> My understanding is that you can have it mark duplicates.
>
>> The only thing I know of at the moment is to use field collapsing but I tried the
patch on 1.4.1 and it was terribly slow.
>>
>> On 3/5/11 4:43 AM, Grant Ingersoll wrote:
>>> See http://wiki.apache.org/solr/Deduplication.  Should be fairly easy to pull
out if you are doing just Lucene.
>>>
>>> On Mar 5, 2011, at 1:49 AM, Mark wrote:
>>>
>>>> Is there a way one could detect duplicates (say by using some unique hash
of certain fields) and marking a document as a duplicate but not remove it.
>>>>
>>>> Here is an example:
>>>>
>>>> Doc 1) This is my test
>>>> Doc 2) This is my test
>>>> Doc 3) Another test
>>>> Doc 4) This is my test
>>>>
>>>> Doc 1 and 3 should be considered unique whereas 2 and 4 should be marked
as duplicates (of doc 1).
>>>>
>>>> Can this be easily accomplished?
>>>>
>>>> ---------------------------------------------------------------------
>>>> To unsubscribe, e-mail: java-user-unsubscribe@lucene.apache.org
>>>> For additional commands, e-mail: java-user-help@lucene.apache.org
>>>>
>>> --------------------------
>>> Grant Ingersoll
>>> http://www.lucidimagination.com/
>>>
>>> Search the Lucene ecosystem docs using Solr/Lucene:
>>> http://www.lucidimagination.com/search
>>>
>>>
>>> ---------------------------------------------------------------------
>>> To unsubscribe, e-mail: java-user-unsubscribe@lucene.apache.org
>>> For additional commands, e-mail: java-user-help@lucene.apache.org
>>>
>> ---------------------------------------------------------------------
>> To unsubscribe, e-mail: java-user-unsubscribe@lucene.apache.org
>> For additional commands, e-mail: java-user-help@lucene.apache.org
>>
> --------------------------
> Grant Ingersoll
> http://www.lucidimagination.com/
>
> Search the Lucene ecosystem docs using Solr/Lucene:
> http://www.lucidimagination.com/search
>
>
> ---------------------------------------------------------------------
> To unsubscribe, e-mail: java-user-unsubscribe@lucene.apache.org
> For additional commands, e-mail: java-user-help@lucene.apache.org
>

---------------------------------------------------------------------
To unsubscribe, e-mail: java-user-unsubscribe@lucene.apache.org
For additional commands, e-mail: java-user-help@lucene.apache.org


Mime
View raw message