lucene-java-user mailing list archives

From Alexander Aristov <alexander.aris...@gmail.com>
Subject Re: Detecting duplicates
Date Thu, 10 Mar 2011 15:49:40 GMT
Did you check this?
http://wiki.apache.org/solr/Deduplication


Best Regards
Alexander Aristov


On 10 March 2011 18:35, Mark <static.void.dev@gmail.com> wrote:

> My understanding is that it can mark documents with the same signature,
> indicating that they are similar; however, there is no way at query time to
> return only one "unique" document per signature. Am I missing something?
>
>
> Doc 1) This is my test
> Doc 2) This is my test
> Doc 3) Another test
> Doc 4) This is my test
>
> If I run a query for "test" it should return
>
>
> Doc 1) This is my test
> Doc 3) Another test
>
>
>
> On 3/10/11 6:25 AM, Grant Ingersoll wrote:
>
>> On Mar 5, 2011, at 8:35 PM, Mark wrote:
>>
>>> I'm familiar with Deduplication; however, I do not wish to remove my
>>> duplicates, and my needs are slightly different. I would like to mark the
>>> first document with signature 'xyz' as unique but the next one as a
>>> duplicate. This way I can filter out "duplicates" during searching using a
>>> filter query but still return the original document.
>>>
>> My understanding is that you can have it mark duplicates.
>>
>>> The only thing I know of at the moment is field collapsing, but I
>>> tried the patch on 1.4.1 and it was terribly slow.
>>>
>>> On 3/5/11 4:43 AM, Grant Ingersoll wrote:
>>>
>>>> See http://wiki.apache.org/solr/Deduplication.  Should be fairly easy
>>>> to pull out if you are doing just Lucene.
>>>>
>>>> On Mar 5, 2011, at 1:49 AM, Mark wrote:
>>>>
>>>>> Is there a way one could detect duplicates (say, by using some unique
>>>>> hash of certain fields) and mark a document as a duplicate without
>>>>> removing it?
>>>>>
>>>>> Here is an example:
>>>>>
>>>>> Doc 1) This is my test
>>>>> Doc 2) This is my test
>>>>> Doc 3) Another test
>>>>> Doc 4) This is my test
>>>>>
>>>>> Doc 1 and 3 should be considered unique whereas 2 and 4 should be
>>>>> marked as duplicates (of doc 1).
>>>>>
>>>>> Can this be easily accomplished?
>>>>>
>>>>> ---------------------------------------------------------------------
>>>>> To unsubscribe, e-mail: java-user-unsubscribe@lucene.apache.org
>>>>> For additional commands, e-mail: java-user-help@lucene.apache.org
>>>>>
>>>> --------------------------
>>>> Grant Ingersoll
>>>> http://www.lucidimagination.com/
>>>>
>>>> Search the Lucene ecosystem docs using Solr/Lucene:
>>>> http://www.lucidimagination.com/search
>>>>
>>>>
>>>
>>
>>
>
>
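The marking scheme Mark describes (the first document with a given signature stays "unique", later ones are flagged as duplicates but kept) can be sketched in plain Java without any Lucene or Solr dependency. This is only an illustration under stated assumptions: the class `DuplicateMarker`, the type `MarkedDoc`, and the use of MD5 over the whole document text are hypothetical choices, not something taken from the thread or from the Deduplication wiki page.

```java
import java.nio.charset.StandardCharsets;
import java.security.MessageDigest;
import java.security.NoSuchAlgorithmException;
import java.util.ArrayList;
import java.util.HashSet;
import java.util.List;
import java.util.Set;

/** Hypothetical helper for illustration; not part of Lucene or Solr. */
final class DuplicateMarker {

    /** A document together with its signature and duplicate flag. */
    static final class MarkedDoc {
        final String text;
        final String signature;
        final boolean duplicate;

        MarkedDoc(String text, String signature, boolean duplicate) {
            this.text = text;
            this.signature = signature;
            this.duplicate = duplicate;
        }
    }

    /** Hex-encoded MD5 of the text (assumption: any stable hash of the chosen fields would do). */
    static String signature(String text) {
        try {
            MessageDigest md = MessageDigest.getInstance("MD5");
            byte[] digest = md.digest(text.getBytes(StandardCharsets.UTF_8));
            StringBuilder hex = new StringBuilder();
            for (byte b : digest) {
                hex.append(String.format("%02x", b));
            }
            return hex.toString();
        } catch (NoSuchAlgorithmException e) {
            throw new IllegalStateException("MD5 not available", e);
        }
    }

    /**
     * Walks the documents in order. The first document seen with a given
     * signature is marked unique; every later one is marked as a duplicate.
     * No document is dropped.
     */
    static List<MarkedDoc> mark(List<String> docs) {
        Set<String> seen = new HashSet<>();
        List<MarkedDoc> out = new ArrayList<>();
        for (String text : docs) {
            String sig = signature(text);
            boolean duplicate = !seen.add(sig); // add() returns false if already present
            out.add(new MarkedDoc(text, sig, duplicate));
        }
        return out;
    }
}
```

With the four example documents from the thread, `mark` flags Doc 2 and Doc 4 as duplicates and leaves Doc 1 and Doc 3 unique. If the flag were indexed in a boolean field (the field name `duplicate` is an assumption), a filter query such as `fq=duplicate:false` could then hide the duplicates at search time while keeping them in the index. Note that the in-memory set only covers a single indexing run; across runs the seen signatures would have to be looked up in the index itself.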
