lucene-java-user mailing list archives

From Otis Gospodnetic <otis_gospodne...@yahoo.com>
Subject Re: Detecting duplicates
Date Wed, 09 Mar 2011 06:15:38 GMT
Mark,

Keep in mind that there are actually multiple patches for this: SOLR-236 and
SOLR-1086, IIRC.
Also, I just noticed this is java-user@lucene. You may want to continue on
solr-user@lucene.

Otis
----
Sematext :: http://sematext.com/ :: Solr - Lucene - Nutch
Lucene ecosystem search :: http://search-lucene.com/



----- Original Message ----
> From: Mark <static.void.dev@gmail.com>
> To: java-user@lucene.apache.org
> Sent: Sat, March 5, 2011 8:35:13 PM
> Subject: Re: Detecting duplicates
> 
> I'm familiar with Deduplication; however, I do not wish to remove my
> duplicates, and my needs are slightly different. I would like to mark the
> first document with signature 'xyz' as unique but the next one as a
> duplicate. This way I can filter out "duplicates" during searching using
> a filter query but still return the original document.
> 
> The only thing I know of at the moment is to use field collapsing, but I
> tried the patch on 1.4.1 and it was terribly slow.
> 
> On 3/5/11 4:43 AM, Grant Ingersoll wrote:
> > See http://wiki.apache.org/solr/Deduplication. Should be fairly easy to
> > pull out if you are doing just Lucene.
> >
> > On Mar 5, 2011, at 1:49 AM, Mark wrote:
> >
> >> Is there a way one could detect duplicates (say, by using some unique hash
> >> of certain fields) and mark a document as a duplicate without removing it?
> >>
> >> Here is an example:
> >>
> >> Doc 1) This is my test
> >> Doc 2) This is my test
> >> Doc 3) Another test
> >> Doc 4) This is my test
> >>
> >> Docs 1 and 3 should be considered unique, whereas 2 and 4 should be marked as
> >> duplicates (of doc 1).
> >>
> >> Can this be easily accomplished?
> >>
> >>  ---------------------------------------------------------------------
> >>  To unsubscribe, e-mail: java-user-unsubscribe@lucene.apache.org
> >>  For additional commands, e-mail: java-user-help@lucene.apache.org
> >>
> >  --------------------------
> > Grant Ingersoll
> > http://www.lucidimagination.com/
> >
> > Search the Lucene  ecosystem docs using Solr/Lucene:
> > http://www.lucidimagination.com/search

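For reference, here is a minimal sketch of the marking scheme Mark describes in the quoted message: hash the chosen fields into a signature, treat the first document seen with a given signature as unique, and flag every later one as a duplicate. This is plain Java with no Lucene or Solr calls. The class name DuplicateMarker, the choice of MD5, and the in-memory set of seen signatures are all assumptions made for illustration, not anything taken from the Deduplication patch or the field collapsing patches.

```java
import java.nio.charset.StandardCharsets;
import java.security.MessageDigest;
import java.security.NoSuchAlgorithmException;
import java.util.HashSet;
import java.util.Set;

// Hypothetical helper: flags documents whose field signature was already seen.
public class DuplicateMarker {
    // Signatures of documents seen so far (assumption: they fit in memory).
    private final Set<String> seenSignatures = new HashSet<>();

    // Builds a hex MD5 signature over the given field values.
    static String signature(String... fields) {
        try {
            MessageDigest md = MessageDigest.getInstance("MD5");
            for (String field : fields) {
                md.update(field.getBytes(StandardCharsets.UTF_8));
                md.update((byte) 0); // separator so ("ab","c") differs from ("a","bc")
            }
            StringBuilder hex = new StringBuilder();
            for (byte b : md.digest()) {
                hex.append(String.format("%02x", b));
            }
            return hex.toString();
        } catch (NoSuchAlgorithmException e) {
            throw new IllegalStateException("MD5 not available", e);
        }
    }

    // Returns false for the first document with a signature, true for later ones.
    boolean isDuplicate(String... fields) {
        return !seenSignatures.add(signature(fields));
    }

    public static void main(String[] args) {
        DuplicateMarker marker = new DuplicateMarker();
        String[] docs = {"This is my test", "This is my test", "Another test", "This is my test"};
        for (int i = 0; i < docs.length; i++) {
            System.out.println("Doc " + (i + 1) + " duplicate=" + marker.isDuplicate(docs[i]));
        }
    }
}
```

With the four example documents from the thread, docs 1 and 3 come back as unique and docs 2 and 4 as duplicates. At index time the boolean could be stored in a field of its own (the field name is up to you) so that a filter query can exclude the flagged documents while the first copy is still returned. The in-memory set does not survive a restart, so a real setup would need to look up existing signatures in the index instead.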

