lucene-java-user mailing list archives

Site index · List index
Message view « Date » · « Thread »
Top « Date » · « Thread »
From "Find Me" <findm...@gmail.com>
Subject Re: near duplicates
Date Tue, 24 Oct 2006 14:51:14 GMT
It doesn't make sense to eliminate near duplicates during search time. But
if you are trying to cluster duplicates together then probably you want to
look at Carrot.

On 10/24/06, Beto Siless <beto@tera-code.com.ar> wrote:
>
> Hi Andrej!
>
> I'm taking a look to fuzzy signatures for near duplicate detection and
> and I have seen your TextProfileSignature. The question is: If I index
> the documents with their text signature, is there a way to filter near
> duplicates at search time without comparing each document with all other?
>
> Thanks
> Beto
>
> Andrzej Bialecki wrote:
> > karl wettin wrote:
> >>
> >> 17 okt 2006 kl. 17.54 skrev Find Me:
> >>
> >>> How to eliminate near duplicates from the index?
> >>
> >> I would probably try to measure the Ecludian distance between all
> >> documents, computed on terms and their positions. Or perhaps use
> >> standard deviation to find the distribution of terms in a document.
> >> One would based on the output from that try to find a threashold.
> >> Either way it will consume lots of CPU.
> >
> >
> > There are better ways to achieve this. You need to create a fuzzy
> > signature of the document, based on term histogram or shingles - take a
> > look a the Signature framework in Nutch.
> >
> > There is a substantial literature on this subject - go to Citeseer and
> > run a search for "near duplicate detection".
> >
>
> ---------------------------------------------------------------------
> To unsubscribe, e-mail: java-user-unsubscribe@lucene.apache.org
> For additional commands, e-mail: java-user-help@lucene.apache.org
>
>

Mime
  • Unnamed multipart/alternative (inline, None, 0 bytes)
View raw message