lucene-solr-user mailing list archives

From Furkan KAMACI <furkankam...@gmail.com>
Subject Re: Near Duplicate Document Detection at Solr
Date Sun, 22 Sep 2013 19:14:23 GMT
I also know that Solr has another mechanism:
http://wiki.apache.org/solr/Deduplication. I think that I should add a
custom signature, because that is the most usable option for me:
http://wiki.apache.org/solr/TextProfileSignature. On the other hand, are
there any limitations on deduplication in SolrCloud?
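
To make the signature idea concrete, here is a minimal Python sketch of a
text-profile style fuzzy signature, as I understand the approach from the
TextProfileSignature wiki page: tokenize, count term frequencies, round the
counts down to a quantization step, drop the rare terms, and hash what is
left. The function names, the defaults (quant_rate=0.01, min_token_len=2) and
the tie-breaking sort order are my own assumptions for illustration. It is not
Solr's implementation and will not produce the same hashes as Solr.

```python
import hashlib
import re
from collections import Counter


def quantized_profile(text, quant_rate=0.01, min_token_len=2):
    """Return (token, quantized_count) pairs, most frequent first.

    Hypothetical helper for illustration; not part of any Solr API.
    """
    # Lowercase, then keep only runs of ASCII letters and digits as tokens.
    tokens = [
        tok
        for tok in re.findall(r"[a-z0-9]+", text.lower())
        if len(tok) >= min_token_len
    ]
    counts = Counter(tokens)
    if not counts:
        return []

    max_freq = max(counts.values())
    # Quantization step, derived from the most frequent token.
    quant = round(max_freq * quant_rate)
    if quant < 2:
        quant = 2 if max_freq > 1 else 1

    profile = []
    for tok, cnt in counts.items():
        rounded = (cnt // quant) * quant  # round down to a multiple of quant
        if rounded >= quant:  # rare tokens fall out of the profile
            profile.append((tok, rounded))

    # Sort by descending count; break ties by token (assumption) so the
    # result does not depend on the order of words in the document.
    profile.sort(key=lambda pair: (-pair[1], pair[0]))
    return profile


def text_profile_signature(text, quant_rate=0.01, min_token_len=2):
    """MD5 hex digest of the quantized profile (hypothetical helper)."""
    profile = quantized_profile(text, quant_rate, min_token_len)
    payload = "\n".join(f"{tok} {cnt}" for tok, cnt in profile)
    return hashlib.md5(payload.encode("utf-8")).hexdigest()


doc_a = "solr solr solr search search search index index rocks"
doc_b = "solr solr solr search search search index index rules"
doc_c = "lucene lucene lucene query query parser"

# doc_a and doc_b differ only in one rare word, so their signatures match.
print(text_profile_signature(doc_a) == text_profile_signature(doc_b))
print(text_profile_signature(doc_a) == text_profile_signature(doc_c))
```

Two documents get the same signature only when their quantized profiles are
identical, so this catches small edits to rare words but not documents that
differ in their frequent terms.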

What do you think?


2013/9/22 Furkan KAMACI <furkankamaci@gmail.com>

> I want to detect near-duplicate documents (for web documents). I know that
> there is an algorithm called Winnowing and that there is another technique
> used by Google. However, I also know that Solr has a component called
> MoreLikeThis. Google's page explains that *mirroring and plagiarism* are
> easy to detect, but near-duplicate detection goes well beyond that.
>
> So I want to ask: what is the underlying algorithm that Solr's MoreLikeThis
> component uses, and can I use it for this kind of purpose?
>
> Otherwise, I will implement an algorithm for near-duplicate document
> detection within a few days, and I will be proud to contribute it
> to Solr.
>
> Thanks;
> Furkan KAMACI
>
