lucene-solr-user mailing list archives

From Furkan KAMACI <>
Subject Re: Near Duplicate Document Detection at Solr
Date Sun, 22 Sep 2013 19:14:23 GMT
I also know that there is another mechanism in Solr. I think that I should add a
custom signature, because that is the most usable option for me. On the other hand,
are there any limitations on deduplication in SolrCloud?

What do you think?

2013/9/22 Furkan KAMACI <>

> I want to detect near-duplicate documents (for web documents). I know that
> there is an algorithm called Winnowing, and there is another technique used
> by Google. However, I also know that Solr has a component called
> MoreLikeThis. Google's page explains that *mirroring and plagiarism* are
> easy to detect, but that there is much more behind near-duplicate detection.
> So I want to ask: what is the underlying algorithm that Solr's MoreLikeThis
> component uses, and can I use it for this kind of purpose?
> Otherwise, I will implement an algorithm for near-duplicate document
> detection within a few days, and I will be proud to contribute it and adapt
> it to Solr.
> Thanks;
> Furkan KAMACI
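
For illustration only, the Winnowing idea mentioned in the quoted message can be sketched roughly as below. This is a minimal standalone sketch, not Solr code and not what MoreLikeThis does. The function names (kgram_hashes, winnow, jaccard), the parameters k=5 and window=4, the use of truncated MD5 as the hash, and the ASCII-only normalization are all assumptions made for the example. It keeps only the selected hash values and drops their positions.

```python
import hashlib
import re


def kgram_hashes(text, k=5):
    """Hash every k-character window of the normalized text."""
    # Assumed normalization: lowercase, keep only ASCII letters and digits.
    normalized = re.sub(r"[^a-z0-9]", "", text.lower())
    hashes = []
    for i in range(len(normalized) - k + 1):
        gram = normalized[i:i + k]
        digest = hashlib.md5(gram.encode("utf-8")).digest()
        # Truncate to 32 bits; any stable hash would do for this sketch.
        hashes.append(int.from_bytes(digest[:4], "big"))
    return hashes


def winnow(text, k=5, window=4):
    """Return the set of fingerprints: the minimum hash of each window."""
    hashes = kgram_hashes(text, k)
    if not hashes:
        return set()
    if len(hashes) < window:
        # Text too short for a full window: fall back to a single minimum.
        return {min(hashes)}
    fingerprints = set()
    for start in range(len(hashes) - window + 1):
        fingerprints.add(min(hashes[start:start + window]))
    return fingerprints


def jaccard(a, b):
    """Jaccard similarity of two fingerprint sets (1.0 if both are empty)."""
    if not a and not b:
        return 1.0
    return len(a & b) / len(a | b)


if __name__ == "__main__":
    doc1 = "The quick brown fox jumps over the lazy dog near the river bank"
    doc2 = "The quick brown fox jumps over the sleepy dog near the river bank"
    print(jaccard(winnow(doc1), winnow(doc2)))
```

With these parameters, any shared run of at least window + k - 1 = 8 normalized characters contributes at least one common fingerprint, so two documents could be flagged as near duplicates when their similarity exceeds some threshold chosen for the collection.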
