lucene-solr-user mailing list archives

From Tommaso Teofili <tommaso.teof...@gmail.com>
Subject Re: Document Similarity Algorithm at Solr/Lucene
Date Tue, 23 Jul 2013 14:25:28 GMT
If you need a specialized algorithm for detecting blog post plagiarism /
quotations (which are different tasks, IMHO), I think you have two options:
1. implement a dedicated one based on your features / metrics / domain
2. try to fine-tune an existing algorithm that is flexible enough

If I were to do it with Solr, I'd probably do something like:
1. index the "original" blog posts in Solr (possibly using Jack's suggestion
about ngrams / shingles)
2. run MLT queries using the text of each "candidate copy" blog post
3. take the first, say, 2-3 hits
4. mark the candidate as a quote / plagiarism of those hits
5. eventually train a classifier to help you mark other texts as quote /
plagiarism
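
To make the shingle idea behind steps 1-4 a bit more concrete, here is a
minimal sketch in plain Python. It is not Solr or MLT: the in-memory dict
stands in for the index, the function names (shingles, containment,
top_matches) are made up for illustration, and the shingle size of 3 and the
0.5 threshold are arbitrary assumptions you would need to tune on your own data.

```python
import re


def shingles(text, size=3):
    """Return the set of word n-grams (shingles) of the given size."""
    words = re.findall(r"\w+", text.lower())
    if len(words) < size:
        # Text shorter than one shingle: treat the whole text as one shingle.
        return {tuple(words)} if words else set()
    return {tuple(words[i:i + size]) for i in range(len(words) - size + 1)}


def containment(candidate, original, size=3):
    """Fraction of the candidate's shingles that also occur in the original."""
    cand = shingles(candidate, size)
    if not cand:
        return 0.0
    return len(cand & shingles(original, size)) / len(cand)


def top_matches(candidate, originals, k=3, threshold=0.5, size=3):
    """Rank originals by containment; keep at most k scoring >= threshold.

    `originals` is a hypothetical stand-in for the index: {doc_id: text}.
    """
    scored = [
        (containment(candidate, text, size), doc_id)
        for doc_id, text in originals.items()
    ]
    scored = [(score, doc_id) for score, doc_id in scored if score >= threshold]
    scored.sort(key=lambda pair: (-pair[0], pair[1]))
    return [(doc_id, score) for score, doc_id in scored[:k]]


if __name__ == "__main__":
    originals = {
        "post-1": "the quick brown fox jumps over the lazy dog",
        "post-2": "solr is a search server built on lucene",
    }
    candidate = "he wrote that the quick brown fox jumps over the lazy dog yesterday"
    for doc_id, score in top_matches(candidate, originals):
        print(doc_id, round(score, 2))
```

Containment (rather than a symmetric measure) is used here on the assumption
that a short quotation inside a long post should still score high; whether that
fits depends on whether you are after quotations or full duplicates.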

HTH,
Tommaso



2013/7/23 Furkan KAMACI <furkankamaci@gmail.com>

> Actually I need a specialized algorithm. I want to use that algorithm to
> detect duplicate blog posts.
>
> 2013/7/23 Tommaso Teofili <tommaso.teofili@gmail.com>
>
> > Hi,
> >
> > I think you may leverage and / or improve the MLT component [1].
> >
> > HTH,
> > Tommaso
> >
> > [1] : http://wiki.apache.org/solr/MoreLikeThis
> >
> >
> > 2013/7/23 Furkan KAMACI <furkankamaci@gmail.com>
> >
> > > Hi;
> > >
> > > Sometimes a large part of a document may also exist in another document,
> > > as in student plagiarism or when one blog post quotes another blog post.
> > > Do Solr/Lucene or their libraries (UIMA, OpenNLP, etc.) have any class to
> > > detect this?
> > >
> >
>
