lucene-solr-user mailing list archives

From Roman Chyla <roman.ch...@gmail.com>
Subject Re: Document Similarity Algorithm at Solr/Lucene
Date Wed, 24 Jul 2013 17:58:50 GMT
This paper contains an excellent algorithm for plagiarism detection, but
beware: the published version had a mistake in the algorithm. Look for the
corrections. I can't find them now, but I know they have been published
(perhaps by one of the co-authors). You could do it with Solr: create an
index of hashes, with the twist of storing the position of the original text
(the source of the hash) together with the token, and Solr highlighting
would do the rest for you :)
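
A minimal sketch of the "hashes plus positions" idea, for illustration only.
It follows the general winnowing scheme (hash every k-character gram, then
keep the minimum hash of each sliding window) as far as I can tell from the
paper linked below. It is not the corrected algorithm mentioned above, and it
does not touch Solr at all. The function names, the parameters k and w, and
the use of CRC32 as the hash are all assumptions made for this example.

```python
import zlib


def normalize(text):
    """Lowercase and keep only alphanumerics, remembering original offsets."""
    chars, offsets = [], []
    for i, ch in enumerate(text):
        if ch.isalnum():
            # lower() can yield more than one character for some code points,
            # so every resulting character is mapped back to offset i.
            for c in ch.lower():
                chars.append(c)
                offsets.append(i)
    return "".join(chars), offsets


def winnow(text, k=5, w=4):
    """Return fingerprints as (hash, start, end) offsets into the ORIGINAL text.

    k is the gram length and w the window size (both hypothetical defaults).
    """
    norm, offsets = normalize(text)
    hashes = [
        (zlib.crc32(norm[i:i + k].encode("utf-8")), i)
        for i in range(len(norm) - k + 1)
    ]
    if not hashes:
        return []
    selected, last_pos = [], -1
    for start in range(max(1, len(hashes) - w + 1)):
        window = hashes[start:start + w]
        # Smallest hash in the window; the rightmost one wins on ties.
        h, pos = min(window, key=lambda hp: (hp[0], -hp[1]))
        if pos != last_pos:  # record each selected position only once
            selected.append((h, offsets[pos], offsets[pos + k - 1] + 1))
            last_pos = pos
    return selected


def shared_fingerprints(original, candidate, k=5, w=4):
    """Pair up spans of both texts that produced the same fingerprint hash."""
    index = {}
    for h, s, e in winnow(original, k, w):
        index.setdefault(h, []).append((s, e))
    return [
        (index[h], (s, e))
        for h, s, e in winnow(candidate, k, w)
        if h in index
    ]


if __name__ == "__main__":
    a = "Intro text. The quick brown fox jumps over the lazy dog. The end."
    b = "Someone wrote: the quick brown fox jumps over the lazy dog! Agreed?"
    for spans_in_a, (s, e) in shared_fingerprints(a, b):
        print(repr(b[s:e]), "also at", spans_in_a)
```

In a Solr setup the (hash, start, end) triples would presumably be indexed as
tokens carrying their offsets, so that highlighting can map a matching hash
back to the original passage; that part is not shown here.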

roman


On Tue, Jul 23, 2013 at 11:07 AM, Shashi Kant <skant@sloan.mit.edu> wrote:

> Here is a paper that I found useful:
> http://theory.stanford.edu/~aiken/publications/papers/sigmod03.pdf
>
>
> On Tue, Jul 23, 2013 at 10:42 AM, Furkan KAMACI <furkankamaci@gmail.com> wrote:
> > Thanks for your comments.
> >
> > 2013/7/23 Tommaso Teofili <tommaso.teofili@gmail.com>
> >
> >> if you need a specialized algorithm for detecting blogposts plagiarism /
> >> quotations (which are different tasks IMHO) I think you have 2 options:
> >> 1. implement a dedicated one based on your features / metrics / domain
> >> 2. try to fine tune an existing algorithm that is flexible enough
> >>
> >> If I were to do it with Solr I'd probably do something like:
> >> 1. index "original" blogposts in Solr (possibly using Jack's suggestion
> >> about ngrams / shingles)
> >> 2. do MLT queries with "candidate blogposts copies" text
> >> 3. get the first, say, 2-3 hits
> >> 4. mark it as quote / plagiarism
> >> 5. eventually train a classifier to help you mark other texts as quote /
> >> plagiarism
> >>
> >> HTH,
> >> Tommaso
> >>
> >>
> >>
> >> 2013/7/23 Furkan KAMACI <furkankamaci@gmail.com>
> >>
> >> > Actually I need a specialized algorithm. I want to use that algorithm to
> >> > detect duplicate blog posts.
> >> >
> >> > 2013/7/23 Tommaso Teofili <tommaso.teofili@gmail.com>
> >> >
> >> > > Hi,
> >> > >
> >> > > I think you may leverage and / or improve the MLT component [1].
> >> > >
> >> > > HTH,
> >> > > Tommaso
> >> > >
> >> > > [1] : http://wiki.apache.org/solr/MoreLikeThis
> >> > >
> >> > >
> >> > > 2013/7/23 Furkan KAMACI <furkankamaci@gmail.com>
> >> > >
> >> > > > Hi;
> >> > > >
> >> > > > Sometimes a large part of a document may exist in another document, as
> >> > > > in student plagiarism or the quotation of a blog post in another blog
> >> > > > post. Do Solr/Lucene or their libraries (UIMA, OpenNLP, etc.) have any
> >> > > > class to detect it?
> >> > > >
> >> > >
> >> >
> >>
>
