lucene-solr-user mailing list archives

From Alexey Kozhemiakin <Alexey_Kozhemia...@epam.com>
Subject RE: Document Similarity Algorithm at Solr/Lucene
Date Mon, 05 Aug 2013 11:12:43 GMT
We used the MLT component to implement a sort of "near exact duplicate detection" - which
is probably very similar to your task.

http://wiki.apache.org/solr/MoreLikeThis 

You may think of MoreLikeThis as a two-phase process (transform a document into a query and run
it):
	1a) it tokenizes the input document into a word stream
	1b) removes words that are too long or too short (mlt.minwl, mlt.maxwl parameters) and rare words,
based on their frequency in this document and in the entire corpus (mlt.mintf, mlt.mindf parameters)
	1c) optionally assigns a boost (tf-idf coefficient) to the remaining words (mlt.boost parameter)
	1d) takes the top N=mlt.maxqt words to shorten the query
	2) executes a "regular" query against the fields listed in mlt.fl
As you see, there's no magic here. All steps are configurable, so you can choose a balance between
performance and quality.
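The query-building phase above can be sketched roughly like this (an illustrative toy using plain tf-idf, not Lucene's actual MoreLikeThis code; parameter names mirror the mlt.* parameters):

```python
import math
import re
from collections import Counter

def mlt_query_terms(doc, corpus, min_wl=3, max_wl=30, min_tf=2, min_df=2, max_qt=5):
    """Pick the top max_qt terms of `doc` by tf-idf, mimicking the
    mlt.minwl / mlt.maxwl / mlt.mintf / mlt.mindf / mlt.maxqt parameters."""
    tokenize = lambda text: re.findall(r"\w+", text.lower())
    tf = Counter(tokenize(doc))                      # 1a) token stream -> term frequencies
    n_docs = len(corpus)
    df = Counter()                                   # document frequency over the corpus
    for d in corpus:
        df.update(set(tokenize(d)))
    scored = []
    for term, freq in tf.items():
        if not (min_wl <= len(term) <= max_wl):      # 1b) drop too-short / too-long words
            continue
        if freq < min_tf or df[term] < min_df:       # 1b) drop rare words
            continue
        idf = math.log(n_docs / (1 + df[term])) + 1  # 1c) tf-idf boost
        scored.append((freq * idf, term))
    scored.sort(reverse=True)
    return [term for _, term in scored[:max_qt]]     # 1d) top N = maxqt terms
```

The resulting term list would then be OR'ed into a regular query against the mlt.fl fields (phase 2).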

Our near exact duplicate algorithm relies on the fact that a document should be most similar
to itself, so if we have documents D1...D5, then MLT(D2) = {D2, D1, D3, D4, D5}.
To achieve this we had to disable field norms, to avoid shorter documents being more
similar to the given document than the document itself.
We can therefore normalize the document's own score as 100% and consider as a "near exact duplicate"
every other doc returned from MLT with a score above a certain threshold (90% in our case).

We also added a secret sauce - filtering documents by length, within a window of +/- 20%. Otherwise,
because of the "fingerprint" nature of MLT, we got incorrectly identified documents that were too
long or too short.

In a nutshell, the flow is:
1) index the document - to ensure it will be returned by MLT
2) execute MLT
3) take the docs whose length is within +/- 20% of this document's length
4) normalize all scores by the score of the first document (which should be the document itself)
5) take the docs whose score is > 90%
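The post-processing steps (3-5) can be sketched like this; the tuple layout of `mlt_hits` is an assumption for illustration, standing in for whatever fields an MLT response is parsed into:

```python
def near_duplicates(query_doc_len, mlt_hits, len_tolerance=0.2, threshold=0.9):
    """mlt_hits: list of (doc_id, score, doc_len) tuples from an MLT query,
    with the query document itself expected as the top hit (norms disabled)."""
    if not mlt_hits:
        return []
    self_score = mlt_hits[0][1]              # step 4: the doc itself defines 100%
    lo = query_doc_len * (1 - len_tolerance) # step 3: +/- 20% length window
    hi = query_doc_len * (1 + len_tolerance)
    dupes = []
    for doc_id, score, doc_len in mlt_hits[1:]:
        if not (lo <= doc_len <= hi):
            continue
        if score / self_score > threshold:   # step 5: keep scores above 90%
            dupes.append(doc_id)
    return dupes
```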



Hope it helps, please let me know if you need more details.


Alex 


-----Original Message-----
From: Furkan KAMACI [mailto:furkankamaci@gmail.com] 
Sent: Thursday, July 25, 2013 11:18
To: solr-user@lucene.apache.org
Subject: Re: Document Similarity Algorithm at Solr/Lucene

BTW, how does Solr's MoreLikeThis component work? Which algorithm does it use under the hood?


2013/7/24 Roman Chyla <roman.chyla@gmail.com>

> This paper contains an excellent algorithm for plagiarism detection, 
> but beware: the published version had a mistake in the algorithm - look 
> for the corrections - I can't find them now, but I know they have been 
> published (perhaps by one of the co-authors). You could do it with 
> Solr by creating an index of hashes, with the twist of storing the position 
> of the original text (the source of each hash) together with the token, and 
> Solr highlighting would do the rest for you :)
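The hash-index idea above can be sketched as a minimal winnowing-style fingerprinter (a toy with illustrative k-gram and window sizes, not the paper's corrected algorithm; positions are kept alongside hashes so matches can be mapped back to the source text for highlighting):

```python
def winnow(text, k=5, w=4):
    """Winnowing-style fingerprints: hash every k-gram, then keep the minimum
    hash in each sliding window of w hashes, recording its position so hits
    can be traced back to the source text."""
    text = "".join(ch.lower() for ch in text if ch.isalnum())
    hashes = [hash(text[i:i + k]) for i in range(len(text) - k + 1)]
    fingerprints = set()
    for start in range(len(hashes) - w + 1):
        window = hashes[start:start + w]
        m = min(window)
        fingerprints.add((m, start + window.index(m)))  # (hash, position)
    return fingerprints

def overlap(a, b):
    """Fraction of a's fingerprint hashes that also appear in b."""
    ha = {h for h, _ in a}
    hb = {h for h, _ in b}
    return len(ha & hb) / max(1, len(ha))
```

Note that Python's built-in hash() for strings is randomized per process, so fingerprints are only comparable within one run; a persistent index would need a stable hash.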
>
> roman
>
>
> On Tue, Jul 23, 2013 at 11:07 AM, Shashi Kant <skant@sloan.mit.edu> wrote:
>
> > Here is a paper that I found useful:
> > http://theory.stanford.edu/~aiken/publications/papers/sigmod03.pdf
> >
> >
> > On Tue, Jul 23, 2013 at 10:42 AM, Furkan KAMACI 
> > <furkankamaci@gmail.com>
> > wrote:
> > > Thanks for your comments.
> > >
> > > 2013/7/23 Tommaso Teofili <tommaso.teofili@gmail.com>
> > >
> > >> If you need a specialized algorithm for detecting blogpost plagiarism /
> > >> quotations (which are different tasks IMHO), I think you have 2 options:
> > >> 1. implement a dedicated one based on your features / metrics / domain
> > >> 2. try to fine tune an existing algorithm that is flexible enough
> > >>
> > >> If I were to do it with Solr I'd probably do something like:
> > >> 1. index "original" blogposts in Solr (possibly using Jack's suggestion
> > >> about ngrams / shingles)
> > >> 2. do MLT queries with the "candidate blogpost copies" text
> > >> 3. get the first, say, 2-3 hits
> > >> 4. mark it as quote / plagiarism
> > >> 5. eventually train a classifier to help you mark other texts as quote /
> > >> plagiarism
> > >>
> > >> HTH,
> > >> Tommaso
> > >>
> > >>
> > >>
> > >> 2013/7/23 Furkan KAMACI <furkankamaci@gmail.com>
> > >>
> > >> > Actually I need a specialized algorithm. I want to use that
> > >> > algorithm to detect duplicate blog posts.
> > >> >
> > >> > 2013/7/23 Tommaso Teofili <tommaso.teofili@gmail.com>
> > >> >
> > >> > > Hi,
> > >> > >
> > >> > > I think you may leverage and / or improve the MLT component [1].
> > >> > >
> > >> > > HTH,
> > >> > > Tommaso
> > >> > >
> > >> > > [1] : http://wiki.apache.org/solr/MoreLikeThis
> > >> > >
> > >> > >
> > >> > > 2013/7/23 Furkan KAMACI <furkankamaci@gmail.com>
> > >> > >
> > >> > > > Hi;
> > >> > > >
> > >> > > > Sometimes a huge part of a document may exist in another
> > >> > > > document, as in student plagiarism or quotation of a blog post
> > >> > > > at another blog post. Does Solr/Lucene or its libraries (UIMA,
> > >> > > > OpenNLP, etc.) have any class to detect it?
> > >> > > >
> > >> > >
> > >> >
> > >>
> >
>
