lucene-java-user mailing list archives

From David Spencer <dave-lucene-u...@tropo.com>
Subject Re: Document comparison
Date Fri, 18 Feb 2005 22:15:24 GMT
Otis Gospodnetic wrote:

> Matt,
> 
> Erik and I have some code for this in Lucene in Action, but David
> Spencer did this since the book was published:
> 
>   http://www.lucenebook.com/blog/announcements/more_like_this.html


If you want an informal way of doing it, you're right: just feed the 
words of the source doc to a query. The doc for the code is at this 
easy-to-remember URL:
http://searchmorph.com/pub/jakarta-lucene-sandbox/contributions/similarity/build/docs/api/org/apache/lucene/search/similar/SimilarityQueries.html#formSimilarQuery(java.lang.String,%20org.apache.lucene.analysis.Analyzer,%20java.lang.String,%20java.util.Set)

Follow Otis's link above to my weblog for the code.

The "MoreLikeThis" stuff is similar but more sophisticated.
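
To make the "feed the words of the source doc to a query" idea concrete, here is a minimal, library-free sketch in plain Java. It is not the SimilarityQueries code linked above: the class name, method names, the regex tokenizer, and the "contents" field in the usage note are all assumptions made for illustration. A real setup would run the text through a Lucene Analyzer and build the query from its tokens.

```java
import java.util.ArrayList;
import java.util.LinkedHashSet;
import java.util.List;
import java.util.Locale;
import java.util.Set;

// Hypothetical helper, not part of Lucene: turns a document body into
// a flat list of optional field:term clauses.
final class SimilarQuerySketch {

    // Lowercase the body, split on anything that is not a letter or digit,
    // drop stop words, and keep each remaining term once, in first-seen order.
    static List<String> distinctTerms(String body, Set<String> stopWords) {
        Set<String> seen = new LinkedHashSet<>();
        for (String token : body.toLowerCase(Locale.ROOT).split("[^\\p{L}\\p{Nd}]+")) {
            if (!token.isEmpty() && !stopWords.contains(token)) {
                seen.add(token);
            }
        }
        return new ArrayList<>(seen);
    }

    // Join the terms as space-separated field:term clauses. In the classic
    // Lucene query syntax with the default OR operator, each clause is
    // optional, so documents sharing more terms tend to score higher.
    static String toQueryString(String field, List<String> terms) {
        StringBuilder sb = new StringBuilder();
        for (String term : terms) {
            if (sb.length() > 0) {
                sb.append(' ');
            }
            sb.append(field).append(':').append(term);
        }
        return sb.toString();
    }
}
```

Calling toQueryString("contents", distinctTerms(text, stopWords)) gives a string you could hand to a query parser. Long documents can produce more clauses than a boolean query allows by default, which is one reason to select only the most informative terms, as the MoreLikeThis approach does.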


Also, if you want the IR way, I think you'd compute a "cosine measure". I 
know carrot2 has code for it; this might be it:

http://www.searchmorph.com/pub/carrot2/jd/com/chilang/carrot/filter/cluster/rough/measure/CosineCoefficient.html
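
For reference, the cosine measure treats each document as a vector of term weights and returns the dot product of the two vectors divided by the product of their lengths. The sketch below is a self-contained illustration using raw term counts. It is not the carrot2 CosineCoefficient class linked above, and the class name, method names, and regex tokenizer are assumptions. In practice you would more likely use tf-idf weights taken from the index.

```java
import java.util.HashMap;
import java.util.Locale;
import java.util.Map;

// Hypothetical helper, not taken from carrot2 or Lucene.
final class CosineSketch {

    // Count how often each lowercased word occurs in the text.
    static Map<String, Integer> termFrequencies(String text) {
        Map<String, Integer> freqs = new HashMap<>();
        for (String token : text.toLowerCase(Locale.ROOT).split("[^\\p{L}\\p{Nd}]+")) {
            if (!token.isEmpty()) {
                freqs.merge(token, 1, Integer::sum);
            }
        }
        return freqs;
    }

    // cosine(a, b) = (a . b) / (|a| * |b|); returns 0.0 if either vector is empty.
    static double cosine(Map<String, Integer> a, Map<String, Integer> b) {
        double dot = 0.0;
        double normA = 0.0;
        double normB = 0.0;
        for (Map.Entry<String, Integer> e : a.entrySet()) {
            double fa = e.getValue();
            normA += fa * fa;
            Integer fb = b.get(e.getKey());
            if (fb != null) {
                dot += fa * fb; // only terms present in both documents contribute
            }
        }
        for (int fb : b.values()) {
            normB += (double) fb * fb;
        }
        if (normA == 0.0 || normB == 0.0) {
            return 0.0;
        }
        return dot / (Math.sqrt(normA) * Math.sqrt(normB));
    }

    public static void main(String[] args) {
        Map<String, Integer> d1 = termFrequencies("lucene document similarity");
        Map<String, Integer> d2 = termFrequencies("lucene document comparison");
        System.out.println(cosine(d1, d2));
    }
}
```

With non-negative weights the result lies between 0 (no shared terms) and 1 (same direction), and unlike a raw query score it is symmetric in the two documents.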
> 
> Otis
> 
> --- Matt Chaput <matt@sidefx.com> wrote:
> 
> 
>>Is there a simple, efficient way to compute similarity of documents 
>>indexed with Lucene?
>>
>>My first, naive idea is to use the entire contents of one document as a
>>query to the second document, and use the score as a similarity 
>>measurement. But I think I'm probably way off base with that.
>>
>>Can any IR pros set me straight? Thanks very much.
>>
>>Matt
>>
>>
>>--
>>Matt Chaput
>>Word Monkey
>>Side Effects Software Inc.
>>
>>"A goddamned ray of sunshine all the goddamned time"
>>-- Sparkle Hayter
>>
>>
>>---------------------------------------------------------------------
>>To unsubscribe, e-mail: lucene-user-unsubscribe@jakarta.apache.org
>>For additional commands, e-mail: lucene-user-help@jakarta.apache.org
>>
>>



