lucene-java-user mailing list archives

From David Spencer <>
Subject Re: Find Documents 'Similar' to Another
Date Fri, 30 May 2003 10:43:52 GMT
John Cwikla wrote:

>Depends what "similar" means.
>If by similar you mean they contain a lot of the same words/phrases, then
>you can probably use a query (although the number of terms you can have
>is limited to 32 or 64, I think) and get documents using Lucene.

I have a demo site that does this.
I thought I posted a mention of this a week ago in a different thread,
but it seems my mailer ate the outgoing message, so I don't think this
is a dup.

Anyway, the site is:

I index all RFCs w/ Lucene.

When you're viewing an RFC at a URL like this:

if you click on 'show similar' then you go to a URL like this:

The implementation takes the source RFC, forms a query with all the
words in it, and then passes this large query to Lucene. The display
lists docs that are 'similar' to the source doc, and the little blue
bar graph is bigger for more similar docs.
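
A minimal sketch of the query-building step, not the demo site's actual
code. It assumes the query parser treats whitespace-separated words as
optional (OR) clauses and that simple lower-casing and splitting on
non-alphanumerics is an acceptable stand-in for a real analyzer. The
class and method names (AllWordsQuery, distinctWords, toQueryString)
are made up for illustration.

```java
import java.util.LinkedHashSet;
import java.util.Locale;
import java.util.Set;

// Hypothetical helper: turns a document's text into one large "all words" query string.
final class AllWordsQuery {

    /** Lower-cases the text, splits on non-alphanumerics, and keeps each distinct word once, in first-seen order. */
    static Set<String> distinctWords(String text) {
        Set<String> words = new LinkedHashSet<>();
        for (String token : text.toLowerCase(Locale.ROOT).split("[^a-z0-9]+")) {
            if (!token.isEmpty()) {
                words.add(token);
            }
        }
        return words;
    }

    /** Joins the distinct words with spaces; assumed to be parsed as optional clauses. */
    static String toQueryString(String text) {
        return String.join(" ", distinctWords(text));
    }

    public static void main(String[] args) {
        // prints "the internet protocol for"
        System.out.println(toQueryString("The Internet Protocol; the protocol for the Internet."));
    }
}
```

The resulting string would then be handed to whatever query parser the
index uses. A long document yields a very large query, so any limit on
the number of query terms matters here.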

The idea is that the more words two documents have in common, and the
rarer those shared words are, the more similar they are.
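
To make that concrete, here is a small self-contained sketch of the
idea. It is not Lucene's scoring formula: it assumes a simplified
weight of log(N / df) per shared word, where N is the number of
documents and df is how many of them contain the word. All names
(SharedWordSimilarity, words, documentFrequencies, score) are
hypothetical.

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.LinkedHashSet;
import java.util.List;
import java.util.Locale;
import java.util.Map;
import java.util.Set;

// Hypothetical sketch: shared words count toward similarity, rare shared words count more.
final class SharedWordSimilarity {

    /** Distinct lower-cased words of a text, split on non-alphanumerics. */
    static Set<String> words(String text) {
        Set<String> result = new LinkedHashSet<>();
        for (String token : text.toLowerCase(Locale.ROOT).split("[^a-z0-9]+")) {
            if (!token.isEmpty()) {
                result.add(token);
            }
        }
        return result;
    }

    /** For each word, the number of documents that contain it. */
    static Map<String, Integer> documentFrequencies(List<Set<String>> docs) {
        Map<String, Integer> df = new HashMap<>();
        for (Set<String> doc : docs) {
            for (String word : doc) {
                df.merge(word, 1, Integer::sum);
            }
        }
        return df;
    }

    /** Sums log(numDocs / df) over the words that both documents contain (assumed weighting). */
    static double score(Set<String> source, Set<String> candidate,
                        Map<String, Integer> df, int numDocs) {
        double total = 0.0;
        for (String word : source) {
            if (candidate.contains(word)) {
                int docsWithWord = df.getOrDefault(word, 1);
                total += Math.log((double) numDocs / docsWithWord);
            }
        }
        return total;
    }

    public static void main(String[] args) {
        List<Set<String>> docs = new ArrayList<>();
        docs.add(words("tcp congestion control window"));   // source document
        docs.add(words("tcp congestion avoidance window"));
        docs.add(words("tcp mail transfer"));
        Map<String, Integer> df = documentFrequencies(docs);
        for (int i = 1; i < docs.size(); i++) {
            System.out.println("doc " + i + ": " + score(docs.get(0), docs.get(i), df, docs.size()));
        }
    }
}
```

In this toy corpus the second document outscores the third: it shares
"congestion" and "window" with the source, while the third shares only
"tcp", which appears in every document and so contributes nothing.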

Similarity is not by concept, and even alternate spellings throw off 
this simple technique.

Site warning: the Linux JVM just crashed, so I upgraded to 1.4.2b, which
is hopefully more stable.

>If by similar you are trying to determine whether the text in some
>documents is byte-for-byte the same except for some small deviations,
>you are probably interested in using a Nilsimsa signature.
>If you have some words/phrases that give you a starting point of
>documents to check for similarity, you could use Lucene first, and then
>Nilsimsa second.
>If you are talking about conceptual similarity, you probably have a big
>research project on your
>-----Original Message-----
>From: Bruce Ritchie []
>Sent: Thursday, May 29, 2003 1:41 PM
>To: Lucene Users List
>Subject: Re: Find Documents 'Similar' to Another
>Wirthlin, Rick - Workstream wrote:
>>I have a requirement to find documents similar to another. Can that be
>>accomplished using a PhraseQuery, or some other way?
>One option I'm looking at to get this functionality is the InfoWrangler
>( as it does this and seems to be at least partially
>based upon lucene. If 
>other people know of other (available) options I'd love to hear of them as
>Bruce Ritchie
>To unsubscribe, e-mail:
>For additional commands, e-mail:

