lucene-java-user mailing list archives

From "Terry Steichen" <te...@net-frame.com>
Subject Re: Similar Document Search
Date Tue, 19 Aug 2003 03:42:59 GMT
Hi Peter,

What got me thinking about this was the way that Lucene computes similarity
(or scoring).  After the boolean keyword matches have been found, Lucene
then computes relevance.  What Lucene does, I think, is process the query
into some intermediate internal representation and then compute the
similarity between the query (now a kind of pseudo-document) and each of
the matching hits.
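
To make the two-step picture above concrete, here is a minimal sketch in Python. It is not Lucene's actual scoring formula: it assumes a plain tf-idf weighting with cosine similarity, and the corpus, the document ids, and every function name (`idf_table`, `weight_vector`, `cosine`, `search`) are made up for illustration.

```python
import math
from collections import Counter

# Toy corpus; ids and texts are hypothetical.
DOCS = {
    "d1": "lucene scores documents by term frequency",
    "d2": "search engines rank documents by similarity",
    "d3": "cooking pasta with fresh tomato sauce",
}

def tokenize(text):
    return text.lower().split()

def idf_table(docs):
    """Inverse document frequency for every term in the corpus."""
    n = len(docs)
    df = Counter(t for text in docs.values() for t in set(tokenize(text)))
    return {t: 1.0 + math.log(n / count) for t, count in df.items()}

def weight_vector(text, idf):
    """tf-idf vector as a dict; terms unknown to the corpus are dropped."""
    tf = Counter(tokenize(text))
    return {t: f * idf[t] for t, f in tf.items() if t in idf}

def cosine(a, b):
    dot = sum(w * b.get(t, 0.0) for t, w in a.items())
    norm_a = math.sqrt(sum(w * w for w in a.values()))
    norm_b = math.sqrt(sum(w * w for w in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0

def search(query, docs):
    """Treat the query as a pseudo-document and score only the matching hits."""
    idf = idf_table(docs)
    q_vec = weight_vector(query, idf)
    hits = []
    for doc_id, text in docs.items():
        d_vec = weight_vector(text, idf)
        # Step 1: keyword match (here simply: at least one shared term).
        if not set(q_vec) & set(d_vec):
            continue
        # Step 2: relevance, as similarity between query and document vectors.
        hits.append((doc_id, cosine(q_vec, d_vec)))
    return sorted(hits, key=lambda h: h[1], reverse=True)

results = search("documents similarity", DOCS)
```

With this toy corpus, `d3` never reaches the scoring step because it shares no term with the query, which is the "boolean match first, relevance second" behaviour described above.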

I was wondering if there might not be a way to internally process a selected
document (rather than the query per se) and then, in effect, compute the
similarity between that document and all the other documents (which have
already been pre-processed in the indexing process).  So, what you'd be
doing is not a boolean keyword match, but a ranking of all the documents in
the repository on the basis of relevance or similarity to the target
document.
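
As a rough sketch of that idea (again assuming tf-idf weights and cosine similarity rather than anything Lucene provides out of the box), the following Python pre-computes one vector per document and then ranks every other document against a chosen target. The corpus and the names `build_vectors` and `more_like` are hypothetical.

```python
import math
from collections import Counter

def build_vectors(docs):
    """Pre-compute one unit-length tf-idf vector per document (the 'indexing' step)."""
    tokens = {doc_id: text.lower().split() for doc_id, text in docs.items()}
    df = Counter(t for toks in tokens.values() for t in set(toks))
    n = len(docs)
    vectors = {}
    for doc_id, toks in tokens.items():
        tf = Counter(toks)
        vec = {t: f * (1.0 + math.log(n / df[t])) for t, f in tf.items()}
        # Guard against empty documents, whose norm would be zero.
        norm = math.sqrt(sum(w * w for w in vec.values())) or 1.0
        vectors[doc_id] = {t: w / norm for t, w in vec.items()}
    return vectors

def more_like(target_id, vectors):
    """Rank every other document by cosine similarity to the target document."""
    target = vectors[target_id]
    scores = []
    for doc_id, vec in vectors.items():
        if doc_id == target_id:
            continue
        # Vectors are unit length, so the dot product is the cosine.
        dot = sum(w * vec.get(t, 0.0) for t, w in target.items())
        scores.append((doc_id, dot))
    return sorted(scores, key=lambda s: s[1], reverse=True)

# Toy corpus; ids and texts are hypothetical.
corpus = {
    "a": "lucene index search query",
    "b": "lucene search scoring query",
    "c": "garden tomato plant soil",
}
ranking = more_like("a", build_vectors(corpus))
```

Note that no keyword filter is applied here: every document gets a score, including those that share nothing with the target (they simply score zero). The loop touches every document, so the cost grows with the size of the collection.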

(If that's not too far off in terms of reality, maybe Doug could comment?)

Regards,

Terry

----- Original Message -----
From: "Peter Becker" <pbecker@dstc.edu.au>
To: "Lucene Users List" <lucene-user@jakarta.apache.org>
Sent: Monday, August 18, 2003 9:05 PM
Subject: Re: Similar Document Search


> Hi Terry,
>
> we have been thinking about the same problem and in the end we decided
> that most likely the only good solution to this is to keep a
> non-inverted index, i.e. a map from the documents to the terms. Then you
> can look up the terms for a document and query for other documents
> matching some of them (where you get the usual question of what is
> actually interesting: high frequency, low frequency or the mid range).
>
> Indexing would probably be quite expensive since Lucene doesn't seem to
> support changes in the index, and the index for the terms would change
> all the time. We haven't implemented it yet, but it shouldn't be hard to
> code. I just wouldn't expect good performance when indexing large
> collections.
>
>   Peter
>
>
> Terry Steichen wrote:
>
> >Is it possible without extensive additional coding to use Lucene to
> >conduct a search based on a document rather than a query?  (One use of this
> >would be to refine a search by selecting one of the hits returned from the
> >initial query and subsequently retrieving other documents "like" the
> >selected one.)
> >
> >Regards,
> >
> >Terry

