lucene-java-user mailing list archives

Site index · List index
Message view « Date » · « Thread »
Top « Date » · « Thread »
From Peter Becker <>
Subject Re: Similar Document Search
Date Tue, 19 Aug 2003 07:27:10 GMT
Hi Magnus,

thanks for the offer, but unfortunately I can't/don't want to make the 
assumption that I can easily access the documents to re-index them. And 
I don't think this approach would be feasible unless you can keep the 
documents in memory somehow.

Storing the other/non-inverted/normal/whatever index would be expensive 
for indexing, but querying should be a lot faster than having to 
re-index documents. That is in our situation preferable.


Magnus Johansson wrote:

> Hi Peter
> If the original document is available. You could extract keywords from 
> the document
> at query time. That is when someone asks for documents similar to 
> document a. You
> re-analyze document a and in combination with statistics from the 
> Lucene index you extract
> keywords from document a that can then be used as a query for 
> findining similar documents.
> I've got some sample code if anyone is interested.
> /magnus
> Peter Becker wrote:
>> Hi Terry,
>> we have been thinking about the same problem and in the end we 
>> decided that most likely the only good solution to this is to keep a 
>> non-inverted index, i.e. a map from the documents to the terms. Then 
>> you can query the most terms for the documents and query other 
>> documents matching parts of this (where you get the usual question of 
>> what is actually interesting: high frequency, low frequency or the 
>> mid range).
>> Indexing would probably be quite expensive since Lucene doesn't seem 
>> to support changes in the index, and the index for the terms would 
>> change all the time. We haven't implemented it yet, but it shouldn't 
>> be hard to code. I just wouldn't expect good performance when 
>> indexing large collections.
>>  Peter
>> Terry Steichen wrote:
>>> Is it possible without extensive additional coding to use Lucene to 
>>> conduct a search based on a document rather than a query?  (One use 
>>> of this would be to refine a search by selecting one of the hits 
>>> returned from the initial query and subsequently retrieving other 
>>> documents "like" the selected one.)
>>> Regards,
>>> Terry
>> ---------------------------------------------------------------------
>> To unsubscribe, e-mail:
>> For additional commands, e-mail:
> ---------------------------------------------------------------------
> To unsubscribe, e-mail:
> For additional commands, e-mail:

View raw message