lucene-java-user mailing list archives

Site index · List index
Message view « Date » · « Thread »
Top « Date » · « Thread »
From Peter Becker <pbec...@dstc.edu.au>
Subject Re: Similar Document Search
Date Tue, 19 Aug 2003 07:27:10 GMT
Hi Magnus,

thanks for the offer, but unfortunately I can't/don't want to make the 
assumption that I can easily access the documents to re-index them. And 
I don't think this approach would be feasible unless you can keep the 
documents in memory somehow.

Storing the other/non-inverted/normal/whatever index would be expensive 
for indexing, but querying should be a lot faster than having to 
re-index documents. That is in our situation preferable.

  Peter


Magnus Johansson wrote:

> Hi Peter
>
> If the original document is available. You could extract keywords from 
> the document
> at query time. That is when someone asks for documents similar to 
> document a. You
> re-analyze document a and in combination with statistics from the 
> Lucene index you extract
> keywords from document a that can then be used as a query for 
> findining similar documents.
>
> I've got some sample code if anyone is interested.
>
> /magnus
>
>
> Peter Becker wrote:
>
>> Hi Terry,
>>
>> we have been thinking about the same problem and in the end we 
>> decided that most likely the only good solution to this is to keep a 
>> non-inverted index, i.e. a map from the documents to the terms. Then 
>> you can query the most terms for the documents and query other 
>> documents matching parts of this (where you get the usual question of 
>> what is actually interesting: high frequency, low frequency or the 
>> mid range).
>>
>> Indexing would probably be quite expensive since Lucene doesn't seem 
>> to support changes in the index, and the index for the terms would 
>> change all the time. We haven't implemented it yet, but it shouldn't 
>> be hard to code. I just wouldn't expect good performance when 
>> indexing large collections.
>>
>>  Peter
>>
>>
>> Terry Steichen wrote:
>>
>>> Is it possible without extensive additional coding to use Lucene to 
>>> conduct a search based on a document rather than a query?  (One use 
>>> of this would be to refine a search by selecting one of the hits 
>>> returned from the initial query and subsequently retrieving other 
>>> documents "like" the selected one.)
>>>
>>> Regards,
>>>
>>> Terry
>>>
>>>  
>>>
>>
>>
>>
>> ---------------------------------------------------------------------
>> To unsubscribe, e-mail: lucene-user-unsubscribe@jakarta.apache.org
>> For additional commands, e-mail: lucene-user-help@jakarta.apache.org
>>
>
>
>
> ---------------------------------------------------------------------
> To unsubscribe, e-mail: lucene-user-unsubscribe@jakarta.apache.org
> For additional commands, e-mail: lucene-user-help@jakarta.apache.org




Mime
View raw message