lucene-java-user mailing list archives

Site index · List index
Message view « Date » · « Thread »
Top « Date » · « Thread »
From Magnus Johansson <mag...@technohuman.com>
Subject Re: Similar Document Search
Date Tue, 19 Aug 2003 07:33:46 GMT
Hi Peter,

I guess you are right.

I've implemented this for a index with ten millions of really small
documents that all are stored in the index. The documents are never more 
than a thousand
words so re-indexing is quick enough. However it is probably not 
advisable to do
this with bigger documents or documents that need additional parsing.

/magnus


Peter Becker wrote:

> Hi Magnus,
>
> thanks for the offer, but unfortunately I can't/don't want to make the 
> assumption that I can easily access the documents to re-index them. 
> And I don't think this approach would be feasible unless you can keep 
> the documents in memory somehow.
>
> Storing the other/non-inverted/normal/whatever index would be 
> expensive for indexing, but querying should be a lot faster than 
> having to re-index documents. That is in our situation preferable.
>
>  Peter
>
>
> Magnus Johansson wrote:
>
>> Hi Peter
>>
>> If the original document is available. You could extract keywords 
>> from the document
>> at query time. That is when someone asks for documents similar to 
>> document a. You
>> re-analyze document a and in combination with statistics from the 
>> Lucene index you extract
>> keywords from document a that can then be used as a query for 
>> findining similar documents.
>>
>> I've got some sample code if anyone is interested.
>>
>> /magnus
>>
>>



Mime
View raw message