lucene-java-user mailing list archives

From Peter Becker <pbec...@dstc.edu.au>
Subject Re: Similar Document Search
Date Thu, 21 Aug 2003 05:37:30 GMT
Hi all,

it seems there are quite a few people looking for similar features, i.e. 
(a) document identity and (b) forward indexing. So far we hacked (a) by 
using a wrapper implementing equals()/hashCode() based on a unique field,
but of course that assumes maintaining a unique field in the index. (b) 
is something we haven't tackled yet, but plan to.
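
For illustration, the wrapper idea in (a) might look roughly like the sketch
below. It is only a sketch under stated assumptions: DocKey, its fields and
DocKeyDemo are hypothetical names, and the wrapped object is a plain Object
rather than a Lucene Document, so the example runs without Lucene on the
classpath.

```java
import java.util.HashSet;
import java.util.Objects;
import java.util.Set;

// Hypothetical wrapper: gives a document an identity based on the value of
// a unique field that the application stores in the index (assumption).
final class DocKey {
    private final String uniqueId;  // value of the unique field
    private final Object document;  // the wrapped document (stand-in type)

    DocKey(String uniqueId, Object document) {
        this.uniqueId = Objects.requireNonNull(uniqueId, "uniqueId");
        this.document = document;
    }

    Object document() {
        return document;
    }

    // Two wrappers are equal when their unique field values are equal,
    // regardless of which object instance they wrap.
    @Override
    public boolean equals(Object other) {
        return other instanceof DocKey
                && uniqueId.equals(((DocKey) other).uniqueId);
    }

    @Override
    public int hashCode() {
        return uniqueId.hashCode();
    }
}

final class DocKeyDemo {
    public static void main(String[] args) {
        Set<DocKey> seen = new HashSet<>();
        seen.add(new DocKey("doc-1", "hit from first search"));
        seen.add(new DocKey("doc-1", "same document, second search"));
        seen.add(new DocKey("doc-2", "another document"));
        System.out.println(seen.size()); // the two doc-1 wrappers collapse into one entry
    }
}
```

With equals() and hashCode() defined this way, hits from different searches
can be kept in a Set or used as Map keys, as long as the unique field really
is unique and is present on every document.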

The source code for Mark's thesis seems to be part of the Haystack 
distribution. The comments in the files put it under the Apache license, 
which seems to make it a good candidate for inclusion in at least the Lucene 
sandbox -- although I haven't tried it myself yet. It also sounds like 
something we could use ourselves.

Since the Haystack source is fairly large and I couldn't get the download 
to work at the moment, here is a copy of the relevant part, grabbed from 
one of my colleagues' machines:

  http://www.itee.uq.edu.au/~pbecker/luceneHaystack.tar.gz (22kb)

Note that this is just a tarball of src/org/apache/lucene out of some 
Haystack source. Untested, unmodified.

I'd love to see something like this supported in the Lucene context, where 
people might actually find it :-)

  Peter


Gregor Heinrich wrote:

>Hello Terry,
>
>Lucene can do forward indexing, as Mark Rosen outlines in his Master's
>thesis: http://citeseer.nj.nec.com/rosen03email.html.
>
>We use a similar approach for (probabilistic) latent semantic analysis and
>vector space searches. However, the solution is not really completely fixed
>yet, therefore no code at this time...
>
>Best regards,
>
>Gregor
>
>
>
>
>-----Original Message-----
>From: Peter Becker [mailto:pbecker@dstc.edu.au]
>Sent: Tuesday, August 19, 2003 3:06 AM
>To: Lucene Users List
>Subject: Re: Similar Document Search
>
>
>Hi Terry,
>
>we have been thinking about the same problem and in the end we decided
>that most likely the only good solution to this is to keep a
>non-inverted index, i.e. a map from the documents to the terms. Then you
>can look up the top terms for a document and query for other documents
>matching some of them (where you get the usual question of what is
>actually interesting: high frequency, low frequency or the mid range).
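
The non-inverted index described in the quoted paragraph can be sketched
in plain Java as below. This is a minimal in-memory illustration, not
Lucene code and not the Haystack implementation: ForwardIndex, add,
topTerms and similar are hypothetical names, tokenization is a naive split
on non-word characters, and "most interesting" is assumed to mean "most
frequent", which is only one of the options the paragraph mentions.

```java
import java.util.ArrayList;
import java.util.Collections;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

// Hypothetical forward index: document id -> (term -> frequency).
final class ForwardIndex {
    private final Map<String, Map<String, Integer>> termsByDoc = new HashMap<>();

    // Record the term frequencies of one document.
    void add(String docId, String text) {
        Map<String, Integer> freqs = new HashMap<>();
        for (String token : text.toLowerCase().split("\\W+")) {
            if (!token.isEmpty()) {
                freqs.merge(token, 1, Integer::sum);
            }
        }
        termsByDoc.put(docId, freqs);
    }

    // The n most frequent terms of a document; ties are broken alphabetically.
    List<String> topTerms(String docId, int n) {
        Map<String, Integer> freqs =
                termsByDoc.getOrDefault(docId, Collections.<String, Integer>emptyMap());
        List<Map.Entry<String, Integer>> entries = new ArrayList<>(freqs.entrySet());
        entries.sort((a, b) -> {
            int byFreq = Integer.compare(b.getValue(), a.getValue());
            return byFreq != 0 ? byFreq : a.getKey().compareTo(b.getKey());
        });
        List<String> result = new ArrayList<>();
        for (int i = 0; i < Math.min(n, entries.size()); i++) {
            result.add(entries.get(i).getKey());
        }
        return result;
    }

    // Other documents sharing at least one of the top n terms of docId,
    // ordered by how many of those terms they contain.
    List<String> similar(String docId, int n) {
        List<String> queryTerms = topTerms(docId, n);
        Map<String, Integer> overlap = new HashMap<>();
        for (Map.Entry<String, Map<String, Integer>> e : termsByDoc.entrySet()) {
            if (e.getKey().equals(docId)) {
                continue; // skip the document we started from
            }
            int shared = 0;
            for (String term : queryTerms) {
                if (e.getValue().containsKey(term)) {
                    shared++;
                }
            }
            if (shared > 0) {
                overlap.put(e.getKey(), shared);
            }
        }
        List<String> ids = new ArrayList<>(overlap.keySet());
        ids.sort((a, b) -> {
            int byOverlap = Integer.compare(overlap.get(b), overlap.get(a));
            return byOverlap != 0 ? byOverlap : a.compareTo(b);
        });
        return ids;
    }
}
```

In a real setup the second step would presumably be a query against the
inverted index built from the selected terms, rather than the linear scan
over all documents used here.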
>
>Indexing would probably be quite expensive since Lucene doesn't seem to
>support changes in the index, and the index for the terms would change
>all the time. We haven't implemented it yet, but it shouldn't be hard to
>code. I just wouldn't expect good performance when indexing large
>collections.
>
>  Peter
>
>
>Terry Steichen wrote:
>
>>Is it possible without extensive additional coding to use Lucene to conduct
>>a search based on a document rather than a query?  (One use of this would be
>>to refine a search by selecting one of the hits returned from the initial
>>query and subsequently retrieving other documents "like" the selected one.)
>>
>>Regards,
>>
>>Terry
>
>
>
>---------------------------------------------------------------------
>To unsubscribe, e-mail: lucene-user-unsubscribe@jakarta.apache.org
>For additional commands, e-mail: lucene-user-help@jakarta.apache.org


