lucene-java-user mailing list archives

Site index · List index
Message view « Date » · « Thread »
Top « Date » · « Thread »
From "Terry Steichen" <te...@net-frame.com>
Subject Re: Similar Document Search
Date Thu, 21 Aug 2003 12:53:33 GMT
Hi Peter,

I took a look at Mark's thesis and briefly at some of his code.  It appears
to me that what he's done with the so-called forward indexing is to (a)
include a unique id with each document (allowing retrieval by id rather than
by a standard query), and to (b) include a frequency map class with each
document (allowing easier retrieval of term frequency information).

Now I may be missing something very obvious, but it seems to me that both of
these functions can be done rather easily with the standard (unmodified)
version of Lucene.  Moreover, I don't understand how use of these functions
will facilitate retrieval of documents that are "similar" to a selected
document, as outlined in my original question on this topic.

Could you (or anyone else, of course) perhaps elaborate just a bit on how
using this approach will help achieve that end?

Regards,

Terry

----- Original Message -----
From: "Peter Becker" <pbecker@dstc.edu.au>
To: "Lucene Users List" <lucene-user@jakarta.apache.org>
Sent: Thursday, August 21, 2003 1:37 AM
Subject: Re: Similar Document Search


> Hi all,
>
> it seems there are quite a few people looking for similar features, i.e.
> (a) document identity and (b) forward indexing. So far we hacked (a) by
> using a wrapper implementing equals/hashcode based on a unique field,
> but of course that assumes maintaining a unique field in the index. (b)
> is something we haven't tackled yet, but plan to.
>
> The source code for Mark's thesis seems to be part of the Haystack
> distribution. The comments in the files put it under Apche-license. This
> seems to make it a good candidate to be included at least in the Lucene
> sandbox -- although I haven't tried it myself yet. But it sounds like a
> good candidate for us to use.
>
> Since the haystack source is a bit larger and I actually couldn't get
> the download at the moment, here is a copy of the relevant bit grabbed
> from one of my colleague's machines:
>
>   http://www.itee.uq.edu.au/~pbecker/luceneHaystack.tar.gz (22kb)
>
> Note that this is just a tarball of src/org/apache/lucene out of some
> Haystack source. Untested, unmodified.
>
> I'd love to see something like this supported in the Lucene context were
> people might actually find it :-)
>
>   Peter
>
>
> Gregor Heinrich wrote:
>
> >Hello Terry,
> >
> >Lucene can do forward indexing, as Mark Rosen outlines in his Master's
> >thesis: http://citeseer.nj.nec.com/rosen03email.html.
> >
> >We use a similar approach for (probabilistic) latent semantic analysis
and
> >vector space searches. However, the solution is not really completely
fixed
> >yet, therefore no code at this time...
> >
> >Best regards,
> >
> >Gregor
> >
> >
> >
> >
> >-----Original Message-----
> >From: Peter Becker [mailto:pbecker@dstc.edu.au]
> >Sent: Tuesday, August 19, 2003 3:06 AM
> >To: Lucene Users List
> >Subject: Re: Similar Document Search
> >
> >
> >Hi Terry,
> >
> >we have been thinking about the same problem and in the end we decided
> >that most likely the only good solution to this is to keep a
> >non-inverted index, i.e. a map from the documents to the terms. Then you
> >can query the most terms for the documents and query other documents
> >matching parts of this (where you get the usual question of what is
> >actually interesting: high frequency, low frequency or the mid range).
> >
> >Indexing would probably be quite expensive since Lucene doesn't seem to
> >support changes in the index, and the index for the terms would change
> >all the time. We haven't implemented it yet, but it shouldn't be hard to
> >code. I just wouldn't expect good performance when indexing large
> >collections.
> >
> >  Peter
> >
> >
> >Terry Steichen wrote:
> >
> >
> >
> >>Is it possible without extensive additional coding to use Lucene to
conduct
> >>
> >>
> >a search based on a document rather than a query?  (One use of this would
be
> >to refine a search by selecting one of the hits returned from the initial

> >query and subsequently retrieving other documents "like" the selected
one.)
> >
> >
> >>Regards,
> >>
> >>Terry
> >>
> >>
> >>
> >>
> >>
> >
> >
> >
> >---------------------------------------------------------------------
> >To unsubscribe, e-mail: lucene-user-unsubscribe@jakarta.apache.org
> >For additional commands, e-mail: lucene-user-help@jakarta.apache.org
> >
> >
> >
> >---------------------------------------------------------------------
> >To unsubscribe, e-mail: lucene-user-unsubscribe@jakarta.apache.org
> >For additional commands, e-mail: lucene-user-help@jakarta.apache.org
> >
> >
>
>
>
> ---------------------------------------------------------------------
> To unsubscribe, e-mail: lucene-user-unsubscribe@jakarta.apache.org
> For additional commands, e-mail: lucene-user-help@jakarta.apache.org
>


Mime
View raw message