Date: Wed, 10 Oct 2001 10:34:35 -0600
From: Dmitry Serebrennikov
To: lucene-dev@jakarta.apache.org
Subject: Re: Token retrieval question

I'm actually working on exactly the same problem. Just yesterday I implemented a new query (called CooccuranceQuery) that, given a list of terms, acts as a BooleanQuery with all of the terms required, and then reports back a list of the other terms in the index with a count of how many documents in the result set contain each one (it actually returns a TermEnum-like object).

There are, of course, a couple of problems with this. First, as you mentioned, it is not a reasonable solution for an index with a large number of unique terms. The number of documents doesn't matter as much, because scanning through documents without retrieving them is fast. However, each term in the index (as reported by reader.terms()) has to have its own TermEnum (or TermPositions) object, and this can quickly get out of hand. I tried it on an 8,000-term index and the performance seemed pretty good, but once you get up to 30,000...

Another problem with this approach is that MultiSearcher does not provide easy access to the terms of the combined index. That I could solve pretty easily, but since the approach won't scale anyway, I'm not doing it yet.

Finally, a bigger problem: even if we were to add some kind of Reader.terms(doc, term) method that listed the terms of a particular document starting with a specified term, we would still get *stemmed* forms of these terms. In an application that wants to display them to the user in some way, this is not acceptable, because stems are not always complete words (even in English, and I don't even know what they will be in other languages). This, of course, has to do with Lucene's architecture, where the Analyzer is separated from indexing, so the index never sees the original word forms.

The only way to solve this that I see right now is to store, externally, a dictionary of "stem, [form1, form2, ...]" for each term in the index, and also a mapping "doc, [stem1, stem2, ...]" that would serve as the document's term vector. For the term dictionary there simply isn't any place in Lucene that could store it. The document's term vector could be stored in Lucene if we created a new on-disk data structure for it. Storing the term vector in the document itself leads to very slow processing, because documents must then be retrieved and the field re-parsed.

Anyway, is there anyone else working on a related problem? Should we collaborate?
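In case it helps the discussion, the counting half of the idea comes out to roughly the following (a simplified sketch, not the actual CooccuranceQuery code; it assumes the documents matching the required terms have already been collected into a BitSet of doc ids, and it reuses a single TermDocs rather than one per term):

    import java.io.IOException;
    import java.util.BitSet;
    import java.util.HashMap;
    import java.util.Map;

    import org.apache.lucene.index.IndexReader;
    import org.apache.lucene.index.Term;
    import org.apache.lucene.index.TermDocs;
    import org.apache.lucene.index.TermEnum;

    // Count, for every unique term in the index, how many documents of a
    // given result set contain it. The BitSet of matching doc ids is
    // assumed to have been filled in by the "all terms required" search.
    public class CooccurrenceSketch {
        public static Map countCooccurrences(IndexReader reader, BitSet hits)
                throws IOException {
            Map counts = new HashMap();          // Term -> Integer
            TermEnum terms = reader.terms();     // all terms, in order
            TermDocs termDocs = reader.termDocs();
            try {
                while (terms.next()) {
                    Term term = terms.term();
                    termDocs.seek(term);         // reposition on this term
                    int n = 0;
                    // Walk the postings; no documents are retrieved here,
                    // which is why the number of documents matters less
                    // than the number of unique terms.
                    while (termDocs.next()) {
                        if (hits.get(termDocs.doc()))
                            n++;
                    }
                    if (n > 0)
                        counts.put(term, new Integer(n));
                }
            } finally {
                termDocs.close();
                terms.close();
            }
            return counts;
        }
    }

Even with a single reused TermDocs, the outer loop still visits every unique term in the index, which is where the 30,000-term case falls over.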
-dmitry

Nestel, Frank wrote:

>Hi,
>
>I've been reading the API and I couldn't figure out a
>nice and fast way to solve the following problem:
>
>I'd like to enumerate the tokens of a document (or of a
>document field). Do Lucene's internal data structures
>allow this kind of traversal, which is (as I understand it)
>of course orthogonal to the access Lucene is optimized for?
>
>More concretely: I have something like 20-50 tokens/words and
>one document, and I'd like to ask the document whether (and how
>often) it contains those particular tokens. The idea is to
>augment search results with (kind of, I know) automatic
>query-dependent keywords.
>
>The only way I see right now is to create 20-50 TermEnums
>and walk through each until I end up at my document, or
>nowhere. That is probably not feasible for a search result
>page with (say) 20 hits in a larger index.
>
>Any (more elegant) chance I missed?
>
>Thank you,
>Frank
>
>--
>Dr. Frank Sven Nestel
>Principal Software Engineer
>
>COI GmbH, Erlanger Straße 62, D-91074 Herzogenaurach
>Phone +49 (0) 9132 82 4611
>http://www.coi.de, mailto:Frank.Nestel@coi.de
>COI - Solutions for Documents
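(For reference, the walk Frank describes comes out to roughly the following; a minimal sketch, where the index path "index", the field name "contents", the word list, and the doc id are all placeholder assumptions:)

    import java.io.IOException;

    import org.apache.lucene.index.IndexReader;
    import org.apache.lucene.index.Term;
    import org.apache.lucene.index.TermDocs;

    // For each of a handful of words, report how often one document
    // contains it. Postings are ordered by doc id, so the scan can stop
    // once it passes the target document.
    public class DocTermFreqSketch {
        static int freqInDoc(IndexReader reader, Term term, int docId)
                throws IOException {
            TermDocs termDocs = reader.termDocs(term);
            try {
                while (termDocs.next()) {
                    if (termDocs.doc() == docId)
                        return termDocs.freq();  // occurrences in this doc
                    if (termDocs.doc() > docId)
                        break;                   // passed it; not present
                }
            } finally {
                termDocs.close();
            }
            return 0;
        }

        public static void main(String[] args) throws IOException {
            IndexReader reader = IndexReader.open("index");   // assumed path
            String[] words = { "lucene", "token", "query" };  // assumed words
            int docId = 42;                                   // assumed hit
            for (int i = 0; i < words.length; i++) {
                Term t = new Term("contents", words[i]);      // assumed field
                System.out.println(words[i] + ": "
                        + freqInDoc(reader, t, docId));
            }
            reader.close();
        }
    }

That is one postings scan per word per hit, i.e. 20-50 scans for each of the 20 result documents, which is exactly the cost Frank is worried about.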