Date: Wed, 10 Oct 2001 10:34:35 -0600
From: Dmitry Serebrennikov
To: lucene-dev@jakarta.apache.org
Subject: Re: Token retrieval question

I'm actually working on exactly the same problem. Just yesterday I implemented a new query (called CooccuranceQuery) that, given a list of terms, acts as a BooleanQuery with all of the terms required, and then reports back a list of the other terms in the index with a count of how many documents in the result set contain each one (it actually returns a TermEnum-like object).

There are, of course, a couple of problems with this. First, as you mentioned, it is not a reasonable solution for an index with a large number of unique terms. The number of documents doesn't matter as much, because scanning through documents without retrieving them is fast. However, each term in the index (as reported by reader.terms()) has to have its own TermEnum (or TermPositions) object, and this can quickly get out of hand. I tried it on an 8,000-term index and the performance seemed pretty good, but once you get up to 30,000...

Another problem with this approach is that MultiSearcher does not provide easy access to the terms of the combined index. That I could solve pretty easily, but since the approach won't scale anyway, I'm not doing it yet.

Finally, a bigger problem: even if we were to add some kind of Reader.terms(doc, term) method that listed the terms of a particular document starting with a specified term, we would still get *stemmed* forms of these terms. In an application that wants to display them to the user in some way, this is not acceptable, because stems are not always complete words (even in English, and I don't even know what they will be in other languages). This, of course, has to do with Lucene's architecture, where the Analyzer is separated from indexing, so the index never sees the original word forms.

The only way to solve this that I see right now is to store, externally, a dictionary of "stem, [form1, form2, ...]" for each term in the index, and also a mapping "doc, [stem1, stem2, ...]" that would serve as the document's term vector. For the term dictionary there simply isn't any place in Lucene that could store it. The document's term vector could be stored in Lucene if we created a new on-disk data structure for it. Storing the term vector in the document itself leads to very slow processing, because documents must then be retrieved and the field re-parsed.

Anyway, is there anyone else working on a related problem? Should we collaborate?
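In case it helps the discussion, the counting half of the idea comes out to roughly the following (a simplified sketch, not the actual CooccuranceQuery code; it assumes the documents matching the required terms have already been collected into a BitSet of doc ids, and it reuses a single TermDocs rather than one per term):

    import java.io.IOException;
    import java.util.BitSet;
    import java.util.HashMap;
    import java.util.Map;

    import org.apache.lucene.index.IndexReader;
    import org.apache.lucene.index.Term;
    import org.apache.lucene.index.TermDocs;
    import org.apache.lucene.index.TermEnum;

    // Count, for every unique term in the index, how many documents of a
    // given result set contain it. The BitSet of matching doc ids is
    // assumed to have been filled in by the "all terms required" search.
    public class CooccurrenceSketch {
        public static Map countCooccurrences(IndexReader reader, BitSet hits)
                throws IOException {
            Map counts = new HashMap();          // Term -> Integer
            TermEnum terms = reader.terms();     // all terms, in order
            TermDocs termDocs = reader.termDocs();
            try {
                while (terms.next()) {
                    Term term = terms.term();
                    termDocs.seek(term);         // reposition on this term
                    int n = 0;
                    // Walk the postings; no documents are retrieved here,
                    // which is why the number of documents matters less
                    // than the number of unique terms.
                    while (termDocs.next()) {
                        if (hits.get(termDocs.doc()))
                            n++;
                    }
                    if (n > 0)
                        counts.put(term, new Integer(n));
                }
            } finally {
                termDocs.close();
                terms.close();
            }
            return counts;
        }
    }

Even with a single reused TermDocs, the outer loop still visits every unique term in the index, which is where the 30,000-term case falls over.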
-dmitry

Nestel, Frank wrote:

>Hi,
>
>I've been reading the API and I couldn't figure out a
>nice and fast way to solve the following problem:
>
>I'd like to enumerate the tokens of a document (or of a
>document field). Do Lucene's internal data structures
>allow this kind of traversal, which is (as I understand it)
>of course orthogonal to the access Lucene is optimized for?
>
>More concretely: I have something like 20-50 tokens/words and
>one document, and I'd like to ask the document whether (and how
>often) it contains those particular tokens. The idea is to
>augment search results with (kind of, I know) automatic
>query-dependent keywords.
>
>The only way I see right now is to create 20-50 TermEnums
>and walk through each until I end up at my document, or
>nowhere. That is probably not feasible for a search result
>page with (say) 20 hits in a larger index.
>
>Any (more elegant) chance I missed?
>
>Thank you,
>Frank
>
>--
>Dr. Frank Sven Nestel
>Principal Software Engineer
>
>COI GmbH, Erlanger Straße 62, D-91074 Herzogenaurach
>Phone +49 (0) 9132 82 4611
>http://www.coi.de, mailto:Frank.Nestel@coi.de
>COI - Solutions for Documents
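(For reference, the walk Frank describes comes out to roughly the following; a minimal sketch, where the index path "index", the field name "contents", the word list, and the doc id are all placeholder assumptions:)

    import java.io.IOException;

    import org.apache.lucene.index.IndexReader;
    import org.apache.lucene.index.Term;
    import org.apache.lucene.index.TermDocs;

    // For each of a handful of words, report how often one document
    // contains it. Postings are ordered by doc id, so the scan can stop
    // once it passes the target document.
    public class DocTermFreqSketch {
        static int freqInDoc(IndexReader reader, Term term, int docId)
                throws IOException {
            TermDocs termDocs = reader.termDocs(term);
            try {
                while (termDocs.next()) {
                    if (termDocs.doc() == docId)
                        return termDocs.freq();  // occurrences in this doc
                    if (termDocs.doc() > docId)
                        break;                   // passed it; not present
                }
            } finally {
                termDocs.close();
            }
            return 0;
        }

        public static void main(String[] args) throws IOException {
            IndexReader reader = IndexReader.open("index");   // assumed path
            String[] words = { "lucene", "token", "query" };  // assumed words
            int docId = 42;                                   // assumed hit
            for (int i = 0; i < words.length; i++) {
                Term t = new Term("contents", words[i]);      // assumed field
                System.out.println(words[i] + ": "
                        + freqInDoc(reader, t, docId));
            }
            reader.close();
        }
    }

That is one postings scan per word per hit, i.e. 20-50 scans for each of the 20 result documents, which is exactly the cost Frank is worried about.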