From: Doug Cutting
Date: Fri, 05 Dec 2003 09:26:10 -0800
To: Lucene Developers List
Subject: Re: Position increment (tokens, DocumentWriter), max field length
Message-ID: <3FD0BFB2.5070401@lucene.com>
In-Reply-To: <200312042226.35948.tatu@hypermall.net>

Tatu Saloranta wrote:
> I have a question related to the way position increment is handled in
> DocumentWriter's invertDocument (main tokenization/indexing method).
> It does following:
>
>   for (Token t = stream.next(); t != null; t = stream.next()) {
>     position += (t.getPositionIncrement() - 1);
>     addPosition(fieldName, t.termText(), position++);
>     if (position > maxFieldLength) break;
>   }
>
> If I'm not mistaken, this means that maxFieldLength comparison counts in
> "holes" in token sequence. And such behaviour might be problematic,
> especially if such holes are used to mark sentence/paragraph boundaries (to
> reduce score or avoid hit for phrase queries), which was discussed recently.
> Also, since that count is saved in index, such holes "bloat" perceived
> document size, and thus reduce document's relative weight.
>
> It'd be easy to fix this to only count tokens (I can provide patch if so), but
> I wanted to make sure I don't misunderstand something fundamental here?

I think that sounds like a reasonable fix.  The field length is also used
for length normalization, and ignoring holes seems like the right thing
there too.  I don't think field length is used anywhere else, so such a
change shouldn't break anything.

Speaking of sentence/paragraph boundaries: Following is a proposal I just
wrote.  It also includes the PhraseQuery change I alluded to in my message
yesterday to Erik.  If this proposal is accepted, then I'll have funding to
develop it ASAP.

Does this sound like a reasonable approach to the problem?  It adds no costs
for folks who don't use the new features.  When folks do use it, their index
gets larger, but sentence- and paragraph- and section-type searches are exact
and as efficient as ordinary phrase searches.

Using position-increment is still possible: one can, e.g., increment by 100
for sentence, 1000 for paragraph and 10,000 for section.  This does not
increase index size nearly as much, but searches are not exact, and are
slightly slower, since they must specify non-zero slop (slop=0 queries use a
faster algorithm).

Which approach is preferred?  Would it be confusing to support both?
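To make the interaction between large position increments and the maxFieldLength check concrete, here is a minimal, self-contained sketch. It does not use Lucene classes: Tok, indexedByPosition, indexedByTokenCount and example are hypothetical stand-ins, and the token-counting variant is only one possible shape for the fix, not an actual patch. It keeps the same comparison as the quoted loop and changes only what is counted.

```java
import java.util.ArrayList;
import java.util.List;

class FieldLengthSketch {
    /** Hypothetical stand-in for a token: text plus position increment. */
    static final class Tok {
        final String text;
        final int positionIncrement;
        Tok(String text, int positionIncrement) {
            this.text = text;
            this.positionIncrement = positionIncrement;
        }
    }

    /** Mirrors the quoted loop: the limit is compared against the position. */
    static int indexedByPosition(List<Tok> tokens, int maxFieldLength) {
        int position = 0;
        int indexed = 0;
        for (Tok t : tokens) {
            position += (t.positionIncrement - 1);
            position++;            // stands in for addPosition(..., position++)
            indexed++;
            if (position > maxFieldLength) break;
        }
        return indexed;
    }

    /** Assumed fix: the limit is compared against the number of tokens seen. */
    static int indexedByTokenCount(List<Tok> tokens, int maxFieldLength) {
        int position = 0;
        int length = 0;
        for (Tok t : tokens) {
            position += (t.positionIncrement - 1);
            position++;            // positions still include the holes
            length++;              // but the field length ignores them
            if (length > maxFieldLength) break;
        }
        return length;
    }

    /** Six tokens, with an increment of 100 at each sentence start after the first. */
    static List<Tok> example() {
        List<Tok> toks = new ArrayList<>();
        toks.add(new Tok("the", 1));
        toks.add(new Tok("start", 1));
        toks.add(new Tok("the", 100));
        toks.add(new Tok("usa", 1));
        toks.add(new Tok("the", 100));
        toks.add(new Tok("end", 1));
        return toks;
    }

    public static void main(String[] args) {
        List<Tok> toks = example();
        System.out.println(indexedByPosition(toks, 50));    // prints 3
        System.out.println(indexedByTokenCount(toks, 50));  // prints 6
    }
}
```

With a limit of 50, the position-based check stops after the first sentence boundary even though only three tokens have been indexed, while the token-based check indexes all six. When every increment is 1 the two variants behave identically.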
Note that the relative term position in PhraseQuery feature is not dependent
on the levels feature, and could be easily added right now.

I hope this is not too confusing...

Doug

------

MATCH WITHIN SENTENCE, PARAGRAPH, SECTION, ETC.

The Token class has a field called "type", which is used by analyzers to
distinguish lexical types, but is currently ignored when indexing.  We change
this so that token types can name "levels".  Each token type can be declared
to define a unique level.  There is also a default level for all undeclared
token types.

When indexing a token stream, a position counter is maintained for each
level.  Tokens at non-default levels are not indexed, and simply increment
their counter.  Tokens at the default level are both indexed and increment
the default counter.

Consider the following example.  Types "sent" and "para" are declared as
levels.  Types "word" and "abbr" are undeclared, and hence default.  The
first two lines below are a sample sequence of token texts and types.  The
next three lines are the values of the counters when indexing this sequence
of tokens.

Tokens:
  text:     The   start .     The   USA         .     The   end   .
  type:     word  word  sent  word  abbr  para  sent  word  word  sent

Counters:
  default:  0     1     1     2     3     3     3     4     5     5
  sent:     0     0     0     1     1     1     1     2     2     2
  para:     0     0     0     0     0     0     1     1     1     1

Thus the word "the" appears at default level positions 0, 2, and 4, at "sent"
level positions 0, 1, and 2, and at "para" level positions 0, 0, and 1.

The following method is used to declare a level:

  IndexWriter.addLevel(String field, String type);

Note that this must be performed each time an IndexWriter is constructed.  If
it is performed when some documents are added but not when others are, or if
indexes with different positioned levels are merged, then all positions will
be zero for levels undeclared when indexing.
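The counter rule above can be traced with a small, self-contained sketch. It does not use Lucene: LevelCountersSketch, positionsOf and EXAMPLE are hypothetical names, and the "para" token is given empty text as an assumption, since the example does not show its text. Each indexed (default-level) token is recorded with the current value of every counter; tokens at a declared level only increment their own counter.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

class LevelCountersSketch {
    static final String DEFAULT = "default";

    /** The example stream as {text, type} pairs; the "para" text is an assumption. */
    static final String[][] EXAMPLE = {
        {"The", "word"}, {"start", "word"}, {".", "sent"},
        {"The", "word"}, {"USA", "abbr"}, {"", "para"}, {".", "sent"},
        {"The", "word"}, {"end", "word"}, {".", "sent"},
    };

    /** Returns, per level, the positions at which the term occurs. */
    static Map<String, List<Integer>> positionsOf(
            String term, String[][] tokens, List<String> levels) {
        Map<String, Integer> counters = new LinkedHashMap<>();
        counters.put(DEFAULT, 0);
        for (String level : levels) counters.put(level, 0);

        Map<String, List<Integer>> result = new LinkedHashMap<>();
        for (String level : counters.keySet()) result.put(level, new ArrayList<>());

        for (String[] tok : tokens) {
            String text = tok[0];
            String type = tok[1];
            if (levels.contains(type)) {
                // Declared level: not indexed, just advances its own counter.
                counters.merge(type, 1, Integer::sum);
            } else {
                // Default level: indexed at the current value of every counter.
                if (text.equalsIgnoreCase(term)) {
                    for (Map.Entry<String, Integer> e : counters.entrySet()) {
                        result.get(e.getKey()).add(e.getValue());
                    }
                }
                counters.merge(DEFAULT, 1, Integer::sum);
            }
        }
        return result;
    }

    public static void main(String[] args) {
        // prints {default=[0, 2, 4], sent=[0, 1, 2], para=[0, 0, 1]}
        System.out.println(positionsOf("the", EXAMPLE, Arrays.asList("sent", "para")));
    }
}
```

The output agrees with the positions stated for "the" in the example. If a level is left undeclared, its token type falls through to the default branch, which is why declarations need to be consistent across writers.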
An index maintains, for each field, the known levels, accessed through:

  public String[] IndexReader.positionedLevels(String field);

The stream of positions at each level is accessed with:

  public TermPositions IndexReader.getPositions(Term term, String level);

Note that the index size increases substantially for each level added.  Each
additional level adds around 50% to the default-only index size.  Thus an
index with two additional levels would require around 200% the size of an
index with no additional levels.

PhraseQuery's constructor takes a level for the match:

  public PhraseQuery(String level);

Phrase queries are also permitted to explicitly specify the position of each
term in the phrase.  For example, if two phrase query terms are both at
position=0, with phrase slop=0, then they must occur at exactly the same
positions.  And if they're at position=0 and position=1 respectively, with
phrase slop=0, then they must occur adjacently, as an exact phrase.

  public PhraseQuery.add(Term term, int position);

If, for example, sentence boundaries are at level=sent, then a phrase query
at level=sent with slop=0 and position=0 for all terms would require that all
query terms are in the same sentence.

------