Message-ID: <5235A252.1090303@gmail.com>
Date: Sun, 15 Sep 2013 13:04:34 +0100
From: Alan Burlison
To: java-user@lucene.apache.org
CC: Uwe Schindler
Subject: Re: Position increment clarification?

On 15/09/13 12:21, Uwe Schindler wrote:

> Using multiple fields is the preferred approach! Internally in the
> index this does the same like a single field with some gaps in the
> positions.

Right, thanks.

> All Tokenizers inside in Lucene *set* the position increment
> accordingly, but filters are not required to read it (unless they
> change it somehow). The attribute is solely for the IndexWriter when
> creating the index. To insert manual gaps without multiple fields you
> have to write an own TokenFilter or use the deprecated PositionFilter
> one. But this is in general more work and much more complicated and
> harder to understand than adding the same field multiple times.

That confirms what I'd thought based on a wander through the source.
I'd read Lucene in Action and just got myself confused about what the
best approach was.

> The position increment gap is only respected by IndexWriter when
> indexing, TokenStreams don't see it (because every field instance
> gets own TokenStream).

Yes, that makes sense.

> The default position increment gap of all Analyzers has a sensible
> value to prevent PhraseQueries to match over 2 field instances. This
> is the main reason why the gap is there: prevent position-sensitive
> queries to match across fields.

Are you sure? I see this in Analyzer.java:

   * Invoked before indexing a IndexableField instance if
   * terms have already been added to that field. This allows custom
   * analyzers to place an automatic position increment gap between
   * IndexbleField instances using the same field name. The default value
   * position increment gap is 0. With a 0 position increment gap and
   * the typical default token position increment of 1, all terms in a field,
   * including across IndexableField instances, are in successive positions, allowing
   * exact PhraseQuery matches, for instance, across IndexableField instance boundaries.

and I can't find where any of the other analyzers override the
getPositionIncrementGap method.

I've been using Luke to examine the generated index but I haven't been
able to find a way to display the position value of each instance of a
duplicated field so I wasn't quite sure if what I was doing was
actually working.

-- 
Alan Burlison
--
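
The position arithmetic described in the Analyzer.java comment quoted above can be modelled in plain Java with no Lucene dependency, which may help when reasoning about what a given gap should produce. This is only a sketch under stated assumptions: the class PositionGapSketch and its methods are hypothetical names invented for illustration, every token is assumed to carry the default position increment of 1, and the gap is assumed to be applied only when terms have already been added to the field, as the comment says. It does not call or reproduce IndexWriter itself.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

// Hypothetical helper, not part of Lucene: a plain-Java model of how token
// positions accumulate when one field name is added several times.
final class PositionGapSketch {

    // Returns the position of every token, in order, across all instances.
    // Assumption: each token has a position increment of 1, and the gap is
    // added once before each instance that follows already-indexed terms.
    static List<Integer> positions(List<String[]> instances, int gap) {
        List<Integer> result = new ArrayList<>();
        int last = -1;               // so that the first token lands on position 0
        boolean termsAdded = false;  // "terms have already been added to that field"
        for (String[] tokens : instances) {
            if (termsAdded) {
                last += gap;         // the position increment gap between instances
            }
            for (String ignored : tokens) {
                last += 1;           // default token position increment
                result.add(last);
                termsAdded = true;
            }
        }
        return result;
    }

    // True if the last token of the first instance and the first token of the
    // second instance sit on adjacent positions, which is what an exact
    // (slop 0) phrase match across the boundary would need.
    // Assumes at least two instances, both non-empty.
    static boolean exactPhraseSpansBoundary(List<String[]> instances, int gap) {
        List<Integer> pos = positions(instances, gap);
        int firstLength = instances.get(0).length;
        return pos.get(firstLength) - pos.get(firstLength - 1) == 1;
    }

    public static void main(String[] args) {
        List<String[]> instances = Arrays.asList(
                new String[] {"quick", "brown"},
                new String[] {"fox", "jumps"});

        System.out.println(positions(instances, 0));    // [0, 1, 2, 3]
        System.out.println(positions(instances, 100));  // [0, 1, 102, 103]
        System.out.println(exactPhraseSpansBoundary(instances, 0));    // true
        System.out.println(exactPhraseSpansBoundary(instances, 100));  // false
    }
}
```

Under this model a gap of 0 leaves "brown" and "fox" on adjacent positions, so an exact phrase could match across the two field instances, while any gap of 1 or more separates them. Whether a real index behaves this way would still need to be confirmed by reading the positions back from the index.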