Message-ID: <5235A252.1090303@gmail.com>
Date: Sun, 15 Sep 2013 13:04:34 +0100
From: Alan Burlison <alan.burlison@gmail.com>
To: java-user@lucene.apache.org
CC: Uwe Schindler <uwe@thetaphi.de>
Subject: Re: Position increment clarification?
References: <52357A8C.7030302@gmail.com>
 <012b01ceb205$b7266350$257329f0$@thetaphi.de>
In-Reply-To: <012b01ceb205$b7266350$257329f0$@thetaphi.de>
Content-Type: text/plain; charset=UTF-8; format=flowed
Content-Transfer-Encoding: 7bit

On 15/09/13 12:21, Uwe Schindler wrote:

> Using multiple fields is the preferred approach! Internally, in the
> index, this does the same thing as a single field with some gaps in
> the positions.

Right, thanks.

> All Tokenizers in Lucene *set* the position increment
> accordingly, but filters are not required to read it (unless they
> change it somehow). The attribute is solely for the IndexWriter when
> creating the index. To insert manual gaps without multiple fields you
> have to write your own TokenFilter or use the deprecated
> PositionFilter. But this is in general more work, much more
> complicated, and harder to understand than adding the same field
> multiple times.

That confirms what I'd thought based on a wander through the source. I'd 
read Lucene in Action and just got myself confused about what the best 
approach was.

> The position increment gap is only respected by IndexWriter when
> indexing; TokenStreams don't see it (because every field instance
> gets its own TokenStream).

Yes, that makes sense.

> The default position increment gap of all Analyzers has a sensible
> value to prevent PhraseQueries from matching across 2 field
> instances. This is the main reason why the gap is there: to prevent
> position-sensitive queries from matching across fields.

Are you sure? I see this in Analyzer.java:

* Invoked before indexing a IndexableField instance if
* terms have already been added to that field.  This allows custom
* analyzers to place an automatic position increment gap between
* IndexbleField instances using the same field name.  The default value
* position increment gap is 0.  With a 0 position increment gap and
* the typical default token position increment of 1, all terms in a field,
* including across IndexableField instances, are in successive positions, allowing
* exact PhraseQuery matches, for instance, across IndexableField instance boundaries.

and I can't find where any of the other analyzers override the 
getPositionIncrementGap method.
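
To check that I'm reading that Javadoc correctly, here is a toy model in plain Java of how I understand positions to be assigned across several instances of the same field. It is not Lucene code: the class and method names (PositionGapModel, positions, phraseMatches) are made up for illustration, and it assumes every token carries a position increment of 1 and that the gap is added only once terms have already been indexed for the field.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

/** Toy model only (hypothetical names, not Lucene's implementation). */
class PositionGapModel {

    /** Positions of all tokens, in order, for several instances of one field. */
    static int[] positions(String[][] instances, int gap) {
        List<Integer> out = new ArrayList<>();
        int next = 0; // position the next token would receive
        for (String[] tokens : instances) {
            if (!out.isEmpty()) {
                next += gap; // gap applies only if terms were already added
            }
            for (int i = 0; i < tokens.length; i++) {
                out.add(next);
                next++; // assumption: every token has position increment 1
            }
        }
        int[] result = new int[out.size()];
        for (int i = 0; i < result.length; i++) {
            result[i] = out.get(i);
        }
        return result;
    }

    /** True if 'second' sits exactly one position after 'first' (exact phrase). */
    static boolean phraseMatches(String[][] instances, int gap,
                                 String first, String second) {
        int[] pos = positions(instances, gap);
        List<String> flat = new ArrayList<>();
        for (String[] tokens : instances) {
            flat.addAll(Arrays.asList(tokens));
        }
        for (int i = 0; i < flat.size(); i++) {
            if (!flat.get(i).equals(first)) {
                continue;
            }
            for (int j = 0; j < flat.size(); j++) {
                if (flat.get(j).equals(second) && pos[j] == pos[i] + 1) {
                    return true;
                }
            }
        }
        return false;
    }

    public static void main(String[] args) {
        // Two instances of the same field in one document.
        String[][] doc = { {"quick", "brown"}, {"fox", "jumps"} };
        System.out.println(Arrays.toString(positions(doc, 0)));     // [0, 1, 2, 3]
        System.out.println(Arrays.toString(positions(doc, 100)));   // [0, 1, 102, 103]
        System.out.println(phraseMatches(doc, 0, "brown", "fox"));   // true
        System.out.println(phraseMatches(doc, 100, "brown", "fox")); // false
    }
}
```

If that model is right, then with the documented default gap of 0 the phrase "brown fox" would match across the two field instances, and only a non-zero gap (100 here is an arbitrary value) would prevent it.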

I've been using Luke to examine the generated index, but I haven't been
able to find a way to display the position value of each instance of a
duplicated field, so I wasn't quite sure whether what I was doing was
actually working.

-- 
Alan Burlison