lucene-java-user mailing list archives

From "Uwe Schindler" <...@thetaphi.de>
Subject RE: Position increment clarification?
Date Sun, 15 Sep 2013 11:21:24 GMT
Hi,
Using multiple fields is the preferred approach! Internally, the index ends up the same
as for a single field with some gaps in the positions.

All Tokenizers in Lucene *set* the position increment accordingly, but filters are
not required to read it (unless they change it somehow). The attribute exists solely for the
IndexWriter when it creates the index. To insert manual gaps without multiple fields, you have
to write your own TokenFilter or use the deprecated PositionFilter. But in general this is more
work, and more complicated and harder to understand, than adding the same field multiple times.
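
To make the manual-gap variant concrete, here is a small sketch of the position arithmetic only.
It is a simplified model with no Lucene dependency, not Lucene code: the class and method names
are hypothetical, and it assumes that each token's position is the previous position plus that
token's increment, starting from -1. The increment of 101 on the third token is an arbitrary
illustrative value of the kind a custom TokenFilter would have to set at a mail boundary.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical model class; it only mimics how increments add up to positions.
class IncrementModel {
    /**
     * Turns per-token position increments into absolute positions.
     * Assumption: the position starts at -1 and each token adds its increment.
     */
    static List<Integer> positions(int[] increments) {
        List<Integer> out = new ArrayList<>();
        int position = -1;
        for (int inc : increments) {
            position += inc;
            out.add(position);
        }
        return out;
    }

    public static void main(String[] args) {
        // Two mails of two tokens each; the first token of the second mail
        // carries a large increment, which opens a gap in the positions.
        int[] increments = {1, 1, 101, 1};
        System.out.println(positions(increments)); // prints [0, 1, 102, 103]
    }
}
```

In this model the gap lives entirely in one token's increment, which is why the filter producing
it has to know where each mail starts.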

The position increment gap is only respected by IndexWriter when indexing; TokenStreams don't
see it (because every field instance gets its own TokenStream).

The default position increment gap of all Analyzers has a sensible value to prevent PhraseQueries
from matching across 2 field instances. This is the main reason the gap exists: to prevent
position-sensitive queries from matching across field instances.
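
As an illustration of why the gap blocks such matches, the following sketch models the effect with
plain Java collections. It is a simplified model under stated assumptions, not Lucene code: the
class and method names are hypothetical, every token is assumed to have an increment of 1, the gap
is added once between consecutive field instances, and an exact two-term phrase is assumed to match
only when the second term sits at the position directly after the first. The gap values 0 and 100
are chosen for illustration.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical model class; it mimics position assignment across field instances.
class FieldGapModel {
    /** Assigns a position to every token, adding `gap` between field instances. */
    static List<Integer> positions(List<List<String>> instances, int gap) {
        List<Integer> out = new ArrayList<>();
        int position = -1;
        boolean firstInstance = true;
        for (List<String> tokens : instances) {
            if (!firstInstance) {
                position += gap; // the gap is applied between instances only
            }
            firstInstance = false;
            for (int i = 0; i < tokens.size(); i++) {
                position += 1; // assumed increment of 1 per token
                out.add(position);
            }
        }
        return out;
    }

    /** True if `second` occurs at the position directly after some `first`. */
    static boolean exactPhraseMatches(List<List<String>> instances, int gap,
                                      String first, String second) {
        List<String> terms = new ArrayList<>();
        for (List<String> tokens : instances) {
            terms.addAll(tokens);
        }
        List<Integer> pos = positions(instances, gap);
        for (int i = 0; i < terms.size(); i++) {
            if (!terms.get(i).equals(first)) {
                continue;
            }
            for (int j = 0; j < terms.size(); j++) {
                if (terms.get(j).equals(second) && pos.get(j) == pos.get(i) + 1) {
                    return true;
                }
            }
        }
        return false;
    }

    public static void main(String[] args) {
        // Each inner list stands for one mail added as its own field instance.
        List<List<String>> mails = List.of(
                List.of("build", "failed"),
                List.of("again", "today"));
        System.out.println(exactPhraseMatches(mails, 0, "failed", "again"));   // prints true
        System.out.println(exactPhraseMatches(mails, 100, "failed", "again")); // prints false
        System.out.println(exactPhraseMatches(mails, 100, "build", "failed")); // prints true
    }
}
```

With a gap of 100 the model yields positions 0, 1, 102, 103, so a phrase inside one mail still
matches while a phrase spanning the boundary does not.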

Uwe

-----
Uwe Schindler
H.-H.-Meier-Allee 63, D-28213 Bremen
http://www.thetaphi.de
eMail: uwe@thetaphi.de


> -----Original Message-----
> From: Alan Burlison [mailto:alan.burlison@gmail.com]
> Sent: Sunday, September 15, 2013 11:15 AM
> To: java-user@lucene.apache.org
> Subject: Position increment clarification?
> 
> Firstly, some context. I'm indexing a large set of mbox files which contain
> multiple email messages, each mbox file being related to a single issue. I'm
> therefore indexing each mbox as a single document, treating each individual
> mail as a section of the same document.
> 
> To control matching across mails I want to set the position increment.
> I'm trying to decide how best to do this - either by setting the increment
> between tokens within a single field or by using multiple instances of a field
> and setting the increment between each field instance.
> 
> Much of the information I've found related to position increments seems to
> refer to Lucene 3 and things seem to be quite a bit different in 4. I think I've
> figured out what is going on, but would appreciate someone confirming if I'm
> right or not.
> 
> It looks as if position increments can potentially occur in two places:
> 
> 1. Between each token in a field. It looks like the PositionIncrementAttribute
> can be used to pass a value to the tokenizer that is processing a field.
> 
> 2. Between multiple instances of a given field within a document. It looks like
> the getPositionIncrementGap method on Analyzer can be overridden to set
> the position increment between each field instance.
> 
> However, from looking at the source, it appears that nearly all the
> tokenizers ignore any values passed in a PositionIncrementAttribute and only
> use PositionIncrementAttribute to notify other parts of the processing chain
> of the value they actually used (normally 1). There's a filter to manipulate
> inter-token positions (PositionFilter), but the documentation says this:
> 
> Deprecated.
> (4.4) PositionFilter makes TokenStream graphs inconsistent which can cause
> highlighting bugs.
> 
> All of which makes it seem that manipulating the inter-token position
> increment isn't particularly useful.
> 
> The second mechanism - overriding Analyzer.getPositionIncrementGap -
> does seem to work, but that obviously means putting each segment of the
> mbox file into a new field instance. Is that the preferred approach?
> 
> Thanks,
> 
> --
> Alan Burlison
> --
> 
> ---------------------------------------------------------------------
> To unsubscribe, e-mail: java-user-unsubscribe@lucene.apache.org
> For additional commands, e-mail: java-user-help@lucene.apache.org

