lucene-java-user mailing list archives

From Alan Burlison <alan.burli...@gmail.com>
Subject Position increment clarification?
Date Sun, 15 Sep 2013 09:14:52 GMT
Firstly, some context. I'm indexing a large set of mbox files which 
contain multiple email messages, each mbox file being related to a 
single issue. I'm therefore indexing each mbox as a single document, 
treating each individual mail as a section of the same document.

To control matching across mails I want to set the position increment. 
I'm trying to decide how best to do this - either by setting the 
increment between tokens within a single field or by using multiple 
instances of a field and setting the increment between each field instance.

Much of the information I've found related to position increments seems 
to refer to Lucene 3 and things seem to be quite a bit different in 4. I 
think I've figured out what is going on, but would appreciate someone 
confirming if I'm right or not.

It looks as if position increments can potentially occur in two places:

1. Between each token in a field. It looks like the 
PositionIncrementAttribute can be used to pass a value to the 
tokenizer that is processing a field.

2. Between multiple instances of a given field within a document. It 
looks like the getPositionIncrementGap method on Analyzer can be 
overridden to set the position increment between each field instance.
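
To make the two kinds of increment concrete, here is a rough plain-Java model of the arithmetic as I understand it. It does not call Lucene at all: the class PositionModel and the method positions are hypothetical names made up for illustration, and the rule it encodes (each token advances the position by its increment, and the gap is added once between consecutive instances of the same field) is a simplification, not Lucene's actual indexing code.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical, simplified model of how token positions accumulate.
// This is NOT Lucene code; it only illustrates the arithmetic.
public class PositionModel {

    /**
     * Computes the position of every token.
     *
     * @param incrementsPerInstance one array per field instance, holding the
     *                              position increment of each token (normally 1)
     * @param gap                   value added between consecutive field instances,
     *                              i.e. what getPositionIncrementGap would return
     */
    public static List<Integer> positions(int[][] incrementsPerInstance, int gap) {
        List<Integer> out = new ArrayList<>();
        int pos = -1; // so that a first token with increment 1 lands on position 0
        for (int i = 0; i < incrementsPerInstance.length; i++) {
            if (i > 0) {
                pos += gap; // the gap applies only between instances
            }
            for (int inc : incrementsPerInstance[i]) {
                pos += inc; // per-token increment
                out.add(pos);
            }
        }
        return out;
    }

    public static void main(String[] args) {
        // Two mails of two tokens each, every token with increment 1.
        int[][] mails = { {1, 1}, {1, 1} };
        System.out.println(positions(mails, 0));   // prints [0, 1, 2, 3]
        System.out.println(positions(mails, 100)); // prints [0, 1, 102, 103]
    }
}
```

With a gap of 0 the second mail's tokens follow directly on from the first; with a gap of 100 they are pushed 100 positions further away.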

However, from looking at the source, it appears that nearly all the 
tokenizers ignore any values passed in a PositionIncrementAttribute and 
only use PositionIncrementAttribute to notify other parts of the 
processing chain of the value they actually used (normally 1). There's a 
filter to manipulate inter-token positions (PositionFilter), but the 
documentation says this:

Deprecated.
(4.4) PositionFilter makes TokenStream graphs inconsistent which can 
cause highlighting bugs.

All of which makes it seem that manipulating the inter-token position 
increment isn't particularly useful.

The second mechanism - overriding Analyzer.getPositionIncrementGap - 
does seem to work, but that obviously means putting each segment of the 
mbox file into a new field instance. Is that the preferred approach?
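
For what it's worth, this is how I am reasoning about the size of the gap. The sketch below is plain Java rather than Lucene code, GapCheck and its methods are hypothetical names, and it assumes that an in-order two-term phrase needs a slop of (second position - first position - 1) to match, which is my understanding of sloppy phrase matching rather than something I have verified against the source.

```java
// Hypothetical helper, NOT Lucene code: a rule-of-thumb check of whether a
// sloppy phrase could match two terms that sit in different mails.
public class GapCheck {

    /** Slop needed for an in-order two-term phrase (assumes secondPos > firstPos). */
    public static int slopNeeded(int firstPos, int secondPos) {
        return secondPos - firstPos - 1;
    }

    /** True if a phrase query with the given slop could match the two terms. */
    public static boolean canMatch(int firstPos, int secondPos, int slop) {
        return slopNeeded(firstPos, secondPos) <= slop;
    }

    public static void main(String[] args) {
        // Last token of mail 1 at position 1; first token of mail 2 at
        // position 2 when the gap is 0, or 102 when the gap is 100.
        System.out.println(canMatch(1, 2, 0));     // prints true: exact phrase spans both mails
        System.out.println(canMatch(1, 102, 10));  // prints false: gap is larger than the slop
        System.out.println(canMatch(1, 102, 100)); // prints true: slop as large as the gap
    }
}
```

If that assumption holds, the gap needs to be larger than the biggest slop I ever intend to query with.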

Thanks,

-- 
Alan Burlison