lucene-java-user mailing list archives

Site index · List index
Message view « Date » · « Thread »
Top « Date » · « Thread »
From Chris Hostetter <>
Subject Re: setPositionIncrement questions
Date Sun, 30 Mar 2008 05:55:54 GMT

: Breaking proximity data has been discussed several times before, and 
: concluded that setPositionIncrement is the way to go. In regards of it:
: 1. Where should it be called exactly to create the gap properly?

any part of your Analyzer can set the position increment on any token to 
indicate how far after the previous token it should be.

: 2. Is there a way to call it directly somehow while indexing (e.g. after 
: adding a new paragraph to an existing field) instead of appending $$$ 
: for example after the new string I'm indexing, and having to update my 
: tokenizer and filters so they will retain the $$$ chars, indicating the 
: gap request?

if you add multiple Fields with the same name, Lucene will call the  
"getPositionIncrementGap" method on your Analyzer to determine how much 
positionIncreiment to put in between the last token of the first Field and 
the first token of the second Field -- so you could just pass each 
paragraph as a seperate Field instance .. alternately you can have a 
single Field instance, and your Analyzer can use whatever mechanims it 
wants to decide to set the position incriment to something high (a line 
break, a magic char sequence you put in the string, ... whatever you want)

: 3. What is the recommended value to pass setPositionIncrement to create 
: a reasonable gap, and not risk large documents being indexed improperly 
: (I mean, is there some sort of high-bound for the position value?).

MAX_INT .. pick gaps based on your data and the queries you expect (if you 
want gaps betwen paragraps, and your paragraphs tend to be under 200 words 
long, make the gap 500 so "lucene java"~300 can find those words in 
the same paragram, but can never span multiple paragraphs

: 4. What are the consequences of setting PositionIncrement to 0? Does 
: this mean I can index synonyms or stems aside of the "real" words 
: without risking data corruption?

it means the words appear at the same position - synonyms is a great 
example of this use case.


To unsubscribe, e-mail:
For additional commands, e-mail:

View raw message