lucene-java-user mailing list archives

From "Itamar Syn-Hershko" <ita...@divrei-tora.com>
Subject RE: setPositionIncrement questions
Date Sun, 11 May 2008 20:54:13 GMT

Chris,

I ended up hacking StandardTokenizer::next() to check for $^$^$. If the
marker is there, I set the current Token's position increment to 500 and
resume the tokenizing loop, so the next word that is read into that Token
gets a position increment of 500. As far as I can tell it is working well.
How can I check the term positions in a document's field and confirm that
they have actually been incremented? I have tried Luke, but it doesn't seem
to allow this. My field is tokenized and not stored.

Itamar.

-----Original Message-----
From: Chris Hostetter [mailto:hossman_lucene@fucit.org] 
Sent: Sunday, March 30, 2008 8:56 AM
To: Lucene Users
Subject: Re: setPositionIncrement questions


: Breaking proximity data has been discussed several times before, and the
: conclusion was that setPositionIncrement is the way to go. Regarding it:
: 
: 1. Where should it be called exactly to create the gap properly?

any part of your Analyzer can set the position increment on any token to
indicate how far after the previous token it should be.

: 2. Is there a way to call it directly somehow while indexing (e.g. after
: adding a new paragraph to an existing field) instead of appending $$$
: for example after the new string I'm indexing, and having to update my
: tokenizer and filters so they will retain the $$$ chars, indicating the
: gap request?

if you add multiple Fields with the same name, Lucene will call the
"getPositionIncrementGap" method on your Analyzer to determine how much
position increment to put in between the last token of the first Field and
the first token of the second Field -- so you could just pass each paragraph
as a separate Field instance .. alternately you can have a single Field
instance, and your Analyzer can use whatever mechanism it wants to decide to
set the position increment to something high (a line break, a magic char
sequence you put in the string, ... whatever you want)
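
A minimal sketch of the magic-marker approach described above, written in
plain Java as a model of the position bookkeeping rather than against the
real Lucene API. The class name GapSketch, the whitespace tokenizing, and the
choice to start counting at position 0 are assumptions made for illustration;
the marker and the gap of 500 are the values used elsewhere in this thread.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical model: NOT a Lucene class, only the arithmetic behind
// position increments.
class GapSketch {
    static final String MARKER = "$^$^$"; // marker string used in this thread
    static final int GAP = 500;           // gap value used in this thread

    /** Returns the absolute position of every real (non-marker) token. */
    static List<Integer> positions(String text) {
        List<Integer> out = new ArrayList<>();
        int position = -1;  // assumption: the first token lands at position 0
        int increment = 1;  // default increment between adjacent tokens
        for (String tok : text.trim().split("\\s+")) {
            if (tok.isEmpty()) {
                continue;           // ignore empty input
            }
            if (tok.equals(MARKER)) {
                increment = GAP;    // the next real token jumps ahead by GAP
                continue;           // the marker itself is not indexed
            }
            position += increment;
            out.add(position);
            increment = 1;          // back to the default after each token
        }
        return out;
    }

    public static void main(String[] args) {
        System.out.println(positions("a b $^$^$ c d"));
    }
}
```

In this model the tokens before the marker sit at positions 0 and 1, and the
first token after it sits at 1 + 500 = 501.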

: 3. What is the recommended value to pass setPositionIncrement to create
: a reasonable gap, and not risk large documents being indexed improperly
: (I mean, is there some sort of high-bound for the position value?).

Integer.MAX_VALUE is the upper bound .. pick gaps based on your data and the
queries you expect (if you want gaps between paragraphs, and your paragraphs
tend to be under 200 words long, make the gap 500 so "lucene java"~300 can
find those words in the same paragraph, but can never span multiple
paragraphs)
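
The arithmetic behind that choice can be sketched in plain Java. This is a
simplified model, not Lucene's matching code: it only considers two terms
that appear in query order, and it assumes the slop needed is the number of
positions skipped between them. The class and method names are hypothetical.

```java
// Hypothetical model of a two-term sloppy phrase match, in-order terms only.
class SlopSketch {
    /** Positions skipped between the two terms; adjacent terms need slop 0. */
    static int slopNeeded(int posFirst, int posSecond) {
        return posSecond - posFirst - 1;
    }

    /** True if the second term follows the first within the allowed slop. */
    static boolean matches(int posFirst, int posSecond, int slop) {
        return posSecond > posFirst && slopNeeded(posFirst, posSecond) <= slop;
    }

    public static void main(String[] args) {
        // Same paragraph, as far apart as a 200-word paragraph allows.
        System.out.println(matches(0, 199, 300));
        // Last word of one paragraph (199) and first word of the next,
        // which sits 500 positions later (699).
        System.out.println(matches(199, 699, 300));
    }
}
```

Under these assumptions the widest pair inside a paragraph needs a slop of
198, which fits within 300, while the closest pair across a gap of 500 needs
499, which does not.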

: 4. What are the consequences of setting PositionIncrement to 0? Does
: this mean I can index synonyms or stems alongside the "real" words
: without risking data corruption?

it means the words appear at the same position - synonyms are a great example
of this use case.
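
A small plain-Java sketch of how an increment of 0 stacks a synonym on the
same position as the original word. It models only the position arithmetic
and does not use the Lucene API; the class name, the example terms, and the
assumption that the first token lands at position 0 are illustrative.

```java
import java.util.LinkedHashMap;
import java.util.Map;

// Hypothetical model: each term carries its own position increment.
class SynonymSketch {
    /** Maps each term to its absolute position, in emission order. */
    static Map<String, Integer> positions(String[] terms, int[] increments) {
        Map<String, Integer> out = new LinkedHashMap<>();
        int position = -1; // assumption: the first token lands at position 0
        for (int i = 0; i < terms.length; i++) {
            position += increments[i]; // an increment of 0 keeps the position
            out.put(terms[i], position);
        }
        return out;
    }

    public static void main(String[] args) {
        // "fast" is emitted as a synonym of "quick" with increment 0.
        String[] terms = {"quick", "fast", "fox"};
        int[] increments = {1, 0, 1};
        System.out.println(positions(terms, increments));
    }
}
```

Here "quick" and "fast" share position 0 and "fox" follows at position 1, so
a phrase query on either "quick fox" or "fast fox" sees the words as adjacent.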


-Hoss


---------------------------------------------------------------------
To unsubscribe, e-mail: java-user-unsubscribe@lucene.apache.org
For additional commands, e-mail: java-user-help@lucene.apache.org




