lucene-java-user mailing list archives

Site index · List index
Message view « Date » · « Thread »
Top « Date » · « Thread »
From "Itamar Syn-Hershko" <ita...@divrei-tora.com>
Subject RE: setPositionIncrement questions
Date Mon, 31 Mar 2008 11:02:59 GMT

Chris,

Thanks for your input.

Please let me make sure that I get this right: while iterating through the
words in a document, I can use my tokenizer to setPositionIncrement(150) on
a specific token, what would make it be more distant from the previous token
than it should have been. The next token will already have position
increment of 1 and therefore will immediately follow that token, with no
extra handling. If I get this right, the best way to achieve that is by
appending a predefined string like $$$, such that will not occur accidently
in my documents, and have my tokenizer set the position increment as well
instead of just tokenizing upon it.

>>>  Lucene will call the "getPositionIncrementGap" method on your Analyzer
to determine how much positionIncreiment to put in between the last token of
the first Field and the first token of the second Field -- so you could just
pass each paragraph as a seperate Field instance

This sounds good, but is risky, since I will have to concatenate my
paragraphs that I DO want to have proximity data in between, and if I forget
to, or accidently don't do that this will corrupt proximity-based searches.
My documents can become very big as well. I guess what I was looking for was
a simpler way - say tell Lucene when I do doc.add(new Field) to set the
position increment for the last token. The "magic char sequence" will do,
but I was wondering if there is a way to do that without ammending my
Tokenizer?

>>> it means the words appear at the same position

... And what does this mean exactly? How can this affect standard searches?
What I might do with this is store stems side-by-side with the original
word. From what I've heard so far this is NOT how you do this for English
texts - you rather store them in a different field, why is that? I thought
if you store them side-by-side you could write a Scorer (or similar) that
will return all relevant results for the stem of a given word, boosting
words with the same exact syntax more than others. Any ideas on that?

Itamar.

-----Original Message-----
From: Chris Hostetter [mailto:hossman_lucene@fucit.org] 
Sent: Sunday, March 30, 2008 8:56 AM
To: Lucene Users
Subject: Re: setPositionIncrement questions


: Breaking proximity data has been discussed several times before, and
: concluded that setPositionIncrement is the way to go. In regards of it:
: 
: 1. Where should it be called exactly to create the gap properly?

any part of your Analyzer can set the position increment on any token to
indicate how far after the previous token it should be.

: 2. Is there a way to call it directly somehow while indexing (e.g. after
: adding a new paragraph to an existing field) instead of appending $$$
: for example after the new string I'm indexing, and having to update my
: tokenizer and filters so they will retain the $$$ chars, indicating the
: gap request?

if you add multiple Fields with the same name, Lucene will call the
"getPositionIncrementGap" method on your Analyzer to determine how much
positionIncreiment to put in between the last token of the first Field and
the first token of the second Field -- so you could just pass each paragraph
as a seperate Field instance .. alternately you can have a single Field
instance, and your Analyzer can use whatever mechanims it wants to decide to
set the position incriment to something high (a line break, a magic char
sequence you put in the string, ... whatever you want)

: 3. What is the recommended value to pass setPositionIncrement to create
: a reasonable gap, and not risk large documents being indexed improperly
: (I mean, is there some sort of high-bound for the position value?).

MAX_INT .. pick gaps based on your data and the queries you expect (if you
want gaps betwen paragraps, and your paragraphs tend to be under 200 words
long, make the gap 500 so "lucene java"~300 can find those words in the same
paragram, but can never span multiple paragraphs

: 4. What are the consequences of setting PositionIncrement to 0? Does
: this mean I can index synonyms or stems aside of the "real" words
: without risking data corruption?

it means the words appear at the same position - synonyms is a great example
of this use case.


-Hoss


---------------------------------------------------------------------
To unsubscribe, e-mail: java-user-unsubscribe@lucene.apache.org
For additional commands, e-mail: java-user-help@lucene.apache.org





---------------------------------------------------------------------
To unsubscribe, e-mail: java-user-unsubscribe@lucene.apache.org
For additional commands, e-mail: java-user-help@lucene.apache.org


Mime
View raw message