lucene-java-user mailing list archives

From "Erick Erickson" <erickerick...@gmail.com>
Subject Re: Question about startOffset and endOffset
Date Mon, 12 May 2008 16:23:02 GMT
Is this a theoretical question or is there a use-case you're trying
to support? If the latter, a statement of the problem you're trying
to solve would be helpful.

If the former, setting all your start offsets to 0 seems wrong. You're
essentially saying that all tokens are at the beginning of the document
and each one is some ever-increasing distance from the start. Some
of your later words will look really, really long (like the length
of your document).
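
To make that concrete, here is a minimal sketch in plain Java. It does not
use the Lucene API: the Tok class and the tokenize method are made-up
stand-ins, and the whitespace splitting is an assumption for illustration
only. It shows how the apparent length of a token (end minus start) grows
once every start offset is forced to 0.

```java
import java.util.ArrayList;
import java.util.List;

class OffsetSketch {
    /** Hypothetical stand-in for a token: text plus character offsets. */
    static final class Tok {
        final String text;
        final int start;
        final int end;

        Tok(String text, int start, int end) {
            this.text = text;
            this.start = start;
            this.end = end;
        }

        /** Number of source characters the token claims to cover. */
        int span() {
            return end - start;
        }
    }

    /** Splits on single spaces; optionally forces every start offset to 0. */
    static List<Tok> tokenize(String doc, boolean zeroStart) {
        List<Tok> out = new ArrayList<>();
        int cursor = 0;
        for (String word : doc.split(" ")) {
            int start = doc.indexOf(word, cursor);
            int end = start + word.length();
            out.add(new Tok(word, zeroStart ? 0 : start, end));
            cursor = end;
        }
        return out;
    }

    public static void main(String[] args) {
        String doc = "the quick brown fox";
        for (Tok t : tokenize(doc, true)) {
            System.out.println(t.text + " claims " + t.span() + " chars");
        }
    }
}
```

With real offsets the last token, "fox", covers 3 characters. With the start
forced to 0 it claims all 19 characters of the string, which is the
"really, really long" word effect described above.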

Offhand, I expect this will affect span queries, phrase
queries, and who knows what else? Maybe scoring?
Or maybe not, maybe some (or all) of these work off of
term positions rather than offsets. So unless you're trying to
accomplish something specific, I sure wouldn't do this
given the problems it *could* create.
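
If matching does go by term positions, then what places the synonym is its
position increment rather than its offsets. Below is a simplified model of
how increments turn into positions. It is an assumption-laden sketch, not
Lucene's actual indexing code, and the class and method names are
hypothetical.

```java
import java.util.Arrays;

class PositionSketch {
    /**
     * Accumulates position increments into absolute positions.
     * Assumes counting starts just before the first token, so a first
     * increment of 1 yields position 0.
     */
    static int[] positions(int[] increments) {
        int[] out = new int[increments.length];
        int pos = -1;
        for (int i = 0; i < increments.length; i++) {
            pos += increments[i];
            out[i] = pos;
        }
        return out;
    }

    public static void main(String[] args) {
        // "quick" (increment 1), "SYNONYM" (increment 0), "fox" (increment 1)
        int[] increments = {1, 0, 1};
        System.out.println(Arrays.toString(positions(increments)));
    }
}
```

In this model the token with increment 0 lands on the same position as the
token before it, regardless of what start and end offsets it carries.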

That said, I'm not much of an expert on how the offsets are
used, but I'd be really leery of changing them "just for the fun of
it" <G>......

Best
Erick

On Mon, May 12, 2008 at 12:06 PM, Brendan Grainger <
brendan.grainger@gmail.com> wrote:

> Hi,
>
> I have a TokenStream that inserts synonym tokens into the stream when
> matched. One thing I am wondering about is what is the effect of the
> startOffset and endOffset. I have something like this:
>
> Token synonymToken = new Token(originalToken.startOffset(),
> originalToken.endOffset(), "SYNONYM");
> synonymToken.setPositionIncrement(0);
>
> What I am wondering is: if I set the startOffset to 0 and the
> endOffset to the length of the synonym string, what effect will this have?
>
> eg
> Token synonymToken = new Token(0, repTok.endOffset(), "SYNONYM");
> synonymToken.setPositionIncrement(0);
>
> Thanks
>
>
