lucene-dev mailing list archives

From "Michael Busch (JIRA)" <>
Subject [jira] Commented: (LUCENE-1448) add getFinalOffset() to TokenStream
Date Wed, 22 Jul 2009 07:19:14 GMT


Michael Busch commented on LUCENE-1448:

OK, I think I have this basically working with both the old and the new API (including the LUCENE-1693 changes).

The approach I took is fairly simple: it doesn't require adding a new Attribute. I added the
following method to TokenStream:

  /**
   * This method is called by the consumer after the last token has been consumed,
   * i.e. after {@link #incrementToken()} returned <code>false</code> (using the new
   * TokenStream API) or after {@link #next(Token)} or {@link #next()} returned
   * <code>null</code> (old TokenStream API).
   * <p/>
   * This method can be used to perform any end-of-stream operations, such as setting
   * the final offset of a stream. The final offset of a stream might differ from the
   * offset of the last token, e.g. in case one or more whitespace characters followed
   * the last token but a {@link WhitespaceTokenizer} was used.
   * <p/>
   * @throws IOException
   */
  public void end() throws IOException {
    // do nothing by default
  }
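To make the consumer-side contract concrete, here is a small self-contained sketch (the class names are illustrative, not the actual Lucene API): the consumer pulls tokens until incrementToken() returns false, then calls end() once, after which the stream reports the final offset, which may lie past the end of the last token.

```java
// Simplified model of the proposed end() contract; NOT the real Lucene classes.
final class WhitespaceStream {
    private final String text;
    private int pos = 0;
    int startOffset, endOffset; // offsets of the current token; final offset after end()

    WhitespaceStream(String text) { this.text = text; }

    // Advances to the next whitespace-delimited token, like a WhitespaceTokenizer would.
    boolean incrementToken() {
        while (pos < text.length() && Character.isWhitespace(text.charAt(pos))) pos++;
        if (pos >= text.length()) return false;
        startOffset = pos;
        while (pos < text.length() && !Character.isWhitespace(text.charAt(pos))) pos++;
        endOffset = pos;
        return true;
    }

    // Analogous to the proposed TokenStream.end(): sets the final offset,
    // which includes any trailing whitespace the tokenizer skipped.
    void end() {
        startOffset = endOffset = text.length();
    }
}

public class EndContractDemo {
    public static void main(String[] args) {
        WhitespaceStream ts = new WhitespaceStream("abc def   ");
        int lastTokenEnd = -1;
        while (ts.incrementToken()) {
            lastTokenEnd = ts.endOffset;
        }
        ts.end();
        System.out.println(lastTokenEnd); // 7: end of "def"
        System.out.println(ts.endOffset); // 10: includes the 3 trailing spaces
    }
}
```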

Then I took Mike's patch and implemented end() in all classes where his patch added getFinalOffset().

E.g. in CharTokenizer the implementation looks like this:

  public void end() {
    // set final offset
    int finalOffset = input.correctOffset(offset);
    offsetAtt.setOffset(finalOffset, finalOffset);
  }

I changed DocInverterPerField to call end() after the stream is fully consumed, and to use what offsetAttribute.endOffset() returns as the final offset.
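The offset drift this fixes can be shown with a little arithmetic (the strings here are made up for illustration; this is not DocInverterPerField's actual code). Without end(), the cumulative base for the next field instance is derived from the last token's endOffset, which misses trailing whitespace:

```java
public class FinalOffsetDemo {
    public static void main(String[] args) {
        // First instance of a multi-valued field, ending in 3 whitespace characters.
        String first = "abc def   ";

        // End offset of the last token ("def") that a whitespace tokenizer would produce.
        int lastTokenEnd = first.trim().length();   // 7

        // Final offset as reported via end(): the full length, trailing spaces included.
        int finalOffset = first.length();           // 10

        // Without end(), every offset in the next field instance is this much too small.
        System.out.println(finalOffset - lastTokenEnd); // 3
    }
}
```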

I also added all new tests from Mike's latest patch. 
All unit tests, including the new ones, pass, and so does test-tag.

I'm not posting a patch yet, because this depends on LUCENE-1693.

Mike, Uwe, others: could you please review if this approach makes sense?

> add getFinalOffset() to TokenStream
> -----------------------------------
>                 Key: LUCENE-1448
>                 URL:
>             Project: Lucene - Java
>          Issue Type: Bug
>          Components: Analysis
>            Reporter: Michael McCandless
>            Assignee: Michael Busch
>            Priority: Minor
>             Fix For: 2.9
>         Attachments: LUCENE-1448.patch, LUCENE-1448.patch, LUCENE-1448.patch, LUCENE-1448.patch
> If you add multiple Fieldable instances for the same field name to a document, and you
> then index those fields with TermVectors storing offsets, it's very likely the offsets
> for all but the first field instance will be wrong.
> This is because IndexWriter under the hood adds a cumulative base to the offsets of each
> field instance, where that base is 1 + the endOffset of the last token it saw when
> analyzing that field.
> But this logic is overly simplistic.  For example, if the WhitespaceAnalyzer is being
> used, and the text being analyzed ended in 3 whitespace characters, then that information
> is lost and the next field's offsets are then all 3 too small.  Similarly, if a StopFilter
> appears in the chain, and the last N tokens were stop words, then the base will be 1 + the
> endOffset of the last non-stopword token.
> To fix this, I'd like to add a new getFinalOffset() to TokenStream.  I'm thinking by
> default it returns -1, which means "I don't know so you figure it out", meaning we fall
> back to the faulty logic we have today.
> This has come up several times on the user's list.

This message is automatically generated by JIRA.
You can reply to this email to add a comment to the issue online.

