lucene-dev mailing list archives

From "Michael Busch (JIRA)" <j...@apache.org>
Subject [jira] Commented: (LUCENE-1448) add getFinalOffset() to TokenStream
Date Wed, 22 Jul 2009 07:19:14 GMT

    [ https://issues.apache.org/jira/browse/LUCENE-1448?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12734022#action_12734022 ]

Michael Busch commented on LUCENE-1448:
---------------------------------------

OK, I think I have this basically working with both the old and the new API (including the LUCENE-1693 changes).

The approach I took is fairly simple and doesn't require adding a new Attribute. I added the
following method to TokenStream:

{code:java}
  /**
   * This method is called by the consumer after the last token has been consumed,
   * i.e. after {@link #incrementToken()} returned <code>false</code> (using the
   * new TokenStream API) or after {@link #next(Token)} or {@link #next()} returned
   * <code>null</code> (old TokenStream API).
   * <p/>
   * This method can be used to perform any end-of-stream operations, such as setting
   * the final offset of a stream. The final offset of a stream might differ from the
   * offset of the last token, e.g. when one or more whitespace characters followed the
   * last token and a {@link WhitespaceTokenizer} was used.
   *
   * @throws IOException
   */
  public void end() throws IOException {
    // do nothing by default
  }
{code}

Then I took Mike's patch and implemented end() in all classes where his patch added getFinalOffset().

E.g. in CharTokenizer the implementation looks like this:

{code:java}
  public void end() {
    // set final offset
    int finalOffset = input.correctOffset(offset);
    offsetAtt.setOffset(finalOffset, finalOffset);
  }
{code}
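
To illustrate the contract, here is a minimal, self-contained sketch of a consumer that calls end() once incrementToken() has returned false. The types SketchStream, SketchWhitespaceTokenizer and EndOfStreamSketch are hypothetical stand-ins written for this example; they are not Lucene classes, and the real tokenizer would go through attributes and input.correctOffset() instead of plain fields.

```java
// Hypothetical stand-in for the stream contract discussed above; NOT Lucene's TokenStream.
abstract class SketchStream {
    int startOffset; // offsets of the current token, or the final offset after end()
    int endOffset;

    /** Advances to the next token; returns false when the stream is exhausted. */
    abstract boolean incrementToken();

    /** End-of-stream hook; does nothing by default, as in the proposal. */
    void end() {}
}

// Splits on whitespace and remembers how many characters it has read in total.
class SketchWhitespaceTokenizer extends SketchStream {
    private final String text;
    private int offset = 0; // number of chars consumed so far

    SketchWhitespaceTokenizer(String text) {
        this.text = text;
    }

    @Override
    boolean incrementToken() {
        // skip whitespace in front of the next token (or at the end of the text)
        while (offset < text.length() && Character.isWhitespace(text.charAt(offset))) {
            offset++;
        }
        if (offset == text.length()) {
            return false;
        }
        startOffset = offset;
        while (offset < text.length() && !Character.isWhitespace(text.charAt(offset))) {
            offset++;
        }
        endOffset = offset;
        return true;
    }

    @Override
    void end() {
        // set final offset: everything read so far, including trailing whitespace
        startOffset = offset;
        endOffset = offset;
    }
}

class EndOfStreamSketch {
    /** Returns {endOffset of the last token, final offset reported after end()}. */
    static int[] consume(SketchStream stream) {
        int lastTokenEnd = 0;
        while (stream.incrementToken()) {
            lastTokenEnd = stream.endOffset;
        }
        stream.end(); // called once, after incrementToken() returned false
        return new int[] { lastTokenEnd, stream.endOffset };
    }

    public static void main(String[] args) {
        int[] r = consume(new SketchWhitespaceTokenizer("foo bar   "));
        System.out.println("last token end = " + r[0] + ", final offset = " + r[1]);
    }
}
```

For the input "foo bar   " the last token ends at offset 7, while the offset available after end() is 10, which is the difference the consumer could not see before.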

I changed DocInverterPerField to call end() after the stream is fully consumed and to use what
offsetAttribute.endOffset() returns as the final offset.
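
A rough sketch of the effect on the cumulative base, under stated assumptions: OffsetBaseSketch, lastTokenEnd() and bases() are hypothetical names, this is not the DocInverterPerField code, the "1 +" gap from the issue description is kept in both variants, and the final offset is taken to be the text length (no CharFilter offset correction).

```java
// Hypothetical sketch of per-field offset bookkeeping; not the actual DocInverterPerField code.
class OffsetBaseSketch {
    /** End offset of the last whitespace-delimited token in text (0 if there is none). */
    static int lastTokenEnd(String text) {
        int end = text.length();
        while (end > 0 && Character.isWhitespace(text.charAt(end - 1))) {
            end--;
        }
        return end;
    }

    /**
     * Base added to the offsets of each instance of a multi-valued field.
     * useFinalOffset == false: base grows by 1 + endOffset of the last token (old logic).
     * useFinalOffset == true:  base grows by 1 + final offset, assumed here to be text length.
     */
    static int[] bases(String[] values, boolean useFinalOffset) {
        int[] result = new int[values.length];
        int base = 0;
        for (int i = 0; i < values.length; i++) {
            result[i] = base;
            int consumed = useFinalOffset ? values[i].length() : lastTokenEnd(values[i]);
            base += 1 + consumed; // assumption: same "1 +" gap in both variants
        }
        return result;
    }

    public static void main(String[] args) {
        String[] values = { "foo bar   ", "baz" };
        System.out.println("old base of 2nd instance: " + bases(values, false)[1]);
        System.out.println("new base of 2nd instance: " + bases(values, true)[1]);
    }
}
```

With the first value ending in 3 whitespace characters, the old logic gives the second instance a base of 8 and the final-offset variant gives 11, so the old offsets come out 3 too small.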

I also added all new tests from Mike's latest patch.
All unit tests, including the new ones, pass, and so does test-tag.

I'm not posting a patch yet, because this depends on LUCENE-1693.

Mike, Uwe, others: could you please review if this approach makes sense?

> add getFinalOffset() to TokenStream
> -----------------------------------
>
>                 Key: LUCENE-1448
>                 URL: https://issues.apache.org/jira/browse/LUCENE-1448
>             Project: Lucene - Java
>          Issue Type: Bug
>          Components: Analysis
>            Reporter: Michael McCandless
>            Assignee: Michael Busch
>            Priority: Minor
>             Fix For: 2.9
>
>         Attachments: LUCENE-1448.patch, LUCENE-1448.patch, LUCENE-1448.patch, LUCENE-1448.patch
>
>
> If you add multiple Fieldable instances for the same field name to a document, and you then index those fields with TermVectors storing offsets, it's very likely the offsets for all but the first field instance will be wrong.
> This is because IndexWriter under the hood adds a cumulative base to the offsets of each field instance, where that base is 1 + the endOffset of the last token it saw when analyzing that field.
> But this logic is overly simplistic.  For example, if the WhitespaceAnalyzer is being used, and the text being analyzed ended in 3 whitespace characters, then that information is lost and the next field's offsets are then all 3 too small.  Similarly, if a StopFilter appears in the chain, and the last N tokens were stop words, then the base will be 1 + the endOffset of the last non-stopword token.
> To fix this, I'd like to add a new getFinalOffset() to TokenStream.  I'm thinking by default it returns -1, which means "I don't know so you figure it out", meaning we fall back to the faulty logic we have today.
> This has come up several times on the users' list.

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.



