Mailing-List: contact java-dev-help@lucene.apache.org; run by ezmlm
Precedence: bulk
Reply-To: java-dev@lucene.apache.org
Message-ID: <2034621936.1226414744214.JavaMail.jira@brutus>
Date: Tue, 11 Nov 2008 06:45:44 -0800 (PST)
From: "Mark Miller (JIRA)" <jira@apache.org>
To: java-dev@lucene.apache.org
Subject: [jira] Issue Comment Edited: (LUCENE-1448) add getFinalOffset() to
 TokenStream
In-Reply-To: <1070615160.1226395724308.JavaMail.jira@brutus>
MIME-Version: 1.0
Content-Type: text/plain; charset=utf-8
Content-Transfer-Encoding: 7bit


    [ https://issues.apache.org/jira/browse/LUCENE-1448?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12646549#action_12646549 ] 

markrmiller@gmail.com edited comment on LUCENE-1448 at 11/11/08 6:44 AM:
---------------------------------------------------------------

You need that +1 or you will have the subsequent token starting on the tail of the 'stopword'.

What I can't figure out is how exactly these offsets are supposed to match up... abcd has offsets of s:0 e:4, which seems to imply it thinks abcd is 5 chars or the end is one greater than the end index (like with spans). In either case, it seems even if you put back the +1, the end offsets are off somehow, because some will have an end of +1 the end index, while secondary multi-fields will have an end equal to the end index.

Would be cool to have fixed as this also stymies highlighting with multi-fields.

*edit*

I see. You need that +1 you took out and you need fields after the first to have +1 more for an end offset. Looks like they are supposed to be end index +1.

*edit2*

Nm :) I am a bad counter. I think you only need the +1 back.
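
For anyone else counting along, here is a minimal sketch of the convention I ended up with, assuming the end offset is exclusive (one past the last character), so s:0 e:4 covers exactly the 4 chars of abcd. The class and method names are made up for illustration and are not Lucene API.

```java
final class EndOffsetSketch {
    /** Token length when the end offset is exclusive (one past the last char). */
    static int length(int start, int end) {
        return end - start;
    }

    /** Recovers the token text from the source using the half-open range [start, end). */
    static String slice(String source, int start, int end) {
        return source.substring(start, end);
    }

    public static void main(String[] args) {
        String source = "abcd efg";
        System.out.println(slice(source, 0, 4)); // prints abcd
        System.out.println(length(0, 4));        // prints 4
        System.out.println(slice(source, 5, 8)); // prints efg
    }
}
```

This is the same half-open convention String.substring uses, which is why highlighting can slice the stored text directly with the offsets.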

> add getFinalOffset() to TokenStream
> -----------------------------------
>
>                 Key: LUCENE-1448
>                 URL: https://issues.apache.org/jira/browse/LUCENE-1448
>             Project: Lucene - Java
>          Issue Type: Bug
>          Components: Analysis
>            Reporter: Michael McCandless
>            Assignee: Michael McCandless
>            Priority: Minor
>             Fix For: 2.9
>
>         Attachments: LUCENE-1448.patch
>
>
> If you add multiple Fieldable instances for the same field name to a document, and you then index those fields with TermVectors storing offsets, it's very likely the offsets for all but the first field instance will be wrong.
> This is because IndexWriter under the hood adds a cumulative base to the offsets of each field instance, where that base is 1 + the endOffset of the last token it saw when analyzing that field.
> But this logic is overly simplistic.  For example, if the WhitespaceAnalyzer is being used, and the text being analyzed ended in 3 whitespace characters, then that information is lost and the next field's offsets are all 3 too small.  Similarly, if a StopFilter appears in the chain, and the last N tokens were stop words, then the base will be 1 + the endOffset of the last non-stopword token.
> To fix this, I'd like to add a new getFinalOffset() to TokenStream.  I'm thinking by default it returns -1, which means "I don't know so you figure it out", meaning we fall back to the faulty logic we have today.
> This has come up several times on the user's list.
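
The trailing-whitespace case in the description can be sketched in plain Java, without Lucene on the classpath. This is only an illustration under stated assumptions: tokenize() is a stand-in for a whitespace tokenizer, naiveBase() mirrors the "1 + endOffset of the last token" rule quoted above, and finalOffsetBase() is a hypothetical version that assumes the final offset equals the length of the text consumed and that the +1 is kept. None of these names are Lucene API.

```java
import java.util.ArrayList;
import java.util.List;

final class OffsetBaseSketch {
    /** Splits on whitespace; each entry is {startOffset, endOffset}, end exclusive. */
    static List<int[]> tokenize(String text) {
        List<int[]> offsets = new ArrayList<>();
        int i = 0;
        while (i < text.length()) {
            // Skip whitespace before the next token.
            while (i < text.length() && Character.isWhitespace(text.charAt(i))) i++;
            int start = i;
            // Consume the token characters.
            while (i < text.length() && !Character.isWhitespace(text.charAt(i))) i++;
            if (i > start) offsets.add(new int[] {start, i});
        }
        return offsets;
    }

    /** Base as described in the issue: 1 + endOffset of the last token seen. */
    static int naiveBase(List<int[]> offsets) {
        // Returning 0 for an empty field is an assumption of this sketch.
        return offsets.isEmpty() ? 0 : offsets.get(offsets.size() - 1)[1] + 1;
    }

    /** Hypothetical base: 1 + a final offset that covers all of the text. */
    static int finalOffsetBase(String text) {
        return text.length() + 1;
    }

    public static void main(String[] args) {
        String first = "ab cd   "; // ends in 3 whitespace characters
        System.out.println(naiveBase(tokenize(first))); // prints 6
        System.out.println(finalOffsetBase(first));     // prints 9
    }
}
```

The two bases differ by 3, which matches the "all 3 too small" offsets described for the next field instance.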
