Return-Path: Delivered-To: apmail-lucene-java-dev-archive@www.apache.org Received: (qmail 18462 invoked from network); 11 Nov 2008 14:46:07 -0000 Received: from hermes.apache.org (HELO mail.apache.org) (140.211.11.2) by minotaur.apache.org with SMTP; 11 Nov 2008 14:46:07 -0000 Received: (qmail 56908 invoked by uid 500); 11 Nov 2008 14:46:12 -0000 Delivered-To: apmail-lucene-java-dev-archive@lucene.apache.org Received: (qmail 56860 invoked by uid 500); 11 Nov 2008 14:46:12 -0000 Mailing-List: contact java-dev-help@lucene.apache.org; run by ezmlm Precedence: bulk List-Help: List-Unsubscribe: List-Post: List-Id: Reply-To: java-dev@lucene.apache.org Delivered-To: mailing list java-dev@lucene.apache.org Received: (qmail 56848 invoked by uid 99); 11 Nov 2008 14:46:12 -0000 Received: from athena.apache.org (HELO athena.apache.org) (140.211.11.136) by apache.org (qpsmtpd/0.29) with ESMTP; Tue, 11 Nov 2008 06:46:12 -0800 X-ASF-Spam-Status: No, hits=-2000.0 required=10.0 tests=ALL_TRUSTED X-Spam-Check-By: apache.org Received: from [140.211.11.140] (HELO brutus.apache.org) (140.211.11.140) by apache.org (qpsmtpd/0.29) with ESMTP; Tue, 11 Nov 2008 14:45:00 +0000 Received: from brutus (localhost [127.0.0.1]) by brutus.apache.org (Postfix) with ESMTP id 375FE234C286 for ; Tue, 11 Nov 2008 06:45:44 -0800 (PST) Message-ID: <2034621936.1226414744214.JavaMail.jira@brutus> Date: Tue, 11 Nov 2008 06:45:44 -0800 (PST) From: "Mark Miller (JIRA)" To: java-dev@lucene.apache.org Subject: [jira] Issue Comment Edited: (LUCENE-1448) add getFinalOffset() to TokenStream In-Reply-To: <1070615160.1226395724308.JavaMail.jira@brutus> MIME-Version: 1.0 Content-Type: text/plain; charset=utf-8 Content-Transfer-Encoding: 7bit X-Virus-Checked: Checked by ClamAV on apache.org [ https://issues.apache.org/jira/browse/LUCENE-1448?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12646549#action_12646549 ] markrmiller@gmail.com edited comment on LUCENE-1448 at 11/11/08 6:44 AM: --------------------------------------------------------------- You need that +1 or you will have the subsequent token starting on the tail of the 'stopword'. What I can't figure it out is how exactly these offsets are supposed to match up...abcd has offsets of s:0 e:4, which seems to imply it thinks abcd is 5 chars or the end is one greater than the end index (like with spans). In either case, it seems even if you put back the +1, the endoffsets are off somehow, because some will have an end of +1 the end index, while secondary multi-fields will have an end equal to the end index. Would be cool to have fixed as this also stymies highlighting with multi-fields. *edit* I see. You need that +1 you took out and you need fields after the first to have +1 more for an end offset. Looks they are supposed to be end index +1. *edit2* Nm :) I am a bad counter. I think you only need the +1 back. was (Author: markrmiller@gmail.com): You need that +1 or you will have the subsequent token starting on the tail of the 'stopword'. What I can't figure it out is how exactly these offsets are supposed to match up...abcd has offsets of s:0 e:4, which seems to imply it thinks abcd is 5 chars or the end is one greater than the end index (like with spans). In either case, it seems even if you put back the +1, the endoffsets are off somehow, because some will have an end of +1 the end index, while secondary multi-fields will have an end equal to the end index. Would be cool to have fixed as this also stymies highlighting with multi-fields. *edit* I see. You need that +1 you took out and you need fields after the first to have +1 more for an end offset. Looks they are supposed to be end index +1. > add getFinalOffset() to TokenStream > ----------------------------------- > > Key: LUCENE-1448 > URL: https://issues.apache.org/jira/browse/LUCENE-1448 > Project: Lucene - Java > Issue Type: Bug > Components: Analysis > Reporter: Michael McCandless > Assignee: Michael McCandless > Priority: Minor > Fix For: 2.9 > > Attachments: LUCENE-1448.patch > > > If you add multiple Fieldable instances for the same field name to a document, and you then index those fields with TermVectors storing offsets, it's very likely the offsets for all but the first field instance will be wrong. > This is because IndexWriter under the hood adds a cumulative base to the offsets of each field instance, where that base is 1 + the endOffset of the last token it saw when analyzing that field. > But this logic is overly simplistic. For example, if the WhitespaceAnalyzer is being used, and the text being analyzed ended in 3 whitespace characters, then that information is lost and then next field's offsets are then all 3 too small. Similarly, if a StopFilter appears in the chain, and the last N tokens were stop words, then the base will be 1 + the endOffset of the last non-stopword token. > To fix this, I'd like to add a new getFinalOffset() to TokenStream. I'm thinking by default it returns -1, which means "I don't know so you figure it out", meaning we fallback to the faulty logic we have today. > This has come up several times on the user's list. -- This message is automatically generated by JIRA. - You can reply to this email to add a comment to the issue online. --------------------------------------------------------------------- To unsubscribe, e-mail: java-dev-unsubscribe@lucene.apache.org For additional commands, e-mail: java-dev-help@lucene.apache.org