From: "Michael McCandless (JIRA)"
To: dev@lucene.apache.org
Reply-To: dev@lucene.apache.org
Date: Sun, 26 Sep 2010 09:07:33 -0400 (EDT)
Subject: [jira] Commented: (LUCENE-2668) offset gap should be added regardless of existence of tokens in DocInverterPerField
Message-ID: <22421844.406871285506453638.JavaMail.jira@thor>
In-Reply-To: <23993487.401611285437874474.JavaMail.jira@thor>

[ https://issues.apache.org/jira/browse/LUCENE-2668?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12914969#action_12914969 ]

Michael McCandless
commented on LUCENE-2668:
--------------------------------------------

+1

But, what about index back compat? Should we switch this under Version? Or do we think apps are not relying on this quirky behavior?

In the future, eg w/ write-once attr bindings in the analysis chain (LUCENE-2450), which lets us fully decouple analysis and indexing, how pos/offset gaps are added for multi-valued fields will be fully under the analyzer's control...

> offset gap should be added regardless of existence of tokens in DocInverterPerField
> -----------------------------------------------------------------------------------
>
>                 Key: LUCENE-2668
>                 URL: https://issues.apache.org/jira/browse/LUCENE-2668
>             Project: Lucene - Java
>          Issue Type: Bug
>          Components: Index
>    Affects Versions: 2.9.3, 3.0.2, 3.1, 4.0
>            Reporter: Koji Sekiguchi
>            Priority: Minor
>         Attachments: LUCENE-2668.patch, LUCENE-2668.patch, Test.java
>
>
> Problem: If one value of a multiValued field consists only of a stop word (e.g. "will" in the following sample) and the field is analyzed by StopAnalyzer when indexing, the offsets of the subsequent tokens are not correct.
> {code:title=indexing a multiValued field}
> doc.add( new Field( F, "Mike", Store.YES, Index.ANALYZED, TermVector.WITH_OFFSETS ) );
> doc.add( new Field( F, "will", Store.YES, Index.ANALYZED, TermVector.WITH_OFFSETS ) );
> doc.add( new Field( F, "use", Store.YES, Index.ANALYZED, TermVector.WITH_OFFSETS ) );
> doc.add( new Field( F, "Lucene", Store.YES, Index.ANALYZED, TermVector.WITH_OFFSETS ) );
> {code}
> In this program (soon to be attached), if you use WhitespaceAnalyzer, the offsets (start,end) for "use" and "Lucene" will be use(10,13) and Lucene(14,20). But if you use StopAnalyzer, the offsets will be use(9,12) and lucene(13,19). When searching, since the searcher cannot know which analyzer was used at indexing time, this problem causes FVH (FastVectorHighlighter) to be out of alignment.
> Cause of the problem: StopAnalyzer filters out "will", so the anyToken flag is set to false and the offset gap is not added in DocInverterPerField:
> {code:title=DocInverterPerField.java}
> if (anyToken)
>   fieldState.offset += docState.analyzer.getOffsetGap(field);
> {code}
> I don't understand why the condition is there... If the gap is always added, I think things are simple.

--
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.

---------------------------------------------------------------------
To unsubscribe, e-mail: dev-unsubscribe@lucene.apache.org
For additional commands, e-mail: dev-help@lucene.apache.org
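
For readers who want to check the arithmetic in the report, here is a minimal sketch in plain Java that does not use Lucene at all. It only models the running offset across the values of a multi-valued field. The class `OffsetGapSketch`, the method `invert`, and the `alwaysAddGap` switch are hypothetical names made up for this illustration. It also assumes that each value is a single word, that the offset gap is 1, and that the length of a value is added to the running offset even when its only token was filtered out (which is what the reported numbers imply).

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.Collections;
import java.util.List;
import java.util.Locale;
import java.util.Set;

// Hypothetical stand-in for the offset bookkeeping described in the issue;
// this is NOT Lucene's DocInverterPerField, only a model of its arithmetic.
class OffsetGapSketch {

    /**
     * Walks the values of one multi-valued field and returns "term(start,end)"
     * for every value that survives the stop-word filter.
     *
     * @param alwaysAddGap false models the reported behavior (gap only if a
     *                     token was emitted); true models the proposed fix.
     */
    static List<String> invert(List<String> values, Set<String> stopWords,
                               boolean lowerCase, boolean alwaysAddGap, int offsetGap) {
        List<String> out = new ArrayList<>();
        int offset = 0; // plays the role of fieldState.offset
        for (String value : values) {
            String term = lowerCase ? value.toLowerCase(Locale.ROOT) : value;
            boolean anyToken = !stopWords.contains(term);
            if (anyToken) {
                out.add(term + "(" + offset + "," + (offset + value.length()) + ")");
            }
            // Assumption: the end offset of the value is added even if no token was emitted.
            offset += value.length();
            // The condition under discussion: "if (anyToken)" versus always adding the gap.
            if (alwaysAddGap || anyToken) {
                offset += offsetGap;
            }
        }
        return out;
    }

    public static void main(String[] args) {
        List<String> values = Arrays.asList("Mike", "will", "use", "Lucene");
        Set<String> none = Collections.emptySet();
        Set<String> stop = Collections.singleton("will");

        // Whitespace-style analysis: no stop words, no lowercasing.
        System.out.println(invert(values, none, false, false, 1));
        // Stop-word analysis with the conditional gap (reported behavior).
        System.out.println(invert(values, stop, true, false, 1));
        // Stop-word analysis with the gap always added (proposed behavior).
        System.out.println(invert(values, stop, true, true, 1));
    }
}
```

Under these assumptions the first two calls reproduce the offsets quoted in the report (use(10,13)/Lucene(14,20) versus use(9,12)/lucene(13,19)), and the third shows that always adding the gap brings the stop-word case back in line with the whitespace case.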