From: Mike Sokolov
Date: Thu, 14 Oct 2010 12:08:02 -0400
To: java-user@lucene.apache.org
Subject: proposed change to CharTokenizer

Background: I've been trying to enable hit highlighting of XML documents in such a way that the highlighting preserves the well-formedness of the XML. I thought I could get this to work by implementing a CharFilter that extracts text from XML (somewhat like HTMLStripCharFilter, except I am using an XML parser - however I think the concept is also applicable to HTMLStripCharFilter) while preserving the offsets of the text in the original XML document so as to enable highlighting.

I ran into a problem in CharTokenizer.incrementToken(), which calls correctOffset() as follows:

    offsetAtt.setOffset(correctOffset(start), correctOffset(start+length));

The issue is that the end offset is computed as the offset of the beginning of the *next* block of text rather than the offset of the end of *this* block of text. In my test case:

bold text regular text

I get tokens like this ([] showing token boundaries): [bold] [text][regular][text

]

instead of:

[bold][text][regular][text]

I don't think this problem can be fixed by jiggling offsets, or indeed by wrapping or extending CharTokenizer in any straightforward way. The fix I found is to change the line in CharTokenizer.incrementToken() to:

    offsetAtt.setOffset(correctOffset(start), correctOffset(start+length-1)+1);

Again, conceptually, this computes the corrected offset of the last character in the token, and then marks the end of the token as the immediately following position, rather than including all the garbage characters in between the end of this token and the beginning of the next.

My impression is that this change should be completely backwards-compatible since its behavior will be identical for CharFilters that don't actually perform character deletion, and AFAICT the only existing CharFilter performs replacements and expansions (of ligatures and the like). But my knowledge of Lucene is far from comprehensive. Does this seem like a reasonable patch?

-Mike

Michael Sokolov
Engineering Director
www.ifactory.com
@iFactoryBoston

PubFactory: the revolutionary e-publishing platform from iFactory
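
The difference between the two end-offset calculations can be seen without Lucene at all. What follows is only a rough sketch under stated assumptions: OffsetSketch, currentSpan() and proposedSpan() are hypothetical names, the input string is an illustrative sample rather than the test case from the message, and the correctOffset() here is a simplified model of a tag-deleting CharFilter (it maps each position in the filtered text to the position of the same character in the original), not Lucene's implementation.

```java
import java.util.Arrays;

/** Simplified stand-in for a tag-deleting CharFilter; hypothetical, not Lucene code. */
final class OffsetSketch {
    final String original;
    final String filtered;
    // origIndex[k] = position in 'original' of the k-th filtered character;
    // the extra last slot maps end-of-filtered-text to end-of-original-text.
    private final int[] origIndex;

    OffsetSketch(String original) {
        this.original = original;
        StringBuilder kept = new StringBuilder();
        int[] map = new int[original.length() + 1];
        boolean inTag = false;
        for (int i = 0; i < original.length(); i++) {
            char c = original.charAt(i);
            if (c == '<') {
                inTag = true;            // start of markup: delete
            } else if (c == '>') {
                inTag = false;           // end of markup: delete
            } else if (!inTag) {
                map[kept.length()] = i;  // remember where this character came from
                kept.append(c);
            }
        }
        map[kept.length()] = original.length();
        this.filtered = kept.toString();
        this.origIndex = Arrays.copyOf(map, kept.length() + 1);
    }

    /** Assumed model of correctOffset(): filtered offset -> original offset. */
    int correctOffset(int off) {
        return origIndex[off];
    }

    /** End offset as in the existing line: corrected offset of start+length. */
    String currentSpan(int start, int length) {
        return original.substring(correctOffset(start), correctOffset(start + length));
    }

    /** End offset as in the proposed line: corrected offset of the last char, plus one. */
    String proposedSpan(int start, int length) {
        return original.substring(correctOffset(start), correctOffset(start + length - 1) + 1);
    }

    public static void main(String[] args) {
        OffsetSketch s = new OffsetSketch("<b>bold text</b> regular");
        // The token "text" begins at offset 5 of the filtered text "bold text regular".
        System.out.println(s.currentSpan(5, 4));   // text</b>
        System.out.println(s.proposedSpan(5, 4));  // text
    }
}
```

In this model the two calculations agree whenever nothing was deleted between the last character of a token and the character after it (the token "bold" above comes out the same either way), which is the backwards-compatibility argument made in the message.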