lucene-java-user mailing list archives

From Michael Sokolov <soko...@ifactory.com>
Subject Re: proposed change to CharTokenizer
Date Sun, 17 Oct 2010 18:30:43 GMT
  OK - no responses to this, but in case you were curious...the patch I 
suggested won't work - so please don't install it :)

In the end I was able to get the behavior I wanted by fiddling with 
offsets in my CharFilter, but it requires detecting token boundaries in 
the CharFilter stage, which seems like abstraction leakage to me.  Maybe 
there's a better way?

-Mike

On 10/14/2010 12:08 PM, Mike Sokolov wrote:
> Background: I've been trying to enable hit highlighting of XML 
> documents in such a way that the highlighting preserves the 
> well-formedness of the XML.
>
> I thought I could get this to work by implementing a CharFilter that 
> extracts text from XML (somewhat like HTMLStripCharFilter, except I am 
> using an XML parser - however I think the concept is also applicable 
> to HTMLStripCharFilter) while preserving the offsets of the text in 
> the original XML document so as to enable highlighting.
>
> I ran into a problem in CharTokenizer.incrementToken(), which calls 
> correctOffset() as follows:
>
>     offsetAtt.setOffset(correctOffset(start), correctOffset(start+length));
>
> The issue is that the end offset is computed as the offset of the 
> beginning of the *next* block of text rather than the offset of the 
> end of *this* block of text.
>
> In my test case:
>
> <p><b>bold text</b> regular text</p>
>
> I get tokens like this ([] showing token boundaries):
>
>                [bold] [text</b>][regular][text</p>]
>
> instead of:
>
>                [bold][text][regular][text]
>
> I don't think this problem can be fixed by jiggling offsets, or indeed 
> by wrapping or extending CharTokenizer in any straightforward way.  
> The fix I found is to change the line in 
> CharTokenizer.incrementToken() to:
>
>     offsetAtt.setOffset(correctOffset(start), correctOffset(start+length-1)+1);
>
> Again, conceptually, this computes the corrected offset of the last 
> character in the token, and then marks the end of the token as the 
> immediately following position, rather than including all the garbage 
> characters in between the end of this token and the beginning of the 
> next.
>
> My impression is that this change should be completely 
> backwards-compatible since its behavior will be identical for 
> CharFilters that don't actually perform character deletion, and AFAICT 
> the only existing CharFilter performs replacements and expansions (of 
> ligatures and the like).  But my knowledge of Lucene is far from 
> comprehensive.
> Does this seem like a reasonable patch?
>
> -Mike
>
> Michael Sokolov
> Engineering Director
> www.ifactory.com
> @iFactoryBoston
>
> PubFactory: the revolutionary e-publishing platform from iFactory
>
>
> ---------------------------------------------------------------------
> To unsubscribe, e-mail: java-user-unsubscribe@lucene.apache.org
> For additional commands, e-mail: java-user-help@lucene.apache.org
>

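For readers following the offset arithmetic in the quoted message, here is a minimal, self-contained sketch of the two end-offset computations applied to the test case above. It is a toy model and not Lucene code: the class OffsetSketch, the nested Stripped class and the highlight method are hypothetical names invented for illustration, and the tag stripping and offset map are simplified assumptions about how a deleting CharFilter might correct offsets. Note that the follow-up at the top of this message reports that the proposed patch did not work in practice, so this only illustrates the intent of the change, not a usable fix.

```java
import java.util.ArrayList;
import java.util.List;

public class OffsetSketch {

    // Hypothetical stand-in for a tag-stripping CharFilter: holds the filtered
    // text plus, for every filtered position, its offset in the original input.
    static final class Stripped {
        final String text;
        final int[] origOffset; // text.length() + 1 entries; the last maps end-of-text

        Stripped(String original) {
            StringBuilder sb = new StringBuilder();
            List<Integer> map = new ArrayList<>();
            boolean inTag = false;
            for (int i = 0; i < original.length(); i++) {
                char c = original.charAt(i);
                if (c == '<') {
                    inTag = true;
                } else if (c == '>') {
                    inTag = false;
                } else if (!inTag) {
                    sb.append(c);
                    map.add(i);
                }
            }
            // Assumption: the end of the filtered text corrects to the end of the input.
            map.add(original.length());
            text = sb.toString();
            origOffset = new int[map.size()];
            for (int i = 0; i < origOffset.length; i++) {
                origOffset[i] = map.get(i);
            }
        }

        // Simplified analogue of correctOffset(): filtered offset -> original offset.
        int correctOffset(int off) {
            return origOffset[off];
        }
    }

    // Splits the filtered text into runs of letters and returns the slice of the
    // ORIGINAL input that each token's corrected offsets would cover.
    static List<String> highlight(String original, boolean patched) {
        Stripped s = new Stripped(original);
        List<String> out = new ArrayList<>();
        int i = 0;
        int n = s.text.length();
        while (i < n) {
            while (i < n && !Character.isLetter(s.text.charAt(i))) i++;
            int start = i;
            while (i < n && Character.isLetter(s.text.charAt(i))) i++;
            int length = i - start;
            if (length == 0) break;
            int begin = s.correctOffset(start);
            int end = patched
                    ? s.correctOffset(start + length - 1) + 1 // proposed: last char, plus one
                    : s.correctOffset(start + length);        // existing: start of next block
            out.add(original.substring(begin, end));
        }
        return out;
    }

    public static void main(String[] args) {
        String xml = "<p><b>bold text</b> regular text</p>";
        System.out.println(highlight(xml, false));
        System.out.println(highlight(xml, true));
    }
}
```

In this model the unpatched computation pulls the deleted markup that follows a token into its span, because the end offset is corrected from the position of the next surviving character. The patched computation corrects the offset of the token's last character and adds one, so the span stops at the token itself.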
