lucene-java-user mailing list archives

Site index · List index
Message view « Date » · « Thread »
Top « Date » · « Thread »
From Michael McCandless <>
Subject Re: offsets
Date Sun, 29 Jul 2018 19:13:10 GMT
How would a fixup API work?  We would try to provide correctOffset
throughout the full analysis chain?

Mike McCandless

On Wed, Jul 25, 2018 at 8:27 AM, Michael Sokolov <> wrote:

> I've run into some difficulties with offsets in some TokenFilters I've been
> writing, and I wonder if anyone can shed any light. Because characters may
> be inserted or removed by prior filters (eg ICUFoldingFilter does this with
> ellipses), and there is no offset-correcting data structure available to
> TokenFilters (as there is in CharFilter), there doesn't seem to be any
> reliable way to calculate the offset at a point interior to a token, which
> means that essentially the only reasonable thing to do with OffsetAttribute
> is to preserve the offsets from the input. This is means that filters that
> split their tokens (like WordDelimiterGraphFilter) have no reliable way of
> mapping their split tokens' offsets. One can try, but it seems inevitably
> to require making some arbitrary "fixup" stage in order to guarantee that
> the offsets are nondecreasing and properly bounded by the original text
> length.
> If this analysis is correct, it seems one should really never call
> OffsetAttribute.setOffset at all? Which makes it seem like a trappy kind of
> method to provide. (hmm now I see this comment in OffsetAttributeImpl
> suggesting making the method call-once). If that really is the case, I
> think some assertion, deprecation, or other API protection would be useful
> so the policy is clear.
> Alternatively, do we want to consider providing a "fixup" API as we have
> for CharFilter? OffsetAttribute, eg, could do the fixup if we provide an
> API for setting offset deltas. This would make more precise highlighting
> possible in these cases, at least. I'm not sure what other use cases folks
> have come up with for offsets?
> -Mike

  • Unnamed multipart/alternative (inline, None, 0 bytes)
View raw message