lucene-java-user mailing list archives

From Michael Sokolov <>
Subject Re: offsets
Date Wed, 01 Aug 2018 15:57:59 GMT
Given that character transformations do happen in TokenFilters, shouldn't
we strive to have an API that supports correct offsets (i.e. highlighting)
for any combination of token filters? Currently we can't do that. For
example, WordDelimiterGraphFilter, decompounding filters and the like
cannot assign offsets correctly, so it becomes impossible to highlight
exactly the text that corresponds to the user query.

Just one example: if I have URLs in some document text, and the analysis
chain is a whitespace tokenizer followed by WordDelimiterGraphFilter, then a
query for "http" will end up highlighting the entire URL.
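
To make the URL case concrete, here is a toy model in plain Java. It does not use Lucene at all: Token, whitespaceTokens, splitParts and highlight are hypothetical names invented for this sketch, and the splitting rule (break on non-alphanumerics) is only an assumption standing in for a word-delimiter style filter. It contrasts parts that inherit the whole token's offsets with parts that get their own.

```java
import java.util.ArrayList;
import java.util.List;

/** Toy model only; none of these types are Lucene classes. */
final class SplitOffsetsSketch {

    /** Hypothetical token: term text plus [start, end) offsets into the original text. */
    static final class Token {
        final String term;
        final int start;
        final int end;

        Token(String term, int start, int end) {
            this.term = term;
            this.start = start;
            this.end = end;
        }
    }

    /** Splits on whitespace; offsets point directly into the original text. */
    static List<Token> whitespaceTokens(String text) {
        List<Token> out = new ArrayList<>();
        int i = 0;
        while (i < text.length()) {
            while (i < text.length() && Character.isWhitespace(text.charAt(i))) i++;
            int start = i;
            while (i < text.length() && !Character.isWhitespace(text.charAt(i))) i++;
            if (i > start) out.add(new Token(text.substring(start, i), start, i));
        }
        return out;
    }

    /**
     * Splits a token on non-alphanumeric characters. With preciseOffsets == false,
     * every part inherits the parent's offsets (the situation described above).
     */
    static List<Token> splitParts(Token parent, boolean preciseOffsets) {
        List<Token> out = new ArrayList<>();
        String term = parent.term;
        int i = 0;
        while (i < term.length()) {
            while (i < term.length() && !Character.isLetterOrDigit(term.charAt(i))) i++;
            int start = i;
            while (i < term.length() && Character.isLetterOrDigit(term.charAt(i))) i++;
            if (i > start) {
                String part = term.substring(start, i);
                out.add(preciseOffsets
                        ? new Token(part, parent.start + start, parent.start + i)
                        : new Token(part, parent.start, parent.end));
            }
        }
        return out;
    }

    /** Returns the span of original text that the first matching part points at, or null. */
    static String highlight(String text, String query, boolean preciseOffsets) {
        for (Token token : whitespaceTokens(text)) {
            for (Token part : splitParts(token, preciseOffsets)) {
                if (part.term.equals(query)) return text.substring(part.start, part.end);
            }
        }
        return null;
    }

    public static void main(String[] args) {
        String text = "see http://example.com/docs now";
        System.out.println(highlight(text, "http", false)); // inherited offsets
        System.out.println(highlight(text, "http", true));  // per-part offsets
    }
}
```

The per-part variant is only safe in this sketch because nothing upstream has changed the term text, so positions inside the term still line up with the original characters. Once an earlier filter inserts or removes characters, that arithmetic no longer holds, which is the gap being discussed.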

Do you have an idea how we can address this without making our APIs crazy?
Or are you just saying we should live with it as it is?


On Tue, Jul 31, 2018 at 6:36 AM Robert Muir <> wrote:

> The problem is not a performance one, it's a complexity thing. Really I
> think only the tokenizer should be messing with the offsets...
> They are the ones actually parsing the original content so it makes
> sense they would produce the pointers back to them.
> I know there are some tokenfilters out there trying to be tokenizers,
> but we don't need to make our apis crazy to support that.
> On Mon, Jul 30, 2018 at 11:53 PM, Michael Sokolov <> wrote:
> > Yes, in fact Tokenizer already provides correctOffset which just delegates
> > to CharFilter. We could expand on this, moving correctOffset up to
> > TokenStream, and also adding correct() so that TokenFilters can add to the
> > character offset data structure (two int arrays) and share it across the
> > analysis chain.
> >
> > Implementation-wise this could continue to delegate to CharFilter I guess,
> > but I think it would be better to add a character-offset-map abstraction
> > that wraps the two int arrays and provides the correct/correctOffset
> > methods to both TokenStream and CharFilter.
> >
> > This would let us preserve correct offsets in the face of manipulations
> > like replacing ellipses, ligatures (like AE, OE), trademark symbols
> > (replaced by "tm") and the like so that we can have the invariant that
> > correctOffset(OffsetAttribute.startOffset) + CharTermAttribute.length() ==
> > correctOffset(OffsetAttribute.endOffset), roughly speaking, and enable
> > token-splitting with correct offsets.
> >
> > I can work up a proof of concept; I don't think it would be too
> > API-intrusive or change performance in a significant way. Only
> > TokenFilters that actually care about this (i.e. that insert or remove
> > characters, or split tokens) would need to change; others would continue to
> > work as-is.
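
For readers following the quoted proposal, a minimal sketch of what a character-offset-map abstraction over two int arrays might look like. OffsetMapSketch and its add/correct methods are hypothetical names chosen for this illustration, not an existing Lucene API, and the lookup rule (apply the cumulative difference recorded at the last offset at or before the one being corrected) is an assumption about how such a map would behave.

```java
import java.util.Arrays;

/** Hypothetical offset map; an illustration, not a Lucene class. */
final class OffsetMapSketch {
    // Parallel arrays: from transformed offset offsets[i] onward,
    // original offset = transformed offset + diffs[i].
    private int[] offsets = new int[4];
    private int[] diffs = new int[4];
    private int size = 0;

    /** Records a correction; offsets must be added in strictly increasing order. */
    void add(int offset, int cumulativeDiff) {
        if (size > 0 && offset <= offsets[size - 1]) {
            throw new IllegalArgumentException("offsets must be strictly increasing");
        }
        if (size == offsets.length) {
            offsets = Arrays.copyOf(offsets, size * 2);
            diffs = Arrays.copyOf(diffs, size * 2);
        }
        offsets[size] = offset;
        diffs[size] = cumulativeDiff;
        size++;
    }

    /** Maps an offset in the transformed text back to the original text. */
    int correct(int offset) {
        int idx = Arrays.binarySearch(offsets, 0, size, offset);
        if (idx < 0) {
            idx = -idx - 2; // last recorded offset below 'offset', or -1 if none
        }
        return idx < 0 ? offset : offset + diffs[idx];
    }

    public static void main(String[] args) {
        // Original "a™b" becomes "atmb": the text is one char longer from offset 2 on.
        OffsetMapSketch map = new OffsetMapSketch();
        map.add(2, -1);
        // Token "b" sits at [3, 4) in the transformed text.
        System.out.println(map.correct(3) + ".." + map.correct(4));
    }
}
```

In this sketch a filter that replaces the trademark symbol with "tm" would call add(2, -1) once, and any later stage could call correct on its start and end offsets. Sharing one such map across a whole analysis chain, as proposed above, would additionally need a rule for composing corrections made by successive stages, which this sketch does not attempt.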