lucene-java-user mailing list archives

From "Rasik Pandey" <>
Subject RE : amusing interaction between advanced tokenizers and highlighter
Date Sun, 20 Jun 2004 14:43:46 GMT

> A question before I dive into coding a fix: can I assume (for
> all analyzers) that the tokens produced by the tokenStream
> have the following property:
>    currentToken.startOffset() >= lastToken.startOffset()
> The analyzers I have tested the highlighter with so far have
> the property:
>    currentToken.startOffset() > lastToken.endOffset()
> so aren't overlapping but I understand this isn't the case for
> others (all demonstrable examples of such "problem" analyzers
> would be appreciated for testing purposes).

There is such an analyzer here.

> If I can assume that tokenstreams always produce a zero or more
> increment in token.startOffset I think I can
> design a solution that still works using a single pass of the
> token stream.
> I suspect an additional "flushText" method will be required on
> the Formatter interface to allow implementations
> to use a buffer. This buffer would be required to accumulate
> overlapping token scores when trying to decide if a
> section of the original text merited any highlight markup.

I am not familiar with your most recent highlighter package, but I have implemented this myself
in some older, rudimentary highlighting code that uses a Vector to collect all tokens that share
the same start offset. Highlighting of the tokens accumulated in the Vector is triggered as soon
as currentToken.startOffset() > lastToken.startOffset() holds; the Vector is then cleared and
tracking begins for the new token position. Also make sure that the same span of input text
isn't highlighted more than once when several output tokens map to it.
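
Roughly, the idea looks like the sketch below. This is not my original code and it does not use
the Lucene or highlighter APIs: Tok is a hypothetical stand-in for a token (term text plus start
and end offsets), the <b> markup and all method names are made up for illustration, and an
ArrayList takes the place of the Vector. It assumes start offsets never decrease.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.HashSet;
import java.util.List;
import java.util.Set;

class OverlapHighlighter {
    /** Hypothetical stand-in for an analyzer token: term text plus character offsets. */
    static final class Tok {
        final String term;
        final int start;
        final int end;
        Tok(String term, int start, int end) {
            this.term = term;
            this.start = start;
            this.end = end;
        }
    }

    /** Single pass over the tokens; assumes start offsets never decrease. */
    static String highlight(String text, List<Tok> tokens, Set<String> queryTerms) {
        StringBuilder out = new StringBuilder();
        List<Tok> group = new ArrayList<>(); // tokens sharing one start offset
        int written = 0;                     // end of the source text already copied to out
        for (Tok t : tokens) {
            // A larger start offset closes the current group.
            if (!group.isEmpty() && t.start > group.get(0).start) {
                written = flush(text, group, queryTerms, out, written);
                group.clear();
            }
            group.add(t);
        }
        if (!group.isEmpty()) {
            written = flush(text, group, queryTerms, out, written);
        }
        out.append(text, written, text.length()); // trailing text after the last token
        return out.toString();
    }

    /** Emits the text covered by one group and returns the new "written" position. */
    private static int flush(String text, List<Tok> group, Set<String> queryTerms,
                             StringBuilder out, int written) {
        int start = group.get(0).start;
        int end = start;
        boolean hit = false;
        for (Tok t : group) {
            end = Math.max(end, t.end);
            hit |= queryTerms.contains(t.term);
        }
        start = Math.max(start, written); // never re-emit text that is already written
        if (end <= start) {
            return written;               // span fully covered by an earlier group
        }
        out.append(text, written, start); // gap between the previous span and this one
        if (hit) {
            out.append("<b>").append(text, start, end).append("</b>");
        } else {
            out.append(text, start, end);
        }
        return end;
    }

    public static void main(String[] args) {
        List<Tok> tokens = Arrays.asList(
                new Tok("the", 0, 3),
                new Tok("wifi", 4, 9), // two tokens with the same start offset
                new Tok("wi", 4, 6),
                new Tok("fi", 7, 9),   // overlaps the "wifi" span
                new Tok("network", 10, 17));
        Set<String> query = new HashSet<>(Arrays.asList("wi"));
        // prints: the <b>wi-fi</b> network
        System.out.println(highlight("the wi-fi network", tokens, query));
    }
}
```

The "written" position is what keeps the same input text from being output twice. One limitation
of this sketch: a match in a later token that lies entirely inside an already emitted span (a
query for "fi" in the example) is not highlighted; handling that would need the kind of score
buffer you describe.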