lucene-dev mailing list archives

From "Mark Harwood (JIRA)" <>
Subject [jira] Commented: (LUCENE-627) highlighter problems with overlapping tokens
Date Fri, 14 Jul 2006 06:54:31 GMT

Mark Harwood commented on LUCENE-627:

>>It seems like maybe the only way to handle some of this stuff is two passes

The highlighter does not expect token positions to "rewind" in this manner. I'm not sure where
this ends. Imagine an analyzer which, having considered and emitted tokens for a whole document,
chooses to append some tokens whose offsets reference much earlier sections of the document.
(Why, I'm not sure, but there's nothing to say this couldn't happen.)

>>It seems like maybe the only way to handle some of this stuff is two passes

Maybe a special "OrderFixer" TokenStream could be used to wrap "rewinding" token streams
such as yours, accumulating all tokens in a buffer and then sorting and outputting
them in ascending start offset order. If the Highlighter ignored position increments and just
used offsets (as it does currently), I suspect all would be OK.
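
A rough sketch of the buffering idea, to make it concrete. This is not Lucene code: the Tok class is a hypothetical stand-in for a token carrying text plus start/end offsets, and a plain Iterator stands in for the wrapped TokenStream. The OrderFixer name comes from the suggestion above; everything else is an assumption for illustration.

```java
import java.util.ArrayList;
import java.util.Comparator;
import java.util.Iterator;
import java.util.List;

/** Hypothetical sketch: buffers a "rewinding" token source and replays it by start offset. */
class OrderFixer implements Iterator<OrderFixer.Tok> {

    /** Minimal stand-in for an analyzer token (not the Lucene Token class). */
    static final class Tok {
        final String text;
        final int startOffset;
        final int endOffset;

        Tok(String text, int startOffset, int endOffset) {
            this.text = text;
            this.startOffset = startOffset;
            this.endOffset = endOffset;
        }
    }

    private final Iterator<Tok> input;
    private Iterator<Tok> sorted; // null until the first read drains the input

    OrderFixer(Iterator<Tok> input) {
        this.input = input;
    }

    /** First pass: drain the wrapped source into a buffer and sort it once. */
    private void fill() {
        if (sorted != null) {
            return;
        }
        List<Tok> buffer = new ArrayList<>();
        while (input.hasNext()) {
            buffer.add(input.next());
        }
        // List.sort is stable, so tokens sharing a start offset keep their emitted order.
        buffer.sort(Comparator.comparingInt((Tok t) -> t.startOffset));
        sorted = buffer.iterator();
    }

    /** Second pass: replay the buffered tokens in ascending start-offset order. */
    @Override
    public boolean hasNext() {
        fill();
        return sorted.hasNext();
    }

    @Override
    public Tok next() {
        fill();
        return sorted.next();
    }
}
```

For the iPod case, assuming the analyzer emits "i" at offsets 0-1, "pod" at 1-4 and then "ipod" at 0-4, the wrapper would hand the tokens back as "i", "ipod", "pod". The cost is that the whole token stream is held in memory before anything is output.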

> highlighter problems with overlapping tokens
> --------------------------------------------
>          Key: LUCENE-627
>          URL:
>      Project: Lucene - Java
>         Type: Bug

>   Components: Other
>     Versions: 2.0.1
>     Reporter: Yonik Seeley

> The lucene highlighter has problems when tokens that overlap are generated.
> For example, if analysis of iPod generates the tokens "i", "pod", "ipod" (with pod and ipod in the same position),
> then the highlighter will output this as iipod, regardless of if any of those tokens are highlighted.
> Discovered via

This message is automatically generated by JIRA.
