lucene-java-user mailing list archives

From markharw...@yahoo.co.uk
Subject Re: amusing interaction between advanced tokenizers and highlighter
Date Sat, 19 Jun 2004 18:45:31 GMT
A question before I dive into coding a fix: can I assume that, for all analyzers, the tokens
produced by the tokenStream have the following property:
   currentToken.startOffset() >= lastToken.startOffset()

The analyzers I have tested the highlighter with so far have the property:
   currentToken.startOffset() > lastToken.endOffset()
so their tokens never overlap, but I understand this isn't the case for others (any demonstrable
examples of such "problem" analyzers would be appreciated for testing purposes).
If I can assume that token streams always produce a zero-or-greater increment in
token.startOffset, I think I can design a solution that still works using a single pass of the
token stream.
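
To make the two properties above concrete, here is a minimal sketch of how a recorded token
sequence could be checked against each of them. It does not use the Lucene API: Tok is a
hypothetical stand-in that carries only a token's text and offsets, and the "football" tokens
are an invented example of what a compound-splitting analyzer might emit.

```java
import java.util.Arrays;
import java.util.List;

class OffsetOrderCheck {
    /** Hypothetical stand-in for a token; only the text and offsets are modelled. */
    static final class Tok {
        final String text;
        final int start;
        final int end;

        Tok(String text, int start, int end) {
            this.text = text;
            this.start = start;
            this.end = end;
        }
    }

    /** Weaker property: current.start >= last.start for every adjacent pair. */
    static boolean startsNeverDecrease(List<Tok> tokens) {
        int lastStart = Integer.MIN_VALUE;
        for (Tok t : tokens) {
            if (t.start < lastStart) {
                return false;
            }
            lastStart = t.start;
        }
        return true;
    }

    /** Stricter property: current.start > last.end for every adjacent pair. */
    static boolean neverOverlap(List<Tok> tokens) {
        Tok last = null;
        for (Tok t : tokens) {
            if (last != null && t.start <= last.end) {
                return false;
            }
            last = t;
        }
        return true;
    }

    public static void main(String[] args) {
        // Invented output of a compound-splitting analyzer for the text "football".
        List<Tok> compound = Arrays.asList(
                new Tok("football", 0, 8),
                new Tok("foot", 0, 4),
                new Tok("ball", 4, 8));
        System.out.println(startsNeverDecrease(compound)); // true
        System.out.println(neverOverlap(compound));        // false
    }
}
```

A stream like this one satisfies the weaker ordering property but not the stricter
non-overlapping one, which is the case the single-pass design would need to handle.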
I suspect an additional "flushText" method will be required on the Formatter interface to
allow implementations to use a buffer. This buffer would be needed to accumulate the scores of
overlapping tokens when deciding whether a section of the original text merits any highlight
markup.
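
As a rough illustration of the buffering idea, the sketch below merges overlapping tokens into
one span in a single pass, sums their scores, and only decides on markup when the span is
flushed. It is not the highlighter's actual code and does not use the real Formatter interface:
ScoredTok, Span, mergeOverlaps and highlight are hypothetical names, end offsets are assumed to
be exclusive, the <B> tag is an arbitrary choice, and the approach relies on start offsets never
decreasing.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

class OverlapBuffer {
    /** Hypothetical token: offsets into the original text plus a query score. */
    static final class ScoredTok {
        final int start;
        final int end;
        final float score;

        ScoredTok(int start, int end, float score) {
            this.start = start;
            this.end = end;
            this.score = score;
        }
    }

    /** A run of overlapping tokens with their accumulated score. */
    static final class Span {
        final int start;
        final int end;
        final float score;

        Span(int start, int end, float score) {
            this.start = start;
            this.end = end;
            this.score = score;
        }
    }

    /** Single pass; assumes token start offsets never decrease. */
    static List<Span> mergeOverlaps(List<ScoredTok> tokens) {
        List<Span> out = new ArrayList<>();
        int start = -1;
        int end = -1;
        float score = 0f;
        for (ScoredTok t : tokens) {
            if (start >= 0 && t.start < end) {
                // Token overlaps the buffered span: extend it and add the score.
                end = Math.max(end, t.end);
                score += t.score;
            } else {
                // No overlap: flush the buffered span (if any) and start a new one.
                if (start >= 0) {
                    out.add(new Span(start, end, score));
                }
                start = t.start;
                end = t.end;
                score = t.score;
            }
        }
        if (start >= 0) {
            out.add(new Span(start, end, score)); // final flush
        }
        return out;
    }

    /** Wraps every span whose accumulated score is positive in <B> tags. */
    static String highlight(String text, List<ScoredTok> tokens) {
        StringBuilder sb = new StringBuilder();
        int pos = 0;
        for (Span s : mergeOverlaps(tokens)) {
            sb.append(text, pos, s.start); // untokenized text between spans
            String chunk = text.substring(s.start, s.end);
            sb.append(s.score > 0 ? "<B>" + chunk + "</B>" : chunk);
            pos = s.end;
        }
        sb.append(text.substring(pos));
        return sb.toString();
    }

    public static void main(String[] args) {
        // "football" is also emitted as "foot" and "ball"; only "foot" matches the query.
        List<ScoredTok> tokens = Arrays.asList(
                new ScoredTok(0, 4, 0f),   // play
                new ScoredTok(5, 13, 0f),  // football
                new ScoredTok(5, 9, 1f),   // foot
                new ScoredTok(9, 13, 0f),  // ball
                new ScoredTok(14, 17, 0f)); // now
        System.out.println(highlight("play football now", tokens));
        // prints: play <B>football</B> now
    }
}
```

Highlighting the whole merged span is only one possible policy; an implementation could just as
well mark up only the sub-range covered by the scoring tokens.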

Cheers
Mark

