lucene-java-user mailing list archives

Site index · List index
Message view « Date » · « Thread »
Top « Date » · « Thread »
From mark harwood <>
Subject RE : Performance of hit highlighting and finding term positions for a specific document
Date Wed, 31 Mar 2004 13:20:37 GMT
I intend to release a new version of the highlighter soon that should (hopefully) address some
of the issues under discussion.
The re-design will be based on the following principles:
* A TokenStream will be passed to the highlighter to provide the source of tokens. The token
stream could be provided by an analyzer re-tokenizing the document text or by some future
extension to Lucene that is capable of storing token offsets in the index. This change effectively
abstracts the highlighter from the index-time or query-time tokenization choices people should
be free to make.
* The highlighter already has an abstraction from the list of terms that are needed to be
highlighted - see TextHighlighter. The only change I plan here is to introduce the notion
of a WeightedTerm that associates a weight with each term to be highlighted in order to influence
selection of the best fragments. The QueryHighlightExtractor class will be deprecated and
will simply become a tool for extracting a list of terms from a query so that they can be
passed to the TextHighlighter class.
* I will make the fragmentation logic a pluggable class that can change the way the highlighter
decides to split documents into fragments. The current implementation simply splits after
"n" tokens. I will introduce a new DocSplitter interface to allow alternative implementations
to split documents up, eg based on recognizing end of sentences by the ?!. characters. I dont
plan to provide a sentence splitter at this stage - too much work!
Hopefully this provides an open framework for folks to do what they want with the highlighter.
Please let me have any comments if you have any suggestions.
As for ownership/support, we already had the vote on whether highlighter is accepted as part
of Lucene core or not and it wasn't. I dont mind either way but I would like to make the above
changes before anyone considers moving it to the sandbox or wherever.
I'm going to be away for the next 5 days so there may be a delay in any work on this.

 WIN FREE WORLDWIDE FLIGHTS - nominate a cafe in the Yahoo! Mail Internet Cafe Awards
  • Unnamed multipart/alternative (inline, 8-Bit, 0 bytes)
View raw message