Mailing-List: contact lucene-user-help@jakarta.apache.org; run by ezmlm
Precedence: bulk
Reply-To: "Lucene Users List" <lucene-user@jakarta.apache.org>
Message-ID: <20040331132037.55714.qmail@web9501.mail.yahoo.com>
Date: Wed, 31 Mar 2004 14:20:37 +0100 (BST)
From: =?iso-8859-1?q?mark=20harwood?= <markharw00d@yahoo.co.uk>
Subject: RE : Performance of hit highlighting and finding term positions for a
 specific document
To: lucene-user@jakarta.apache.org
MIME-Version: 1.0
Content-Type: multipart/alternative; boundary="0-1633876089-1080739237=:55517"
Content-Transfer-Encoding: 8bit

--0-1633876089-1080739237=:55517
Content-Type: text/plain; charset=iso-8859-1
Content-Transfer-Encoding: 8bit

I intend to release a new version of the highlighter soon that should (hopefully) address some of the issues under discussion.
The re-design will be based on the following principles:
 
* A TokenStream will be passed to the highlighter to provide the source of tokens. The token stream could be provided by an analyzer re-tokenizing the document text or by some future extension to Lucene that is capable of storing token offsets in the index. This change effectively abstracts the highlighter from the index-time or query-time tokenization choices people should be free to make.
 
* The highlighter is already abstracted from the list of terms that need to be highlighted - see TextHighlighter. The only change I plan here is to introduce the notion of a WeightedTerm, which associates a weight with each term to be highlighted so that the weights can influence selection of the best fragments. The QueryHighlightExtractor class will be deprecated and will simply become a tool for extracting a list of terms from a query so that they can be passed to the TextHighlighter class.
 
* I will make the fragmentation logic a pluggable class so that the way the highlighter decides to split documents into fragments can be changed. The current implementation simply splits after "n" tokens. I will introduce a new DocSplitter interface so that alternative implementations can split documents differently, e.g. by recognizing the end of a sentence from the characters ? ! and . I don't plan to provide a sentence splitter at this stage - too much work!
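
To make the three points above a bit more concrete, here is a rough sketch of how the pieces might fit together. It is only an illustration under stated assumptions, not the planned implementation: the fields of WeightedTerm, the isNewFragment method on DocSplitter, and the TokenCountSplitter and FragmentScorer classes are all hypothetical, and a plain List<String> stands in for the TokenStream so the example runs without Lucene on the classpath.

```java
import java.util.HashMap;
import java.util.List;
import java.util.Map;

/** Hypothetical: a term to highlight plus a weight used when ranking fragments. */
final class WeightedTerm {
    final String term;
    final float weight;

    WeightedTerm(String term, float weight) {
        this.term = term;
        this.weight = weight;
    }
}

/** Hypothetical pluggable fragmentation strategy; the method name is an assumption. */
interface DocSplitter {
    /** Returns true if a new fragment should start at the token with this index. */
    boolean isNewFragment(String token, int tokenIndex);
}

/** Mirrors the behaviour described above: start a new fragment after every n tokens. */
final class TokenCountSplitter implements DocSplitter {
    private final int n;

    TokenCountSplitter(int n) {
        if (n <= 0) {
            throw new IllegalArgumentException("n must be positive");
        }
        this.n = n;
    }

    public boolean isNewFragment(String token, int tokenIndex) {
        return tokenIndex > 0 && tokenIndex % n == 0;
    }
}

/** Hypothetical helper: scores each fragment by the summed weights of the terms it contains. */
final class FragmentScorer {
    /** Returns the highest-scoring fragment, with matched terms wrapped in B tags. */
    static String bestFragment(List<String> tokens, List<WeightedTerm> terms, DocSplitter splitter) {
        Map<String, Float> weights = new HashMap<>();
        for (WeightedTerm t : terms) {
            weights.put(t.term, t.weight);
        }

        String best = "";
        float bestScore = -1f;
        StringBuilder current = new StringBuilder();
        float score = 0f;

        for (int i = 0; i < tokens.size(); i++) {
            String token = tokens.get(i);
            if (splitter.isNewFragment(token, i)) {
                // Close the current fragment and keep it if it beats the best so far.
                if (score > bestScore) {
                    best = current.toString();
                    bestScore = score;
                }
                current.setLength(0);
                score = 0f;
            }
            if (current.length() > 0) {
                current.append(' ');
            }
            Float w = weights.get(token);
            if (w != null) {
                score += w;
                current.append("<B>").append(token).append("</B>");
            } else {
                current.append(token);
            }
        }
        // The last fragment is never closed by the splitter, so check it here.
        if (score > bestScore) {
            best = current.toString();
        }
        return best;
    }
}
```

A sentence-based splitter would then just be another DocSplitter implementation, and the scoring code would not need to change. Ties between fragments go to the earliest one in this sketch; that is an arbitrary choice.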
 
Hopefully this provides an open framework for folks to do what they want with the highlighter. Please let me know if you have any comments or suggestions.
 
As for ownership/support, we already had the vote on whether the highlighter should be accepted as part of the Lucene core, and it wasn't. I don't mind either way, but I would like to make the above changes before anyone considers moving it to the sandbox or wherever.
 
I'm going to be away for the next 5 days so there may be a delay in any work on this.
 
Cheers
Mark

		
--0-1633876089-1080739237=:55517--