From: mark harwood
To: lucene-user@jakarta.apache.org
Subject: RE : Performance of hit highlighting and finding term positions for a specific document
Date: Wed, 31 Mar 2004 14:20:37 +0100 (BST)
Message-ID: <20040331132037.55714.qmail@web9501.mail.yahoo.com>

I intend to release a new version of the highlighter soon that should (hopefully) address some of the issues under discussion. The re-design will be based on the following principles:

* A TokenStream will be passed to the highlighter to provide the source of tokens.
  The token stream could be provided by an analyzer re-tokenizing the document text, or by some future extension to Lucene that is capable of storing token offsets in the index. This change effectively decouples the highlighter from the index-time or query-time tokenization choices people should be free to make.

* The highlighter already has an abstraction for the list of terms that need to be highlighted - see TextHighlighter. The only change I plan here is to introduce the notion of a WeightedTerm, which associates a weight with each term to be highlighted in order to influence selection of the best fragments. The QueryHighlightExtractor class will be deprecated and will simply become a tool for extracting a list of terms from a query so that they can be passed to the TextHighlighter class.

* I will make the fragmentation logic a pluggable class that can change the way the highlighter decides to split documents into fragments. The current implementation simply splits after "n" tokens. I will introduce a new DocSplitter interface to allow alternative implementations to split documents up, e.g. based on recognizing the end of sentences by the ?!. characters. I don't plan to provide a sentence splitter at this stage - too much work!

Hopefully this provides an open framework for folks to do what they want with the highlighter. Please let me know if you have any comments or suggestions.

As for ownership/support, we already had the vote on whether the highlighter should be accepted as part of Lucene core, and it wasn't. I don't mind either way, but I would like to make the above changes before anyone considers moving it to the sandbox or wherever.

I'm going to be away for the next 5 days so there may be a delay in any work on this.

Cheers
Mark
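
For illustration only, here is a rough sketch of how the pluggable splitting and the weighted terms described in this message might fit together. Only the names DocSplitter and WeightedTerm come from the message; the method signatures and the FixedSizeSplitter, SentenceEndSplitter and FragmentSketch classes are hypothetical, and plain whitespace splitting stands in for a real Lucene TokenStream.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Locale;

// Hypothetical signature: the message names DocSplitter but does not define it.
interface DocSplitter {
    // True if a new fragment should begin before the token at tokenIndex.
    boolean startsNewFragment(String previousToken, int tokenIndex);
}

// Mirrors the behaviour described as current: split after "n" tokens.
final class FixedSizeSplitter implements DocSplitter {
    private final int tokensPerFragment;

    FixedSizeSplitter(int tokensPerFragment) {
        if (tokensPerFragment <= 0) {
            throw new IllegalArgumentException("tokensPerFragment must be positive");
        }
        this.tokensPerFragment = tokensPerFragment;
    }

    public boolean startsNewFragment(String previousToken, int tokenIndex) {
        return tokenIndex > 0 && tokenIndex % tokensPerFragment == 0;
    }
}

// Naive alternative: split when the previous token ends in '?', '!' or '.'.
final class SentenceEndSplitter implements DocSplitter {
    public boolean startsNewFragment(String previousToken, int tokenIndex) {
        if (previousToken == null || previousToken.isEmpty()) {
            return false;
        }
        char last = previousToken.charAt(previousToken.length() - 1);
        return last == '?' || last == '!' || last == '.';
    }
}

// A term plus a weight used to rank fragments (fields are an assumption).
final class WeightedTerm {
    final String term;
    final float weight;

    WeightedTerm(String term, float weight) {
        this.term = term;
        this.weight = weight;
    }
}

final class FragmentSketch {
    // Splits text into fragments; whitespace tokens stand in for a TokenStream.
    static List<String> fragments(String text, DocSplitter splitter) {
        List<String> result = new ArrayList<>();
        StringBuilder current = new StringBuilder();
        String previous = null;
        int index = 0;
        for (String token : text.trim().split("\\s+")) {
            if (token.isEmpty()) {
                continue;
            }
            if (previous != null && splitter.startsNewFragment(previous, index)) {
                result.add(current.toString());
                current.setLength(0);
            }
            if (current.length() > 0) {
                current.append(' ');
            }
            current.append(token);
            previous = token;
            index++;
        }
        if (current.length() > 0) {
            result.add(current.toString());
        }
        return result;
    }

    // Sums the weights of all terms that match a token in the fragment.
    static float score(String fragment, List<WeightedTerm> terms) {
        float total = 0f;
        for (String token : fragment.split("\\s+")) {
            String normalised = token.toLowerCase(Locale.ROOT)
                    .replaceAll("[^\\p{L}\\p{N}]", "");
            for (WeightedTerm wt : terms) {
                if (wt.term.equals(normalised)) {
                    total += wt.weight;
                }
            }
        }
        return total;
    }
}
```

With this shape, swapping FixedSizeSplitter for SentenceEndSplitter changes how the text is cut up without touching the scoring code, which is the kind of separation the DocSplitter proposal is aiming for.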