From: markharw00d@yahoo.co.uk
To: lucene-dev@jakarta.apache.org
Date: Wed, 1 Sep 2004 08:48:11 +0100
Subject: Re: highlighting phrases
Adding support for phrases could be tricky. So far I have deliberately avoided reimplementing specialized highlighting logic for each of the different query types, e.g. understanding the nuances of "slop factor" in phrase queries. I may be wrong, but adding specialized support for different query types just feels like the start of a slippery slope.
If people are keen to add such support though, here are some pointers to bear in mind...
Remember that the highlighter is also designed to summarize docs by selecting the best fragments. One decision to be made up front is whether a special "Fragmenter" implementation is required that uses the query to influence the way it breaks the doc into fragments, i.e. it ensures that matching words in phrase queries or span queries remain in the same fragment. If phrase matches are allowed to span fragments, thought needs to be given to how the fragments are scored.
Do phrases/spans get marked up with one tag, e.g. <b>My Phrase</b>, or many, e.g. <b>My</b> <b>Phrase</b>? I expect "many" is the answer, given the possibility of other query terms appearing intermingled in a phrase with a high slop factor or in a span.
The position of terms in the phrases will need to be known by the Formatter implementation before it attempts to mark up the text. This could/should be done using position info in the Lucene index rather than requiring a separate analyzer pass over the original text.
Most of this should be achievable using specialized implementations of Formatter, Fragmenter and Scorer, so the main Highlighter code should be untouched.
These are just some of the "gotchas" off the top of my head. I'm sure there will be several more issues waiting to be revealed...
Hope this helps anyway.
Cheers
Mark
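To make the Fragmenter pointer above concrete, here is a minimal sketch of the break rule only: start a new fragment once a target size is reached, but defer the break while it would separate the terms of one phrase or span match. Everything in it (the class PhraseAwareFragmenterSketch, the Token and Match types, the fragmentStarts method) is hypothetical and is not part of the Lucene highlighter API; it also assumes the matches have already been located, for example from position info in the index.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical sketch: none of these types belong to the Lucene highlighter API.
final class PhraseAwareFragmenterSketch {

    /** Character offsets of one token in the original text. */
    static final class Token {
        final int start;
        final int end;
        Token(int start, int end) { this.start = start; this.end = end; }
    }

    /** One phrase or span match, as an inclusive range of token indexes. */
    static final class Match {
        final int firstToken;
        final int lastToken;
        Match(int firstToken, int lastToken) {
            this.firstToken = firstToken;
            this.lastToken = lastToken;
        }
    }

    /**
     * Returns the token indexes at which a fragment starts.
     * A break is made once the current fragment covers at least targetSize
     * characters, but it is deferred while the token lies inside a match.
     */
    static List<Integer> fragmentStarts(List<Token> tokens, List<Match> matches, int targetSize) {
        List<Integer> starts = new ArrayList<>();
        if (tokens.isEmpty()) {
            return starts;
        }
        starts.add(0);
        int fragmentStartOffset = tokens.get(0).start;
        for (int i = 1; i < tokens.size(); i++) {
            Token t = tokens.get(i);
            boolean sizeReached = t.start - fragmentStartOffset >= targetSize;
            if (sizeReached && !insideMatch(i, matches)) {
                starts.add(i);
                fragmentStartOffset = t.start;
            }
        }
        return starts;
    }

    /** True if breaking before token i would split the tokens of one match. */
    private static boolean insideMatch(int i, List<Match> matches) {
        for (Match m : matches) {
            if (i > m.firstToken && i <= m.lastToken) {
                return true;
            }
        }
        return false;
    }
}
```

The trade-off is that fragments containing a long, sloppy match can grow well past the target size, which is one reason the scoring of such fragments needs separate thought.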