From: markharw00d
Date: Tue, 10 May 2005 21:58:27 +0100
To: java-dev@lucene.apache.org
Subject: Re: multi-field highlighting
Message-ID: <42812073.9050501@yahoo.co.uk>
In-Reply-To: <42810A5B.6060608@apache.org>

Doug Cutting wrote:
> Shouldn't the search code already take care of that?

No, the search may return documents that happen to contain both "Doug Cutting" and "google". The current highlighter implementation uses all query terms (ignoring any AND, OR, or () operators) and looks for matches. Ideally, "Doug Cutting" shouldn't be highlighted in the document "Doug Cutting loves google" when I searched for ("Doug Cutting" AND lucene) OR google.

This is a nice-to-have, and I suspect it is not an issue people feel strongly about. We could continue to ignore the complexities of representing the results of such boolean logic; most queries don't use it anyway.

> The query should thus be compared to each potential highlight
> fragment. This evaluation is different than the whole-document
> evaluation performed by search. If no fragments match the entire
> query, then fragments should be selected which, considered together,
> match the entire query.

Is this based on the approach (which I think you suggested previously) of chopping the doc into fragment-sized docs held in a RAM directory and then querying it to get the best fragments? I think it would prove difficult to identify the combination of fragments that ultimately satisfied a query containing complex boolean logic.

My original idea for an approach was to let the queries initially generate a "heat map" which scored every token in the document. Any boolean queries which failed to be satisfied completely (e.g. the "Doug Cutting" AND lucene example) would not generate a score for their tokens. Phrase queries would only score the token occurrences in the document where all tokens were grouped. The highlighter would then use the heat map to pick the best "runs" of tokens.
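
The heat-map idea could be sketched roughly as follows. This is a minimal, self-contained Java illustration, not code from the Lucene highlighter: the names HeatMapSketch, Node, term, phrase, and, or, heatMap and bestRunStart are all hypothetical, tokens are compared by plain case-insensitive string matching, and every matching token is assumed to get a flat score of 1.0.

```java
import java.util.Arrays;

class HeatMapSketch {

    /** Hypothetical query node: scores each token position, or returns null if unsatisfied. */
    interface Node {
        double[] score(String[] tokens);
    }

    /** Scores every occurrence of a single term. */
    static Node term(String t) {
        return tokens -> {
            double[] heat = new double[tokens.length];
            boolean hit = false;
            for (int i = 0; i < tokens.length; i++) {
                if (tokens[i].equalsIgnoreCase(t)) { heat[i] = 1.0; hit = true; }
            }
            return hit ? heat : null;
        };
    }

    /** Scores tokens only where all the phrase words appear together, in order. */
    static Node phrase(String... words) {
        return tokens -> {
            double[] heat = new double[tokens.length];
            boolean hit = false;
            for (int i = 0; i + words.length <= tokens.length; i++) {
                boolean match = true;
                for (int j = 0; j < words.length && match; j++) {
                    match = tokens[i + j].equalsIgnoreCase(words[j]);
                }
                if (match) {
                    for (int j = 0; j < words.length; j++) heat[i + j] = 1.0;
                    hit = true;
                }
            }
            return hit ? heat : null;
        };
    }

    /** AND: if any clause is unsatisfied, none of its tokens are scored. */
    static Node and(Node... children) {
        return tokens -> {
            double[] heat = new double[tokens.length];
            for (Node c : children) {
                double[] h = c.score(tokens);
                if (h == null) return null;
                for (int i = 0; i < heat.length; i++) heat[i] += h[i];
            }
            return heat;
        };
    }

    /** OR: sums the scores of whichever clauses are satisfied. */
    static Node or(Node... children) {
        return tokens -> {
            double[] heat = new double[tokens.length];
            boolean hit = false;
            for (Node c : children) {
                double[] h = c.score(tokens);
                if (h == null) continue;
                hit = true;
                for (int i = 0; i < heat.length; i++) heat[i] += h[i];
            }
            return hit ? heat : null;
        };
    }

    /** Heat map for the whole query; all zeros if the query is not satisfied. */
    static double[] heatMap(Node query, String[] tokens) {
        double[] h = query.score(tokens);
        return h == null ? new double[tokens.length] : h;
    }

    /** Start index of the highest-scoring run of `window` tokens (earliest wins ties). */
    static int bestRunStart(double[] heat, int window) {
        int w = Math.min(window, heat.length);
        double sum = 0;
        for (int i = 0; i < w; i++) sum += heat[i];
        double best = sum;
        int bestStart = 0;
        for (int start = 1; start + w <= heat.length; start++) {
            sum += heat[start + w - 1] - heat[start - 1];
            if (sum > best) { best = sum; bestStart = start; }
        }
        return bestStart;
    }

    public static void main(String[] args) {
        String[] doc = {"Doug", "Cutting", "loves", "google"};
        Node query = or(and(phrase("Doug", "Cutting"), term("lucene")), term("google"));
        double[] heat = heatMap(query, doc);
        System.out.println(Arrays.toString(heat)); // prints [0.0, 0.0, 0.0, 1.0]
        System.out.println(bestRunStart(heat, 2)); // prints 2
    }
}
```

With the example query above, the failed AND clause contributes nothing, so only "google" is scored and "Doug Cutting" is left unhighlighted. A real implementation would presumably work from token positions and offsets and weight terms rather than use a flat 1.0.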