Message-ID: <42812073.9050501@yahoo.co.uk>
Date: Tue, 10 May 2005 21:58:27 +0100
From: markharw00d <markharw00d@yahoo.co.uk>
User-Agent: Mozilla Thunderbird 0.8 (Windows/20040913)
MIME-Version: 1.0
To: java-dev@lucene.apache.org
Subject: Re: multi-field highlighting
References: <427BBFB9.1010405@apache.org> <427BCCD2.1080202@yahoo.co.uk>
 <42810A5B.6060608@apache.org>
In-Reply-To: <42810A5B.6060608@apache.org>
Content-Type: text/plain; charset=ISO-8859-1; format=flowed
Content-Transfer-Encoding: 7bit

Doug Cutting wrote:

> Shouldn't the search code already take care of that?  

No, the search may return documents that happen to contain both "Doug 
Cutting" and "google". The current highlighter implementation collects 
all query terms (ignoring any AND/OR operators) and highlights every 
match. Ideally "Doug Cutting" shouldn't be highlighted in the document 
"Doug Cutting loves google" when I searched for 
("Doug Cutting" AND lucene) OR google.

This is a nice-to-have and I suspect this is not an issue people feel 
strongly about. We could continue to ignore the complexities of 
representing the results of such boolean logic - most queries don't use 
it anyway.

> The query should thus be compared to each potential highlight 
> fragment.  This evaluation is different than the whole-document 
> evaluation performed by search.  If no fragments match the entire 
> query, then fragments should be selected which, considered together, 
> match the entire query.

Is this based on the approach, which I think you suggested previously, 
of chopping the doc into fragment-sized docs held in a RAM directory and 
then querying them to get the best fragments? I think it would prove 
difficult to identify the combination of fragments that together 
satisfy a query containing complex boolean logic.

My original idea for an approach was to let the queries initially 
generate a "heat map" which scored every token in the document. Any 
boolean query which failed to be satisfied completely (e.g. the 
"Doug Cutting" AND lucene example) would not generate a score for its 
tokens. Phrase queries would only score the token occurrences in the 
document where all tokens were grouped together.
The highlighter would then use the heat map to pick the best "runs" of 
tokens.
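
To make the heat map idea concrete, here is a rough, self-contained 
sketch. It does not use any Lucene classes and is not the existing 
highlighter code: the names HeatMapSketch, Node, term, phrase, and, or, 
heatMap and bestRun are all made up for illustration, tokenizing is a 
plain whitespace split, and every hit simply scores 1. The point is 
only that an AND clause scores into a scratch array that is thrown away 
unless every sub-clause matched.

```java
import java.util.Arrays;

public class HeatMapSketch {

    /** A query node adds heat to the tokens it matched and reports whether it was satisfied. */
    public interface Node {
        boolean score(String[] tokens, float[] heat);
    }

    /** Scores every occurrence of a single term. */
    public static Node term(String t) {
        return (tokens, heat) -> {
            boolean hit = false;
            for (int i = 0; i < tokens.length; i++) {
                if (tokens[i].equals(t)) { heat[i] += 1f; hit = true; }
            }
            return hit;
        };
    }

    /** Scores tokens only where all the phrase words appear together, in order. */
    public static Node phrase(String... words) {
        return (tokens, heat) -> {
            boolean hit = false;
            for (int i = 0; i + words.length <= tokens.length; i++) {
                if (Arrays.asList(tokens).subList(i, i + words.length)
                        .equals(Arrays.asList(words))) {
                    for (int j = 0; j < words.length; j++) heat[i + j] += 1f;
                    hit = true;
                }
            }
            return hit;
        };
    }

    /** Contributes heat only if every clause is satisfied; otherwise contributes nothing. */
    public static Node and(Node... clauses) {
        return (tokens, heat) -> {
            float[] scratch = new float[heat.length];
            for (Node c : clauses) {
                if (!c.score(tokens, scratch)) return false; // scratch is discarded
            }
            for (int i = 0; i < heat.length; i++) heat[i] += scratch[i];
            return true;
        };
    }

    /** Every clause is evaluated; satisfied clauses each add their own heat. */
    public static Node or(Node... clauses) {
        return (tokens, heat) -> {
            boolean hit = false;
            for (Node c : clauses) hit |= c.score(tokens, heat);
            return hit;
        };
    }

    /** Builds the per-token heat map for a document (naive whitespace tokenizer). */
    public static float[] heatMap(Node query, String text) {
        String[] tokens = text.toLowerCase().split("\\s+");
        float[] heat = new float[tokens.length];
        query.score(tokens, heat);
        return heat;
    }

    /** Returns the start index of the hottest window of 'width' consecutive tokens. */
    public static int bestRun(float[] heat, int width) {
        int best = 0;
        float bestSum = -1f, sum = 0f;
        for (int i = 0; i < heat.length; i++) {
            sum += heat[i];
            if (i >= width) sum -= heat[i - width];
            if (i >= width - 1 && sum > bestSum) { bestSum = sum; best = i - width + 1; }
        }
        return best;
    }

    public static void main(String[] args) {
        // ("Doug Cutting" AND lucene) OR google
        Node q = or(and(phrase("doug", "cutting"), term("lucene")), term("google"));
        System.out.println(Arrays.toString(heatMap(q, "Doug Cutting loves google")));
        System.out.println(Arrays.toString(heatMap(q, "Doug Cutting wrote lucene")));
    }
}
```

For "Doug Cutting loves google" only the "google" token gets any heat, 
because the AND clause fails without "lucene". For "Doug Cutting wrote 
lucene" the phrase tokens and "lucene" are all scored. bestRun is just 
one simple way to pick a "run": a fixed-width sliding window over the 
heat values.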

