From: Doug Cutting
Date: Wed, 14 May 2003 09:36:17 -0700
To: Lucene Developers List
Subject: Re: Term highlighting
Message-ID: <3EC27081.1030008@lucene.com>
In-Reply-To: <200305142147.13301.jbaxter@panscient.com>

Jonathan Baxter wrote:
> I have been looking at implementing highlighting of the terms in the
> documents returned by Lucene. I'd rather not have to retokenize the
> document on-the-fly in order to locate the terms, since this is slow
> and wasteful.

Have you actually implemented this and found it to be too slow in your
application? I suspect not. Since most folks only display around 10 hits
at a time, it is typically quite fast to re-tokenize them.

Keep in mind that, even if you knew the positions of the matching
tokens, you'd still need to scan some of the document's text to
construct a context string. And typically you won't want to show all of
the matches in the document, only a handful of the best ones. The
practical advantage of knowing character positions is thus usually
quite small.

> - have I missed something obvious and in fact there is a simple way to
> extract term-location information for a specific document from the
> Lucene index?

No, Lucene does not provide this.

> - if not, would it be horribly slow to try and do it post-facto after
> hits have been found by scanning through the ".prx" file from the
> start of the information for each term in the query?

Yes, this would be slow, about as slow as running the query again. And
it would only give you the ordinal position of the term, not its
character position.
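For reference, the re-tokenizing approach recommended above can be
sketched in a few lines against the analysis API of the era (a
Token-returning TokenStream with termText()/startOffset()/endOffset()).
This is a minimal illustration, not Lucene code: the class name, the
<b> markup, and passing the query's terms in as a Set are all
assumptions.

import java.io.IOException;
import java.io.StringReader;
import java.util.Set;

import org.apache.lucene.analysis.Analyzer;
import org.apache.lucene.analysis.Token;
import org.apache.lucene.analysis.TokenStream;

public class SimpleHighlighter {

  // Re-tokenizes the stored text of one hit and wraps every token whose
  // text appears in queryTerms with <b>...</b>.  Only the handful of
  // displayed hits are processed, so this is typically fast.
  public static String highlight(Analyzer analyzer, String field,
                                 String text, Set queryTerms)
      throws IOException {
    TokenStream stream = analyzer.tokenStream(field, new StringReader(text));
    StringBuffer out = new StringBuffer();
    int last = 0;                              // end of last region copied
    for (Token t = stream.next(); t != null; t = stream.next()) {
      if (queryTerms.contains(t.termText())) {
        out.append(text.substring(last, t.startOffset()));
        out.append("<b>");
        out.append(text.substring(t.startOffset(), t.endOffset()));
        out.append("</b>");
        last = t.endOffset();
      }
    }
    out.append(text.substring(last));          // text after the last match
    stream.close();
    return out.toString();
  }
}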
> - if the answer to the second question is "yes - horribly slow", would
> it make sense then to add an extra field to each entry in the ".frq"
> file indicating where the location information for the term and
> document is in the ".prx" file (ie, the .frq file info for each term
> would consist of a series of <doc_number, frequency,
> prx_pointer_offset> triples, where prx_pointer_offset gives the number
> of bytes to skip in the .prx file to get to the location information
> for the specified document)? The prx_pointer_offset could then be used
> in a boolean query to compute pointers for each hit indicating where
> in the .prx file the location information for each term starts.

This would nearly double the size of the .frq file, and thus make
searches nearly twice as slow, as they'd have to process double the
data. (Frequency entries only require a couple of bits on average, so
the majority of the space in the .frq file is document numbers.) And
still, you'd only have the ordinal position. Also, the bookkeeping and
memory required to track and store the positions of each match would
make search a lot slower.

In short, re-tokenizing is the most efficient way to do term
highlighting, especially when you consider the expense of the
alternatives on the rest of the system. There's no point in making
highlighting fast if it makes searches slow.

Doug
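To put rough numbers on the size argument: with Lucene's variable-byte
("VInt") posting encoding, a .frq entry is a document-number delta of
typically one or two bytes, with a frequency of 1 folded into its low
bit, so appending a .prx pointer offset as another VInt per entry adds
a comparable number of bytes. The sketch below illustrates this; the
gap and offset figures are made-up assumptions, not measurements.

public class FrqSizeEstimate {

  // Bytes Lucene's VInt encoding needs for a value: 7 payload bits per
  // byte, with the high bit used as a continuation flag.
  static int vIntBytes(long v) {
    int n = 1;
    while (v >= 0x80) { v >>>= 7; n++; }
    return n;
  }

  public static void main(String[] args) {
    int postings = 1000000;  // postings for one term
    int avgDocGap = 50;      // assumed average gap between doc numbers
    int avgPrxSkip = 100;    // assumed average .prx offset delta per doc

    // Doc delta shifted left one bit, frequency folded into the low bit.
    long current = (long) postings * vIntBytes(avgDocGap << 1);
    long withPtr = current + (long) postings * vIntBytes(avgPrxSkip);

    System.out.println("current .frq bytes: " + current);  // ~1,000,000
    System.out.println("with prx pointers:  " + withPtr);  // ~2,000,000
  }
}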