From: Doug Cutting
Date: Wed, 14 May 2003 09:36:17 -0700
To: Lucene Developers List
Subject: Re: Term highlighting
Message-ID: <3EC27081.1030008@lucene.com>
In-Reply-To: <200305142147.13301.jbaxter@panscient.com>

Jonathan Baxter wrote:
> I have been looking at implementing highlighting of the terms in the
> documents returned by Lucene. I'd rather not have to retokenize the
> document on-the-fly in order to locate the terms, since this is slow
> and wasteful.

Have you actually implemented this and found it to be too slow in your
application? I suspect not. Since most folks only display around 10 hits
at a time, it is typically quite fast to re-tokenize them.

Keep in mind that, even if you knew the positions of the matching
tokens, you'd still need to scan some of the document's text to
construct a context string. And typically you won't want to show all of
the matches in the document, only a handful of the best ones. The
practical advantage of knowing character positions is thus usually
quite small.

> - have I missed something obvious and in fact there is a simple way to
> extract term-location information for a specific document from the
> Lucene index?

No, Lucene does not provide this.

> - if not, would it be horribly slow to try and do it post-facto after
> hits have been found by scanning through the ".prx" file from the
> start of the information for each term in the query?

Yes, this would be slow, about as slow as running the query again. And
it would only give you the ordinal position of the term, not its
character position.
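For reference, the re-tokenizing approach recommended above can be
sketched in a few lines against the analysis API of the era (a
Token-returning TokenStream with termText()/startOffset()/endOffset()).
This is a minimal illustration, not Lucene code: the class name, the
<b> markup, and passing the query's terms in as a Set are all
assumptions.

import java.io.IOException;
import java.io.StringReader;
import java.util.Set;

import org.apache.lucene.analysis.Analyzer;
import org.apache.lucene.analysis.Token;
import org.apache.lucene.analysis.TokenStream;

public class SimpleHighlighter {

  // Re-tokenizes the stored text of one hit and wraps every token whose
  // text appears in queryTerms with <b>...</b>.  Only the handful of
  // displayed hits are processed, so this is typically fast.
  public static String highlight(Analyzer analyzer, String field,
                                 String text, Set queryTerms)
      throws IOException {
    TokenStream stream = analyzer.tokenStream(field, new StringReader(text));
    StringBuffer out = new StringBuffer();
    int last = 0;                              // end of last region copied
    for (Token t = stream.next(); t != null; t = stream.next()) {
      if (queryTerms.contains(t.termText())) {
        out.append(text.substring(last, t.startOffset()));
        out.append("<b>");
        out.append(text.substring(t.startOffset(), t.endOffset()));
        out.append("</b>");
        last = t.endOffset();
      }
    }
    out.append(text.substring(last));          // text after the last match
    stream.close();
    return out.toString();
  }
}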
> - if the answer to the second question is "yes - horribly slow", would
> it make sense then to add an extra field to each entry in the ".frq"
> file indicating where the location information for the term and
> document is in the ".prx" file (ie, the .frq file info for each term
> would consist of a series of <doc_number, frequency,
> prx_pointer_offset> triples, where prx_pointer_offset gives the number
> of bytes to skip in the .prx file to get to the location information
> for the specified document)? The prx_pointer_offset could then be used
> in a boolean query to compute pointers for each hit indicating where
> in the .prx file the location information for each term starts.

This would nearly double the size of the .frq file, and thus make
searches nearly twice as slow, as they'd have to process double the
data. (Frequency entries only require a couple of bits on average, so
the majority of the space in the .frq file is document numbers.) And
still, you'd only have the ordinal position. Also, the bookkeeping and
memory required to track and store the positions of each match would
make search a lot slower.

In short, re-tokenizing is the most efficient way to do term
highlighting, especially when you consider the expense of the
alternatives on the rest of the system. There's no point in making
highlighting fast if it makes searches slow.

Doug
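To put rough numbers on the size argument: with Lucene's variable-byte
("VInt") posting encoding, a .frq entry is a document-number delta of
typically one or two bytes, with a frequency of 1 folded into its low
bit, so appending a .prx pointer offset as another VInt per entry adds
a comparable number of bytes. The sketch below illustrates this; the
gap and offset figures are made-up assumptions, not measurements.

public class FrqSizeEstimate {

  // Bytes Lucene's VInt encoding needs for a value: 7 payload bits per
  // byte, with the high bit used as a continuation flag.
  static int vIntBytes(long v) {
    int n = 1;
    while (v >= 0x80) { v >>>= 7; n++; }
    return n;
  }

  public static void main(String[] args) {
    int postings = 1000000;  // postings for one term
    int avgDocGap = 50;      // assumed average gap between doc numbers
    int avgPrxSkip = 100;    // assumed average .prx offset delta per doc

    // Doc delta shifted left one bit, frequency folded into the low bit.
    long current = (long) postings * vIntBytes(avgDocGap << 1);
    long withPtr = current + (long) postings * vIntBytes(avgPrxSkip);

    System.out.println("current .frq bytes: " + current);  // ~1,000,000
    System.out.println("with prx pointers:  " + withPtr);  // ~2,000,000
  }
}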