From: "Robert Engels"
To: "Lucene Developers List"
Subject: RE: strange lucene search behavior?
Date: Tue, 20 Jan 2004 11:48:18 -0600

Doug,

I changed the search to use the HitCollector, and indeed it then requests
the TermDocs for a term only once, so in this case the LRU would not help
at all for a single query.

The TermDocs LRU should still help a lot for repeated common searches
(which are common in our application), especially when filtering using a
date range, because the user is almost always looking for 'recent'
documents.

I can work the LRU into the code base. My implementation does not use
LinkedHashMap, but rather a HashMap and an associated LinkedList, so there
would not be any 1.4 issues.

Since the index is never rewritten, and if the cache is per index, I do not
need to worry about a stale cache, correct?

Robert

-----Original Message-----
From: Doug Cutting [mailto:cutting@apache.org]
Sent: Tuesday, January 20, 2004 11:30 AM
To: Lucene Developers List
Subject: Re: strange lucene search behavior?

Robert,

Some time back someone benchmarked adding an LRU cache to TermInfosReader
and was unable to see any significant overall speedup in query processing.
If you find otherwise, please submit a patch. Java 1.4's LinkedHashMap
would make the implementation of such a cache very simple, but,
unfortunately, not all Lucene users are using 1.4 yet.

Also, if you wish to retrieve all of the hits, rather than just a portion,
please use the HitCollector API rather than the Hits API.
The Hits API is optimized for applications which are only displaying a few
of the top hits.

Doug

Robert Engels wrote:
> In working with Lucene, I notice that when performing searches, it retrieves
> the documents for the same term multiple times. I think this may be because
> the Hits collection only stores a certain number of items, but would it not
> be better to just increase the size of the Hits collection, rather than
> perform the extra, relatively very expensive, read of the term docs?
>
> The following is the trace output from Lucene performing 2 single term
> searches, and a multiple term search: (notice that in each case, the
> documents for a term are asked for twice).
>
> expression = +epson, query = +text:epson
> findTermInfo() text:epson, time = 0
> SearchTermDocs, seek() on text:epson
> SearchTermDocs, seek() on text:epson [cached]
> find, hits = 224, query time = 16, doc (150) time = 15, total time = 31
>
> expression = +printer, query = +text:printer
> findTermInfo() text:printer, time = 16
> SearchTermDocs, seek() on text:printer
> SearchTermDocs, seek() on text:printer [cached]
> find, hits = 5358, query time = 62, doc (150) time = 282, total time = 344
>
> expression = +epson +printer, query = +text:epson +text:printer
> SearchTermDocs, seek() on text:epson [cached]
> SearchTermDocs, seek() on text:printer [cached]
> SearchTermDocs, seek() on text:epson [cached]
> SearchTermDocs, seek() on text:printer [cached]
> find, hits = 175, query time = 15, doc (150) time = 47, total time = 62
>
> In order to limit the performance hit, our implementation caches the returned
> docs within a query (the [cached] tag), but it seems the issue would be
> better addressed by the Lucene engine.
>
> Any thoughts on this?
---------------------------------------------------------------------
To unsubscribe, e-mail: lucene-dev-unsubscribe@jakarta.apache.org
For additional commands, e-mail: lucene-dev-help@jakarta.apache.org
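
For readers following the thread, here is a rough sketch of the kind of LRU cache Robert describes: a HashMap for lookups plus a LinkedList that tracks recency, with no use of Java 1.4's LinkedHashMap. It is not his implementation and not Lucene code. The class name LruCache, its fields, and the string keys and values in main are all hypothetical stand-ins (a real cache would presumably hold term document data). Generics are used here only for readability; code targeting pre-1.4 JVMs would need raw types and casts.

```java
import java.util.HashMap;
import java.util.LinkedList;
import java.util.Map;

/**
 * Hypothetical LRU cache built from a HashMap and a LinkedList.
 * Illustrative only; names and structure are assumptions, not Lucene API.
 */
class LruCache<K, V> {
    private final int capacity;
    private final Map<K, V> values = new HashMap<K, V>();
    // Keys ordered from most recently used (head) to least recently used (tail).
    private final LinkedList<K> order = new LinkedList<K>();

    public LruCache(int capacity) {
        if (capacity < 1) {
            throw new IllegalArgumentException("capacity must be at least 1");
        }
        this.capacity = capacity;
    }

    /** Returns the cached value, or null on a miss. A hit marks the key as most recent. */
    public synchronized V get(K key) {
        if (!values.containsKey(key)) {
            return null;
        }
        order.remove(key);   // LinkedList.remove(Object) is a linear scan
        order.addFirst(key);
        return values.get(key);
    }

    /** Stores a value, evicting the least recently used entry if the cache is full. */
    public synchronized void put(K key, V value) {
        if (values.containsKey(key)) {
            order.remove(key);              // re-inserted at the head below
        } else if (values.size() >= capacity) {
            K eldest = order.removeLast();  // tail is the least recently used key
            values.remove(eldest);
        }
        order.addFirst(key);
        values.put(key, value);
    }

    public synchronized int size() {
        return values.size();
    }

    public static void main(String[] args) {
        // Placeholder keys and values standing in for terms and their cached docs.
        LruCache<String, String> cache = new LruCache<String, String>(2);
        cache.put("text:epson", "docs for epson");
        cache.put("text:printer", "docs for printer");
        cache.get("text:epson");                    // epson becomes most recent
        cache.put("text:recent", "docs for recent"); // evicts text:printer
        System.out.println(cache.get("text:printer") == null);
    }
}
```

The LinkedList makes every hit an O(n) scan to move the key to the head, which is acceptable for a small cache but is the main cost that LinkedHashMap avoids. The sketch also never invalidates entries, so it relies on the assumption raised in the thread that a given index is never rewritten while the cache is alive.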