Date: Tue, 28 Oct 2003 10:50:10 +0000
From: Maurice Coyle
To: Lucene Users List
Subject: Re: large index query time

that's very helpful, thanks to all who replied.
my index is definitely larger than my RAM so i guess the increase in query time is due to an increase in time to open the index/perform a search.

thanks again,
maurice

----- Original Message -----
From: "Tate Avery"
To: "Lucene Users List"
Sent: Friday, October 24, 2003 5:33 PM
Subject: RE: large index query time

Below are some posts from Doug (circa 2001) that I found very helpful with regard to understanding Lucene scalability. I am assuming that they are still generally applicable. You might also find them useful.

Tate

-----------------------------------------------------------

Performance for large indices is frequently governed by i/o performance. If an index is larger than RAM then searches will need to read data from disk. This can quickly become a bottleneck. A search for a term that occurs in a million documents can require over 1MB of data, which can take some time to read. With multiple searching threads, the disk can easily become a bottleneck. Disk arrays can alleviate this, more RAM helps even more!

For some folks, queries that take over a second are unacceptable, for others, ten seconds is okay. Performance should be more-or-less linear: a two-million document index will be almost twice as slow to search as a one-million document index. There are lots of factors, including document size, CPU-speed, RAM-size, i/o subsystem, but a rough rule-of-thumb for Lucene performance might be that, in a "typical" configuration, it can search a million documents per second.

So if you need to search 20 million 100kB documents on a 100Mhz 386 with 8MB of RAM with sub-second response time, Lucene will probably fail. But if you need to search two million 2kB documents on a 500Mhz Pentium with 128MB of RAM in a couple of seconds per query, you're probably okay.

- Doug Cutting (10/08/2001)

Some more precise statements:

The cost to search for a term is proportional to the number of documents that contain that term.
The cost to search for a phrase is proportional to the sum of the number of occurrences of its constituent terms.
The cost to execute a boolean query is the sum of the costs of its sub-queries.
Longer documents contain more terms: usually both more unique terms and more occurrences.

Total vocabulary size is not a big factor in search performance. When you open an index Lucene does read one out of every 128 unique terms into a table, so an index with a large number of unique terms will be slower to open. Searching that table for query terms is also slower for bigger indexes, but the time to search that table is not significant in overall performance.

Lucene also reads at index open one byte per document per indexed field (the normalization factor). So an index with lots of documents and fields will also be slower to open. But, once opened, the cost of searching is largely dependent on the frequency characteristics of query terms. And, since IndexReaders and Searchers are thread safe, you don't need to open indexes very often.

- Doug Cutting (10/08/2001)

-----Original Message-----
From: Dan Quaroni [mailto:dquaroni@OPENRATINGS.com]
Sent: October 24, 2003 1:33 PM
To: 'Lucene Users List'
Subject: RE: large index query time

My experience is that the query time (and memory usage) can be affected greatly by booleans that retrieve lots of results. Are you finding it slow when doing a simple query that should return only a handful of results, or is it on more complex queries?

-----Original Message-----
From: Maurice Coyle [mailto:maurice.coyle@ucd.ie]
Sent: Friday, October 24, 2003 1:29 PM
To: Lucene Users List
Subject: large index query time

hi,

i recently merged a whole lot of indexes into one big index for testing purposes. however, now the programs i use to search the index are taking much longer. this may be a stupid question (or very simple) and please tell me if it is, but should this be the case?
i mean, i realise it'll take longer to search over a larger collection, but it's taking an order of magnitude longer. this is the reason i'm asking, since if lucene is capable of handling large-scale search apps presumably it's set up to search large collections rapidly.

maybe there are some steps i can take to speed things up (i optimised the big index when it was finished being created) or something i'm missing? if i can give any information which will help the diagnosis of this problem please specify it.

thanks,
maurice

---------------------------------------------------------------------
To unsubscribe, e-mail: lucene-user-unsubscribe@jakarta.apache.org
For additional commands, e-mail: lucene-user-help@jakarta.apache.org
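For anyone who wants to play with the numbers, the rules of thumb quoted from Doug earlier in this thread can be turned into a small back-of-envelope calculator. This is only a sketch under stated assumptions: the class QueryCostSketch and all of its method names are made up for illustration and are not part of Lucene, the "cost" values are unitless (the posts only give proportionality, so only relative comparisons mean anything), and the one-million-documents-per-second figure is the rough "typical configuration" guess from the 2001 post, not a measurement.

```java
/** Back-of-envelope cost model; class and method names are hypothetical, not Lucene APIs. */
final class QueryCostSketch {

    /** Term query: cost is proportional to the number of documents containing the term. */
    static long termCost(long docFreq) {
        return docFreq;
    }

    /** Phrase query: cost is proportional to the summed occurrence counts of its terms. */
    static long phraseCost(long... termOccurrences) {
        long sum = 0;
        for (long n : termOccurrences) {
            sum += n;
        }
        return sum;
    }

    /** Boolean query: cost is the sum of the costs of its sub-queries. */
    static long booleanCost(long... subQueryCosts) {
        long sum = 0;
        for (long c : subQueryCosts) {
            sum += c;
        }
        return sum;
    }

    /** At index open, roughly one out of every 128 unique terms is read into a table. */
    static long termsLoadedAtOpen(long uniqueTerms) {
        return uniqueTerms / 128;
    }

    /** At index open, one norm byte is read per document per indexed field. */
    static long normBytesAtOpen(long numDocs, int indexedFields) {
        return numDocs * indexedFields;
    }

    /** Assumed rule of thumb: about a million documents searched per second, scaling linearly. */
    static double roughSearchSeconds(long numDocs) {
        return numDocs / 1_000_000.0;
    }

    public static void main(String[] args) {
        long term = termCost(1_000_000);            // term found in a million documents
        long phrase = phraseCost(500_000, 20_000);  // two-term phrase, occurrences per term
        long bool = booleanCost(term, phrase);      // boolean query combining both

        System.out.println("term query cost units:    " + term);
        System.out.println("phrase query cost units:  " + phrase);
        System.out.println("boolean query cost units: " + bool);
        System.out.println("terms read at open:       " + termsLoadedAtOpen(1_280_000));
        System.out.println("norm bytes read at open:  " + normBytesAtOpen(2_000_000, 5));
        System.out.println("rough seconds per search: " + roughSearchSeconds(2_000_000));
    }
}
```

The point of the sketch is that the search cost depends on the frequency of the query terms, not on vocabulary size, while the open cost depends on the number of unique terms, documents and indexed fields. That is why keeping one searcher open and sharing it across threads avoids paying the open cost on every query.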