From: "Alex Aw Seat Kiong"
To: "Lucene Users List"
Subject: lucene indexing and searching engine performance
Date: Thu, 30 Oct 2003 16:01:56 +0800

Hi Doug Cutting!

That's really very helpful, thanks Doug.

I'm researching the performance of Lucene's indexing and searching speed.
Would you be able to give me more details on the following?

1. Searching

> But if you
> need to search two million 2kB documents on a 500Mhz Pentium with 128MB of
> RAM in a couple of seconds per query, you're probably okay.

What are the other hardware specs, for example:
- SCSI hard disk or IDE hard disk? If it's a SCSI hard disk, what are the
  models of the hard disk and the SCSI card, and is it XXXX RPM?
- Which OS was used for this performance testing?
- Which application server was used for this performance testing?

2. Indexing (assume the hardware and software specs are the same as for the
searching server)

Index space should generally be less than the original document size, right?
Assuming 500MB of disk space for the application:
Max index size: should be more than 250,000 documents of 2KB each, right?
Max indexing speed: ??? documents of 2KB each per hour

Could you share the performance test that was done with the rest of us?

Thank you.

Regards,
AlexAw

----- Original Message -----
From: "Maurice Coyle"
To: "Lucene Users List"
Sent: Tuesday, October 28, 2003 6:50 PM
Subject: Re: large index query time

> that's very helpful, thanks to all who replied.
>
> my index is definitely larger than my RAM so i guess the increase in query
> time is due to an increase in time to open the index/perform a search.
>
> thanks again,
> maurice
>
>
> ----- Original Message -----
> From: "Tate Avery"
> To: "Lucene Users List"
> Sent: Friday, October 24, 2003 5:33 PM
> Subject: RE: large index query time
>
>
> Below are some posts from Doug (circa 2001) that I found very helpful with
> regard to understanding Lucene scalability. I am assuming that they are
> still generally applicable. You might also find them useful.
>
> Tate
>
>
> -----------------------------------------------------------
>
>
> Performance for large indices is frequently governed by i/o performance. If
> an index is larger than RAM then searches will need to read data from disk.
> This can quickly become a bottleneck.
> A search for a term that occurs in a
> million documents can require over 1MB of data, which can take some time to
> read. With multiple searching threads, the disk can easily become a
> bottleneck. Disk arrays can alleviate this, more RAM helps even more!
>
> For some folks, queries that take over a second are unacceptable, for
> others, ten seconds is okay.
>
> Performance should be more-or-less linear: a two-million document index will
> be almost twice as slow to search as a one-million document index. There
> are lots of factors, including document size, CPU-speed, RAM-size, i/o
> subsystem, but a rough rule-of-thumb for Lucene performance might be that,
> in a "typical" configuration, it can search a million documents per second.
>
> So if you need to search 20 million 100kB documents on a 100Mhz 386 with 8MB
> of RAM with sub-second response time, Lucene will probably fail. But if you
> need to search two million 2kB documents on a 500Mhz Pentium with 128MB of
> RAM in a couple of seconds per query, you're probably okay.
>
> - Doug Cutting (10/08/2001)
>
>
> Some more precise statements: The cost to search for a term is proportional
> to the number of documents that contain that term. The cost to search for a
> phrase is proportional to the sum of the number of occurrences of its
> constituent terms. The cost to execute a boolean query is the sum of the
> costs of its sub-queries. Longer documents contain more terms: usually both
> more unique terms and more occurrences.
>
> Total vocabulary size is not a big factor in search performance. When you
> open an index Lucene does read one out of every 128 unique terms into a
> table, so an index with a large number of unique terms will be slower to
> open. Searching that table for query terms is also slower for bigger
> indexes, but the time to search that table is not significant in overall
> performance. Lucene also reads at index open one byte per document per
> indexed field (the normalization factor).
> So an index with lots of
> documents and fields will also be slower to open. But, once opened, the
> cost of searching is largely dependent on the frequency characteristics of
> query terms. And, since IndexReaders and Searchers are thread safe, you
> don't need to open indexes very often.
>
> - Doug Cutting (10/08/2001)
>
>
> -----Original Message-----
> From: Dan Quaroni [mailto:dquaroni@OPENRATINGS.com]
> Sent: October 24, 2003 1:33 PM
> To: 'Lucene Users List'
> Subject: RE: large index query time
>
>
> My experience is that the query time (and memory usage) can be affected
> greatly by booleans that retrieve lots of results.
>
> Are you finding it slow when doing a simple query that should return only a
> handful of results, or is it on more complex queries?
>
> -----Original Message-----
> From: Maurice Coyle [mailto:maurice.coyle@ucd.ie]
> Sent: Friday, October 24, 2003 1:29 PM
> To: Lucene Users List
> Subject: large index query time
>
>
> hi,
> i recently merged a whole lot of indexes into one big index for testing
> purposes. however, now the programs i use to search the index are taking
> much longer. this may be a stupid question (or very simple) and please tell
> me if it is, but should this be the case? i mean, i realise it'll take
> longer to search over a larger collection, but it's taking an order of
> magnitude longer. this is the reason i'm asking, since if lucene is capable
> of handling large-scale search apps presumably it's set up to search large
> collections rapidly.
>
> maybe there's some steps i can take to speed things up (i optimised the big
> index when it was finished being created) or something i'm missing? if i
> can give any information which will help the diagnosis of this problem
> please specify it.
>
> thanks,
> maurice
>
>
> ---------------------------------------------------------------------
> To unsubscribe, e-mail: lucene-user-unsubscribe@jakarta.apache.org
> For additional commands, e-mail: lucene-user-help@jakarta.apache.org
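
The rules of thumb quoted from Doug in this thread can be turned into a rough back-of-envelope estimate. The sketch below is only an illustration under stated assumptions: every function name is hypothetical and none of it is Lucene API; query costs are abstract units proportional to postings read, not seconds; the disk-space estimate uses decimal megabytes and takes the question's own assumption that an indexed document needs no more space than its original 2 kB.

```python
# Back-of-envelope model of the performance statements quoted in this thread.
# All names are hypothetical illustrations; nothing here is Lucene API.

DOCS_PER_SECOND = 1_000_000   # quoted rule of thumb for a "typical" configuration
TERM_INDEX_INTERVAL = 128     # one out of every 128 unique terms is read at open


def term_cost(doc_freq):
    """Term query: proportional to the number of documents containing the term."""
    return doc_freq


def phrase_cost(term_occurrences):
    """Phrase query: proportional to the summed occurrences of its terms."""
    return sum(term_occurrences)


def boolean_cost(sub_costs):
    """Boolean query: the sum of the costs of its sub-queries."""
    return sum(sub_costs)


def open_cost(unique_terms, num_docs, indexed_fields):
    """Return (terms loaded into the in-memory table, norm bytes read) at open."""
    terms_in_table = unique_terms // TERM_INDEX_INTERVAL
    norm_bytes = num_docs * indexed_fields  # one byte per document per field
    return terms_in_table, norm_bytes


def estimated_query_seconds(num_docs):
    """Roughly linear scaling: about a million documents searched per second."""
    return num_docs / DOCS_PER_SECOND


def docs_fitting(disk_bytes, bytes_per_doc):
    """Lower bound on documents that fit, assuming index size <= original size."""
    return disk_bytes // bytes_per_doc
```

For the two-million-document case, `estimated_query_seconds(2_000_000)` gives 2.0, in line with the quoted "couple of seconds per query", and `docs_fitting(500_000_000, 2_000)` gives 250000 for the 500MB question. Real numbers depend on disk, RAM, and query term frequencies, so these are order-of-magnitude estimates only.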