lucene-java-user mailing list archives

Site index · List index
Message view « Date » · « Thread »
Top « Date » · « Thread »
From "Tate Avery" <tate.av...@nstein.com>
Subject RE: large index query time
Date Fri, 24 Oct 2003 17:33:17 GMT

Below are some posts from Doug (circa 2001) that I found very helpful with regard to understanding
Lucene scalability.  I am assuming that they are still generally applicable.  You might also
find them useful.

Tate


-----------------------------------------------------------


Performance for large indices is frequently governed by i/o performance.  If
an index is larger than RAM then searches will need to read data from disk.
This can quickly become a bottleneck.  A search for a term that occurs in a
million documents can require over 1MB of data, which can take some time to
read.  With multiple searching threads, the disk can easily become a
bottleneck.  Disk arrays can alleviate this, more RAM helps even more!

For some folks, queries that take over a second are unacceptable, for
others, ten seconds is okay.

Performance should be more-or-less linear: a two-million document index will
be almost twice as slow to search as a one-million document index.  There
are lots of factors, including document size, CPU-speed, RAM-size, i/o
subsystem, but a rough rule-of-thumb for Lucene performance might be that,
in a "typical" configuration, it can search a million documents per second.

So if you need to search 20 million 100kB documents on a 100Mhz 386 with 8MB
of RAM with sub-second response time, Lucene will probably fail.  But if you
need to search two million 2kB documents on a 500Mhz Pentium with 128MB of
RAM in a couple of seconds per query, you're probably okay.

- Doug Cutting (10/08/2001)


Some more precise statements: The cost to search for a term is proportional
to the number of documents that contain that term.  The cost to search for a
phrase is proportional to the sum of the number of occurrences of its
constituent terms.  The cost to execute a boolean query is the sum of the
costs of its sub-queries.  Longer documents contain more terms: usually both
more unique terms and more occurrences.

Total vocabulary size is not a big factor in search performance.  When you
open an index Lucene does read one out of every 128 unique terms into a
table, so an index with a large number of unique terms will be slower to
open.  Searching that table for query terms is also slower for bigger
indexes, but the time to search that table is not significant in overall
performance.  Lucene also reads at index open one byte per document per
indexed field (the normalization factor).  So an index with lots of
documents and fields will also be slower to open.  But, once opened, the
cost of searching is largely dependent on the frequency characteristics of
query terms.  And, since IndexReaders and Searchers are thread safe, you
don't need to open indexes very often.

- Doug Cutting (10/08/2001)





-----Original Message-----
From: Dan Quaroni [mailto:dquaroni@OPENRATINGS.com]
Sent: October 24, 2003 1:33 PM
To: 'Lucene Users List'
Subject: RE: large index query time


My experience is that the query time (and memory usage) can be affected
greatly by booleans that retrieve lots of results.

Are you finding it slow when doing a simple query that should return only a
handful of results, or is it on more complex queries?

-----Original Message-----
From: Maurice Coyle [mailto:maurice.coyle@ucd.ie]
Sent: Friday, October 24, 2003 1:29 PM
To: Lucene Users List
Subject: large index query time


hi,
i recently merged a whole lot of indexes into one big index for testing
purposes.  however, now the programs i use to search the index are taking
much longer.  this may be a stupid question (or very simple) and please tell
me if it is, but should this be the case?  i mean, i realise it'll take
longer to search over a larger collection, but it's taking an order of
magnitude longer.  this is the reaosn i'm asking, since if lucene is capable
of handling large-scale search apps presumably it's set up to search large
collections rapidly.

maybe there's some steps i can take to speed things up (i optimised the big
index when it was finished being created) or something i'm missing?  if i
can give any information which will help the diagnosis of this problem
please specify it.

thanks,
maurice


---------------------------------------------------------------------
To unsubscribe, e-mail: lucene-user-unsubscribe@jakarta.apache.org
For additional commands, e-mail: lucene-user-help@jakarta.apache.org

---------------------------------------------------------------------
To unsubscribe, e-mail: lucene-user-unsubscribe@jakarta.apache.org
For additional commands, e-mail: lucene-user-help@jakarta.apache.org


---------------------------------------------------------------------
To unsubscribe, e-mail: lucene-user-unsubscribe@jakarta.apache.org
For additional commands, e-mail: lucene-user-help@jakarta.apache.org


Mime
View raw message