lucene-dev mailing list archives

From Doug Cutting <DCutt...@grandcentral.com>
Subject RE: CachingDirectory contribution
Date Mon, 08 Oct 2001 16:48:18 GMT
> From: Dave Kor [mailto:davekkw@yahoo.com]
> 
> This leads me to yet another of my burning questions..
> has anyone pushed Lucene to its limits yet? If so,
> what are they? What happens when Lucene hits its limit?
> Does it throw an exception? coredump? 

There are many limits that could be hit.  Lucene is designed so that hard
limits are hard to hit.  Lucene only caches a few critical data
structures in memory, in order to keep from hitting the JVM's heap size
limit, relying instead on the file system's caches for performance.  Lucene
uses 63-bit file pointers, so it will be a long time before raw index size
is a limit; however, filesystems that do not support files larger than, e.g.,
2GB will limit things.  Document and term numbers are 31-bit, so two billion
documents or terms is another limit that will probably not be hit too
soon.
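
To put numbers on the limits above, here is a minimal sketch.  It assumes
that "63-bit" and "31-bit" refer to the non-negative ranges of Java's signed
long and int types; the class and method names are hypothetical and are not
part of Lucene.

```java
public class IndexLimits {
    // Largest non-negative value of a signed 64-bit long: 2^63 - 1.
    // Assumption: file pointers are held in a Java long.
    static final long MAX_FILE_POINTER = Long.MAX_VALUE;

    // Largest non-negative value of a signed 32-bit int: 2^31 - 1.
    // Assumption: document and term numbers are held in a Java int.
    static final int MAX_DOC_NUMBER = Integer.MAX_VALUE;

    // The 2GB per-file cap mentioned as an example of a filesystem limit.
    static final long TWO_GB = 1L << 31;

    // True if a single file of the given size fits under the filesystem cap.
    static boolean fitsInFilesystem(long fileBytes, long fsMaxFileBytes) {
        return fileBytes <= fsMaxFileBytes;
    }

    public static void main(String[] args) {
        System.out.println("max file pointer: " + MAX_FILE_POINTER);
        System.out.println("max doc number:   " + MAX_DOC_NUMBER);
        long threeGb = 3L * 1024 * 1024 * 1024;
        System.out.println("3GB file fits under a 2GB cap: "
                + fitsInFilesystem(threeGb, TWO_GB));
    }
}
```

In this sketch the filesystem cap is reached many orders of magnitude
before the file-pointer range, which is why the filesystem is the limit
more likely to be hit in practice.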

Performance for large indices is frequently governed by i/o performance.  If
an index is larger than RAM then searches will need to read data from disk.
This can quickly become a bottleneck.  A search for a term that occurs in a
million documents can require over 1MB of data, which can take some time to
read.  With multiple searching threads, the disk can easily become a
bottleneck.  Disk arrays can alleviate this; more RAM helps even more!
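
The i/o cost above can be roughed out as follows.  This is only a
back-of-envelope sketch: the figure of about one byte per matching document
is inferred from "a million documents ... over 1MB", the disk throughput is
a made-up value to be replaced with a measured one, and the names are
hypothetical.

```java
public class PostingReadEstimate {
    // Bytes that must be read for one term, given how many documents
    // contain it and an assumed average cost per matching document.
    static long bytesToRead(long matchingDocs, double bytesPerPosting) {
        return Math.round(matchingDocs * bytesPerPosting);
    }

    // Seconds needed to read that much data from disk at a sustained
    // throughput, ignoring seeks and anything already in the file cache.
    static double readSeconds(long bytes, double diskBytesPerSecond) {
        return bytes / diskBytesPerSecond;
    }

    public static void main(String[] args) {
        // Assumed inputs: 1 byte per posting, 5 MB/s sustained reads.
        long bytes = bytesToRead(1_000_000L, 1.0);
        double seconds = readSeconds(bytes, 5_000_000.0);
        System.out.println(bytes + " bytes, about " + seconds + " s per uncached read");
    }
}
```

With several searching threads issuing such reads at once, the per-query
times add up on a single disk, which is where the bottleneck comes from.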

For some folks, queries that take over a second are unacceptable; for
others, ten seconds is okay.

Performance should be more-or-less linear: a two-million document index will
be almost twice as slow to search as a one-million document index.  There
are lots of factors, including document size, CPU-speed, RAM-size, i/o
subsystem, but a rough rule-of-thumb for Lucene performance might be that,
in a "typical" configuration, it can search a million documents per second.

So if you need to search 20 million 100kB documents on a 100MHz 386 with 8MB
of RAM with sub-second response time, Lucene will probably fail.  But if you
need to search two million 2kB documents on a 500MHz Pentium with 128MB of
RAM in a couple of seconds per query, you're probably okay.
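
The rule of thumb can be written down as a tiny estimator.  This is a rough
sketch only: it assumes strictly linear scaling at one million documents
per second in a "typical" configuration, it ignores document size, CPU, RAM
and the i/o subsystem, and the names are hypothetical.

```java
public class SearchTimeEstimate {
    // Rule-of-thumb search rate for a "typical" configuration.
    static final double DOCS_PER_SECOND = 1_000_000.0;

    // Estimated seconds per query, assuming time grows linearly with
    // the number of documents in the index.
    static double estimatedSeconds(long documents) {
        return documents / DOCS_PER_SECOND;
    }

    // True if the estimate fits within the acceptable response time.
    static boolean meetsBudget(long documents, double budgetSeconds) {
        return estimatedSeconds(documents) <= budgetSeconds;
    }

    public static void main(String[] args) {
        System.out.println("2 million docs, 2 s budget:  "
                + meetsBudget(2_000_000L, 2.0));
        System.out.println("20 million docs, 1 s budget: "
                + meetsBudget(20_000_000L, 1.0));
    }
}
```

The estimate misses the 20-million-document, sub-second case even before
the slow CPU and small RAM of that example are taken into account.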

Doug
