lucene-dev mailing list archives

From Doug Cutting <DCutt...@grandcentral.com>
Subject RE: CachingDirectory contribution
Date Tue, 09 Oct 2001 16:04:29 GMT
> From: Dmitry Serebrennikov [mailto:dmitrys@earthlink.net]
> 
> Doug, what about scaling with the number of concurrent 
> queries? In other 
> words, assuming that the queries are similar, how is the search 
> performance affected by increase in the number of concurrent users?

On a single-CPU, single-disk machine, throughput will likely be best with a
few simultaneous queries.  For example, if each query were to spend half of
its time on the CPU and half the time waiting for i/o, and queries take two
seconds total, then the ideal would be to have queries arrive every second,
so that while one was using the CPU the other could use the disk.  Things
don't usually work out quite so neatly, but you get the idea.
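
To make that concrete, here is a rough sketch of the kind of test you
might run: a few threads sharing a single IndexSearcher (searchers are
safe to share across threads), each running the same query in a loop.
The index path, query string, field name and thread count are just
placeholders, and the exact class names may vary between releases; the
point is to vary the thread count and compare queries per second.

  import org.apache.lucene.analysis.SimpleAnalyzer;
  import org.apache.lucene.queryParser.QueryParser;
  import org.apache.lucene.search.Hits;
  import org.apache.lucene.search.IndexSearcher;
  import org.apache.lucene.search.Query;

  public class ConcurrentSearchTest {
    public static void main(String[] args) throws Exception {
      final IndexSearcher searcher = new IndexSearcher("/path/to/index");
      final Query query =
        QueryParser.parse("some typical query", "contents", new SimpleAnalyzer());
      int nThreads = 2;                        // try 1, 2, 4, ... and compare
      Thread[] threads = new Thread[nThreads];
      long start = System.currentTimeMillis();
      for (int i = 0; i < nThreads; i++) {
        threads[i] = new Thread() {
          public void run() {
            try {
              for (int j = 0; j < 100; j++) {
                Hits hits = searcher.search(query);  // CPU and disk use overlap across threads
                hits.length();
              }
            } catch (Exception e) { e.printStackTrace(); }
          }
        };
        threads[i].start();
      }
      for (int i = 0; i < nThreads; i++) threads[i].join();
      long elapsed = Math.max(System.currentTimeMillis() - start, 1);
      System.out.println("queries/sec: " + (nThreads * 100 * 1000L / elapsed));
    }
  }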

Multi-CPU and/or multi-disk systems can provide greater parallelism and
hence query throughput.  However, Lucene's FSDirectory serializes reads to a
given file (since it has only a single file descriptor per file), which
limits i/o parallelism.  Someone with a large disk array would be better
served by a Directory implementation that uses Java 1.4's new i/o classes.
In particular, the FileChannel class supports reads that do not move the
file pointer, so that multiple reads on the same file can be in progress at
the same time.
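
For what it's worth, here is a minimal stand-alone sketch of the
positional read, not tied to Lucene's Directory classes; the file name
and buffer size are arbitrary.  FileChannel.read(ByteBuffer, long)
takes an explicit position and never moves the channel's file pointer,
so several threads can read the same file at once without serializing
on the descriptor.

  import java.io.RandomAccessFile;
  import java.nio.ByteBuffer;
  import java.nio.channels.FileChannel;

  public class PositionalReadDemo {
    public static void main(String[] args) throws Exception {
      // One shared channel, read concurrently from two threads.
      final FileChannel channel =
        new RandomAccessFile("someIndexFile", "r").getChannel();

      Runnable reader = new Runnable() {
        public void run() {
          try {
            ByteBuffer buf = ByteBuffer.allocate(1024);
            long position = (long) (Math.random() * channel.size());
            channel.read(buf, position);   // positional read; no shared file pointer
          } catch (Exception e) { e.printStackTrace(); }
        }
      };
      new Thread(reader).start();
      new Thread(reader).start();
    }
  }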

Another approach to scaling performance is to use lots of little boxes
instead of one big box.  Here you split the index into sub-indexes,
broadcast the search to all of the sub-indexes, and merge the results.  To
minimize the number of boxes, you want to give each box the largest index it
can search quickly, typically one that mostly fits in RAM.  It would be
fairly easy to implement such a distributed search system with Lucene,
writing a multi-threaded version of MultiSearcher that uses RPC to send
queries to remote systems.
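
The fan-out and merge might look something like the sketch below.  The
RemoteSearcher interface and ScoredDoc class are made-up stand-ins for
whatever RPC mechanism and hit representation you end up with; only
the threading pattern matters here.

  import java.util.ArrayList;
  import java.util.Collections;
  import java.util.Comparator;
  import java.util.List;

  public class DistributedSearch {

    /** Hypothetical stub for a search against one remote sub-index. */
    interface RemoteSearcher {
      ScoredDoc[] search(String query, int n) throws Exception;
    }

    static class ScoredDoc {
      String id; float score;
      ScoredDoc(String id, float score) { this.id = id; this.score = score; }
    }

    /** Broadcast the query to every sub-index in parallel, then merge by score. */
    static List searchAll(final RemoteSearcher[] nodes, final String query, final int n)
        throws InterruptedException {
      final List merged = Collections.synchronizedList(new ArrayList());
      Thread[] threads = new Thread[nodes.length];
      for (int i = 0; i < nodes.length; i++) {
        final RemoteSearcher node = nodes[i];
        threads[i] = new Thread() {
          public void run() {
            try {
              ScoredDoc[] hits = node.search(query, n);
              for (int j = 0; j < hits.length; j++) merged.add(hits[j]);
            } catch (Exception e) { e.printStackTrace(); }
          }
        };
        threads[i].start();
      }
      for (int i = 0; i < threads.length; i++) threads[i].join();
      Collections.sort(merged, new Comparator() {          // best scores first
        public int compare(Object a, Object b) {
          float diff = ((ScoredDoc) b).score - ((ScoredDoc) a).score;
          return diff > 0 ? 1 : (diff < 0 ? -1 : 0);
        }
      });
      return merged.size() > n ? merged.subList(0, n) : merged;
    }
  }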

Deciding whether it is more cost-effective to add CPUs, disks, or RAM is
complex.  In short, the best approach is to benchmark various
configurations.

The big internet search engines do all of the above, plus other tricks.  For
example, they can keep two indexes: one containing one million "popular"
pages, and another containing 100 million less popular pages.  If they don't
let folks see past the 1000th hit, and they find 1500 hits in the popular
index, then they can return just the hits from the popular index, and report
that there are 150,000 hits total.  They don't know how many there really
are, but that's a good estimate and no one can prove them wrong.  If they
can thus answer the vast majority of queries from a small, "popular" index,
then they don't have to make the big index as scalable, and can save lots of
money.
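
(For the curious, the 150,000 is just the 1,500 popular-index hits
scaled up by the roughly hundred-fold larger full collection.)  The
dispatch logic might look like the sketch below; the Index and Result
types, the visible-hit limit, and the size ratio are all invented for
illustration.

  public class TieredSearch {

    static final int MAX_VISIBLE_HITS = 1000;  // users can't page past this
    static final int SIZE_RATIO = 100;         // full collection ~100x the popular index

    /** Hypothetical per-index search: top hits plus a true hit count. */
    interface Index {
      Object[] topHits(String query, int n);
      int hitCount(String query);
    }

    /** Hypothetical result holder: the hits to show plus an estimated total. */
    static class Result {
      Object[] hits; long estimatedTotal;
      Result(Object[] hits, long estimatedTotal) {
        this.hits = hits; this.estimatedTotal = estimatedTotal;
      }
    }

    static Result search(Index popular, Index full, String query) {
      int popularCount = popular.hitCount(query);
      if (popularCount >= MAX_VISIBLE_HITS) {
        // Enough popular hits to fill every page the user is allowed to
        // see; extrapolate the total rather than touching the big index.
        return new Result(popular.topHits(query, MAX_VISIBLE_HITS),
                          (long) popularCount * SIZE_RATIO);
      }
      return new Result(full.topHits(query, MAX_VISIBLE_HITS),
                        full.hitCount(query));
    }
  }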

Doug
