lucene-solr-user mailing list archives

From Toke Eskildsen
Subject Re: capacity planning
Date Tue, 11 Oct 2011 17:49:25 GMT
Travis Low wrote:
> Toke, thanks.  Comments embedded (hope that's okay):

Inline or top-posting? Long discussion, but for mailing lists I clearly prefer the former.

[Toke: Estimate characters]

> Yes.  We estimate each of the 23K DB records has 600 pages of text for the
> combined documents, 300 words per page, 5 characters per word.  Which
> coincidentally works out to about 21GB, so good guessing there. :)

Heh. Lucky guess indeed, although the factors were off. Anyway, 21GB does not sound scary
at all.
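
For anyone who wants to redo the estimate with their own numbers, here is a minimal sketch of the arithmetic. The figures are the ones quoted above; the function name, the one-byte-per-character assumption and the decimal gigabytes are mine, and spaces and punctuation are ignored.

```python
# Back-of-the-envelope size of the extracted text, using the figures quoted above.
# Assumptions (not from the thread): 1 byte per character, 1 GB = 10**9 bytes.

def estimate_text_bytes(records, pages_per_record, words_per_page,
                        chars_per_word, bytes_per_char=1):
    """Return the estimated raw text size in bytes."""
    return (records * pages_per_record * words_per_page
            * chars_per_word * bytes_per_char)

total_bytes = estimate_text_bytes(
    records=23_000,        # DB records
    pages_per_record=600,  # pages of text per record
    words_per_page=300,
    chars_per_word=5,
)

print(f"{total_bytes / 1e9:.1f} GB")  # prints "20.7 GB"
```

That rounds to the roughly 21GB mentioned above. A multi-byte encoding or heavy markup would shift the number, so treat it as an order-of-magnitude figure only.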

> The way it works is we have researchers modifying the DB records during the
> day, and they may upload documents at that time.  We estimate 50-60 uploads
> throughout the day.  If possible, we'd like to index them as they are
> uploaded, but if that would negatively affect the search, then we can
> rebuild the index nightly.
> Which is better?

The analysis part is purely CPU-bound and you're running multi-core, so as long as you analyze
using only one thread you're safe there. That leaves us with I/O: even for spinning drives, a daily
load of just 60 updates of 1MB of extracted text each shouldn't have any real effect, with
the usual caveat that large merges should be avoided, either by optimizing at night or by tweaking
the merge policy to avoid large segments. With such a relatively small index, (re)opening and
warm-up should be painless too.
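
To put the update load in perspective, here is a rough sketch of the daily volume compared to the total amount of text. The 60 uploads, the 1MB per upload and the 21GB figure come from this thread; the variable names and the decimal units are my own assumptions, and the comparison covers extracted text only, not the on-disk index or merge traffic.

```python
# Daily update volume versus total extracted text (rough numbers from the thread).
# Assumptions (not from the thread): 1 MB = 10**6 bytes, 1 GB = 10**9 bytes.

uploads_per_day = 60              # upper end of the 50-60 estimate
bytes_per_upload = 1_000_000      # about 1MB of extracted text per upload
total_text_bytes = 20_700_000_000 # about 21GB of extracted text in total

daily_bytes = uploads_per_day * bytes_per_upload
daily_fraction = daily_bytes / total_text_bytes

print(f"{daily_bytes / 1e6:.0f} MB per day")
print(f"{daily_fraction:.2%} of the total text")
```

This works out to about 60MB of new text per day, a fraction of a percent of the whole, which is why the update rate alone is not a concern here.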

Summary: 300GB is a fair amount of data and takes some power to crunch. However, on the Solr/Lucene
end, your index size and your update rate are nothing to worry about. The usual caveat for advanced
use applies.

[Toke: i7, 8GB, 1TB spinning, 256GB SSD]

> We have a very beefy VM server that we will use for benchmarking, but your
> specs provide a starting point.  Thanks very much for that.

I have little experience with VM servers for search. Although we use a lot of virtual machines,
we use dedicated machines for our searchers, primarily to ensure low latency for I/O. Virtual
machines might be fine for that too, but we haven't tried it yet.

Glad to be of help,