lucene-solr-user mailing list archives

From Peter Sturge <peter.stu...@googlemail.com>
Subject Scaling indexes with high document count
Date Wed, 10 Mar 2010 08:38:11 GMT
Hello,

I wonder if anyone might have some insight/advice on index scaling for
deployments with a high document count relative to index size...

The nature of the incoming data is a steady stream of, on average, 4GB per
day. Importantly, the number of documents inserted during this time is
~7million (i.e. lots of small entries).
The plan is to partition shards on a per month basis, and hold 6 months of
data.

On the search side, this would mean 6 shards (as replicas), each holding
~120GB with ~210million document entries.
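As a sanity check on the sizing above, the per-shard and total figures follow directly from the daily rates (a back-of-the-envelope sketch; the 30-day month is an assumption):

```python
# Back-of-the-envelope capacity check for the figures above.
GB_PER_DAY = 4
DOCS_PER_DAY = 7_000_000
DAYS_PER_MONTH = 30          # assumption: ~30-day months
MONTHS_RETAINED = 6

shard_size_gb = GB_PER_DAY * DAYS_PER_MONTH    # ~120 GB per monthly shard
shard_docs = DOCS_PER_DAY * DAYS_PER_MONTH     # ~210 million docs per shard
total_docs = shard_docs * MONTHS_RETAINED      # ~1.26 billion docs searchable

print(shard_size_gb, shard_docs, total_docs)   # → 120 210000000 1260000000
```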
The plan is to deploy 2 indexing cores, of which one is active at a
time. When the active core gets 'full' (e.g. a month has passed), the second
core takes over live indexing while the first completes its replication to
its searcher(s); it's then cleared, ready for the next time period. Each time
there is a 'switch', the next available replica is cleared and told to
replicate from the newly active indexing core. After 6 months, the first
replica is re-used, and so on...
This type of layout allows indexing to carry on pretty much uninterrupted,
and makes it relatively easy to manage replicas separately from the indexers
(e.g. add replicas to store, say, 9 months, backup, forward etc.).
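The rotation described above boils down to two simple round-robin counters - which of the two indexing cores is live, and which replica gets cleared next. A minimal sketch (hypothetical helper functions, not Solr API; names are made up):

```python
def active_indexing_core(switch_count: int) -> int:
    """Which of the two indexing cores is live; they alternate on each switch."""
    return switch_count % 2

def replica_to_clear(switch_count: int, pool_size: int = 6) -> int:
    """Index of the replica to clear and re-point at the newly active
    indexing core, wrapping around once pool_size months have passed."""
    return switch_count % pool_size
```

For example, at the 6th switch the rotation wraps: `replica_to_clear(6)` returns 0, so the first replica is cleared and re-used, matching the 6-month retention window.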

As searching would always be performed on the replicas, the indexing cores
wouldn't be tuned with much autowarming/read cache, but would have loads of
'maxdocs' cache. The searchers would be the other way 'round - lots of
filter/fieldvalue cache. Please correct me if I'm wrong about these. (BTW,
client searches use faceting in a big way.)
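For the searcher-side caches, the intent above might translate into a solrconfig.xml fragment roughly like the following (illustrative sizes only - real values would need load testing against the actual facet cardinalities):

```xml
<!-- Searcher replica: generous filter/query caches with real autowarming.
     Sizes are illustrative assumptions, not recommendations. -->
<filterCache      class="solr.FastLRUCache" size="16384" initialSize="4096" autowarmCount="2048"/>
<queryResultCache class="solr.LRUCache"     size="8192"  initialSize="2048" autowarmCount="1024"/>
<documentCache    class="solr.LRUCache"     size="16384" initialSize="4096" autowarmCount="0"/>
```

On the indexing cores these would be shrunk or left at defaults, with autowarmCount="0", since those cores never serve queries.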

The 120GB disk footprint is perfectly reasonable. Where I could use some
advice is searching over potentially 1.3 billion document entries - each with
up to 30-80 facets (+ potentially lots of unique values), plus date faceting
and range queries - while keeping search performance up.
Is this a case of simply throwing enough tin at the problem to handle the
caching/faceting/distributed searches?
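For reference, a distributed faceted search across the six monthly shards would be built something like this. The host names and field names are hypothetical; the `shards`, `facet.field`, and `facet.date.*` parameter names are standard Solr request parameters:

```python
from urllib.parse import urlencode

# Hypothetical shard hosts - only the parameter names are real Solr.
shards = ",".join(f"host{i}:8983/solr/shard{i}" for i in range(1, 7))

params = [
    ("q", "*:*"),
    ("shards", shards),                       # fan the query out to all 6 replicas
    ("facet", "true"),
    ("facet.field", "severity"),              # hypothetical facet field
    ("facet.field", "source_host"),           # hypothetical facet field
    ("facet.limit", "20"),
    ("facet.date", "timestamp"),              # Solr 1.4-style date faceting
    ("facet.date.start", "NOW/MONTH-6MONTHS"),
    ("facet.date.end", "NOW"),
    ("facet.date.gap", "+1DAY"),
    ("rows", "10"),
]

query = "/select?" + urlencode(params)
```

Every such request hits all six shards, so per-shard facet performance (and un-inverted field memory for high-cardinality facet fields) is what ultimately bounds the whole search tier.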

What advice would you give to get the best performance out of such a
scenario?
Any experiences/insight etc. is greatly appreciated.

Thanks,
Peter

BTW: Many thanks, Yonik and Lucid for your excellent Mastering Solr webinar
- really useful and highly informative!
