lucene-solr-user mailing list archives

From Erick Erickson <erickerick...@gmail.com>
Subject Re: SolrCloud loadbalancing, replication, and failover
Date Sun, 21 Apr 2013 14:25:42 GMT
One note to add. There's been lots of discussion here about
"index size", which is a slippery concept. To wit:
Look at your index directory, specifically the *.fdt and *.fdx files.
That's where the verbatim copy of your data is held, i.e. whatever
you specify 'stored="true"' for. That data is almost totally irrelevant
to the memory needed for searching; it's only accessed after the final
set of documents has been assembled and the fl list is being populated
for them.
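
If you want to see that split on your own index, a quick-and-dirty check
like the sketch below works (the default path is just a placeholder for
wherever your index actually lives):

import java.io.File;

public class StoredVsRest {
    public static void main(String[] args) {
        // Point this at your core's index directory (placeholder path).
        File indexDir = new File(args.length > 0 ? args[0]
                : "/path/to/solr/data/index");
        long stored = 0, rest = 0;
        File[] files = indexDir.listFiles();
        if (files == null) {
            System.err.println("Not a directory: " + indexDir);
            return;
        }
        for (File f : files) {
            String name = f.getName();
            if (name.endsWith(".fdt") || name.endsWith(".fdx")) {
                stored += f.length();   // stored-field data and its index
            } else {
                rest += f.length();     // terms, postings, norms, etc.
            }
        }
        System.out.printf("stored fields: %,d bytes, everything else: %,d bytes%n",
                stored, rest);
    }
}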

So, an index with 39G of stored data and 1G for the rest has much
different memory requirements than one with 1G of stored data and 39G
for the rest, where "the rest" == "the searchable part that can be
held in RAM".

Then there's the fact that the actual data in the index doesn't
include the dynamic structures required for navigating that data, so just
because your non-stored data consumes 10G on your disk
doesn't mean it'll actually all fit in 10G of memory.

Quick example. Each filter cache entry consists of a key (the filter
query itself) plus a bitset of maxDoc/8 bytes. So an index with 64M docs
requires 8M bytes per entry (ignoring some overhead). Not bad so far. But
now I keep issuing unfortunate filter queries that use NOW, so each
one requires an additional 8M of memory. And this is a static index,
so we never open new readers. And I've configured my filter cache to hold
1,000,000 entries (I have seen this). It works fine in my test environment,
where I'm bouncing the server pretty frequently, but when I put it in my
production environment it starts blowing up with OOM errors after
running for a while.
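
To put rough numbers on that, here's a back-of-the-envelope sketch (not
Solr code; the 64M docs and 1,000,000 entries are just the figures from
the example above):

public class FilterCacheMath {
    public static void main(String[] args) {
        long maxDoc = 64000000L;          // docs in the index
        long bytesPerEntry = maxDoc / 8;  // one bit per doc in each cached filter
        long entries = 1000000L;          // the (over-)configured filterCache size

        System.out.printf("per cached filter: %,d bytes%n", bytesPerEntry); // ~8 MB
        long worstCase = bytesPerEntry * entries;
        System.out.printf("worst case for %,d distinct entries: %,d bytes (~%,d GB)%n",
                entries, worstCase, worstCase / (1024L * 1024 * 1024));     // ~7,450 GB
    }
}

Which is why the never-reused NOW filters are such a problem: every one of
them is a brand-new 8M entry that sits there until the searcher is reopened.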

So try. Measure. Rinse, Repeat <G>

Best
Erick

On Fri, Apr 19, 2013 at 10:33 PM, David Parks <davidparks21@yahoo.com> wrote:
> Again, thank you for this incredible information; I feel on much firmer
> footing now. I'm going to test distributing this across 10 servers,
> borrowing a Hadoop cluster temporarily, and see how it does with enough
> memory to have the whole index cached. But I'm thinking that we'll try the
> SSD route as our index will probably rest in the 1/2 terabyte range
> eventually; there's still a lot of active development.
>
> I guess the RAM disk would work in our case also, as we only index in
> batches, and eventually I'd like to do that off of Solr and just update the
> index (I'm presuming this is doable in solr cloud, but I haven't put it to
> task yet). If I could purpose Hadoop to index the shards, that would be
> ideal, though I haven't quite figured out how to go about it yet.
>
> David
>
>
> -----Original Message-----
> From: Shawn Heisey [mailto:solr@elyograg.org]
> Sent: Friday, April 19, 2013 9:42 PM
> To: solr-user@lucene.apache.org
> Subject: Re: SolrCloud loadbalancing, replication, and failover
>
> On 4/19/2013 3:48 AM, David Parks wrote:
>> The Physical Memory is 90% utilized (21.18GB of 23.54GB). Solr has
>> dark grey allocation of 602MB, and light grey of an additional 108MB,
>> for a JVM total of 710MB allocated. If I understand correctly, Solr
>> memory utilization is
>> *not* for caching (unless I configured document caches or some of the
>> other cache options in Solr, which don't seem to apply in this case,
>> and I haven't altered from their defaults).
>
> Right.  Solr does have caches, but they serve specific purposes.  The OS is
> much better at general large-scale caching than Solr is.  Solr caches get
> cleared (and possibly re-warmed) whenever you issue a commit on your index
> that makes new documents visible.
>
>> So assuming this box was dedicated to 1 solr instance/shard. What JVM
>> heap should I set? Does that matter? 24GB JVM heap? Or keep it lower
>> and ensure the OS cache has plenty of room to operate? (this is an
>> Ubuntu 12.10 server instance).
>
> The JVM heap to use is highly dependent on the nature of your queries, the
> number of documents, the number of unique terms, etc.  The best thing to do
> is to try it out with a relatively large heap and see how much memory actually
> gets used inside the JVM.  The jvisualvm and jconsole tools will give you
> nice graphs of JVM memory usage.  The jstat program will give you raw
> numbers on the commandline that you'll need to add to get the full picture.
> Due to the garbage collection model that Java uses, what you'll see is a
> sawtooth pattern - memory usage goes up to max heap, then garbage collection
> reduces it to the actual memory used.
>  Generally speaking, you want to have more heap available than the "low"
> point of that sawtooth pattern.  If that low point is around 3GB when you
> are hitting your index hard with queries and updates, then you would want to
> give Solr a heap of 4 to 6 GB.
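>
> For example (the pid is just a placeholder for your Solr process id, and
> 5000 is a five-second sample interval):
>
>     jstat -gc <solr-pid> 5000
>
> Each sample prints the heap space sizes and usage (in KB) plus GC counts and
> times; watching the OU (old generation used) column right after each full GC
> is a reasonable way to spot the "low" point of that sawtooth.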
>
>> Would I be wise to just put the index on a RAM disk and guarantee
>> performance?  Assuming I installed sufficient RAM?
>
> A RAM disk is a very good way to guarantee performance - but RAM disks are
> ephemeral.  Reboot or have an OS crash and it's gone, you'll have to
> reindex.  Also remember that you actually need at *least* twice the size of
> your index so that Solr (Lucene) has enough room to do merges, and the
> worst-case scenario is *three* times the index size.  Merging happens during
> normal indexing, not just when you optimize.  If you have enough RAM for
> three times your index size and it takes less than an hour or two to rebuild
> the index, then a RAM disk might be a viable way to go.  I suspect that this
> won't work for you.
>
> Thanks,
> Shawn
>
