lucene-solr-user mailing list archives

From Erick Erickson <>
Subject Re: Replication on startup takes a long time
Date Mon, 25 Sep 2017 16:26:35 GMT

OK, thanks for pointing that out, that relieves me a lot!


On Mon, Sep 25, 2017 at 1:03 AM, Emir Arnautović
<> wrote:
> Hi Erick,
> I don’t think there is a bug with searcher reopening - this is a scenario
> with a new slave:
> “But when I add a *new* slave pointing to the master…”
> So zero results are expected until replication finishes.
> Regards,
> Emir
>> On 23 Sep 2017, at 19:21, Erick Erickson <> wrote:
>> First I'd like to say that I wish more people would take the time like
>> you have to fully describe the problem and your observations, it makes
>> it soooo much nicer than having half-a-dozen back and forths! Thanks!
>> Just so it doesn't get buried in the rest of the response, I do tend
>> to go on.... I suspect you have a suggester configured. The
>> index-based suggesters read through your _entire_ index, all the
>> stored fields from all the documents and process them into an FST or
>> "sidecar" index. See:
>> If this is true,
>> they might be getting built on the slaves whenever a replication
>> happens. Hmmm, if that's the case, let us know. You can tell by removing
>> the suggester from the config and timing again. It seems like in the
>> master/slave config we should copy these down, but I don't know if that's
>> been tested.
>> If they are being built on the slaves, you might try commenting out
>> all of the buildOn.... bits on the slave configurations. Frankly I
>> don't know if building the suggester structures on the master would
>> propagate them to the slave correctly if the slave doesn't build them,
>> but it would certainly be a fat clue if it changed the load time on
>> the slaves and we could look some more at options.
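>> For illustration only, a slave-side suggester definition with the build
>> triggers switched off might look roughly like the sketch below. The
>> component name, suggester name, field, field type and lookup
>> implementation are placeholders I made up, not taken from your config;
>> buildOnStartup, buildOnCommit and buildOnOptimize are the "buildOn"
>> settings I mean. Setting them to false should have much the same effect
>> as commenting them out, but that is an assumption worth verifying.
>> ```xml
>> <searchComponent name="suggest" class="solr.SuggestComponent">
>>   <lst name="suggester">
>>     <!-- hypothetical suggester name, field and field type -->
>>     <str name="name">mySuggester</str>
>>     <str name="lookupImpl">AnalyzingInfixLookupFactory</str>
>>     <str name="dictionaryImpl">DocumentDictionaryFactory</str>
>>     <str name="field">title</str>
>>     <str name="suggestAnalyzerFieldType">text_general</str>
>>     <!-- build triggers disabled on the slave for the timing experiment -->
>>     <str name="buildOnStartup">false</str>
>>     <str name="buildOnCommit">false</str>
>>     <str name="buildOnOptimize">false</str>
>>   </lst>
>> </searchComponent>
>> ```
>> Treat this purely as a timing experiment: if the slave comes up much
>> faster with the triggers off, the suggester build is the likely culprit.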
>> Observation 1: Allocating 40G of memory for an index of only 12G seems
>> like overkill. This isn't the root of your problem, but a 12G index
>> shouldn't need anywhere near 40G of JVM heap. In fact, due to MMapDirectory being
>> used (see Uwe Schindler's blog here:
>> I'd guess you can get away with MUCH less memory, maybe as low as 8G
>> or so. The wildcard here would be the size of your caches, especially
>> your filterCache configured in solrconfig.xml. Like I mentioned, this
>> isn't the root of your replication issue, just sayin'.
>> Observation 2: Hard commits (the <autoCommit> setting) are not a very
>> expensive operation with openSearcher=false. Again this isn't the root
>> of your problem but consider removing the number of docs limitation
>> and just making it time-based, say every minute. Long blog on the
>> topic here:
>> You might be accumulating pretty large transaction logs (assuming you
>> haven't disabled them) to no good purpose. Given your observation that
>> the actual transmission of the index takes 2 minutes, this is probably
>> not something to worry about much, but is worth checking.
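>> A time-only hard commit along those lines could be sketched as follows.
>> The one-minute value (60000 ms) is just the example interval mentioned
>> above, not a tuned number, so adjust it to your indexing pattern.
>> ```xml
>> <autoCommit>
>>   <!-- no maxDocs limit: commit purely on elapsed time (milliseconds) -->
>>   <maxTime>60000</maxTime>
>>   <!-- flush segments and roll the transaction log without opening a searcher -->
>>   <openSearcher>false</openSearcher>
>> </autoCommit>
>> ```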
>> Question 1:
>> Solr should be doing nothing other than opening a new searcher, which
>> should be roughly the "autowarm" time on master plus (perhaps)
>> suggester build. Your observation that autowarming takes quite a bit
>> of time (evidenced by much shorter times when you set the counts to
>> zero) is a smoking gun that you're probably doing far too much
>> autowarming. HOWEVER, during this interval the replica should be
>> serving queries from the old searcher so something else is going on
>> here. Autowarming is actually pretty simple, perhaps this will help
>> you to keep in mind while tuning:
>> The queryResultCache and filterCache are essentially maps where the
>> key is just the text of the clause (simplifying here). So for the
>> queryResultCache the key is the entire search request. For the
>> filterCache, the key is just the "fq" clause. autowarm count in each
>> just means the number of keys that are replayed when a new searcher is
>> opened. I usually start with a pretty small number, on the order of
>> 10-20. The purpose of them is just to keep from experiencing a delay
>> when the first few searches are performed after a searcher is opened.
>> My bet: you won't notice a measurable difference when dropping the
>> autowarm counts drastically in terms of query response, but you will
>> save the startup time. I also suspect you can reduce the size of the
>> caches drastically, but don't know what you have them set to, it's a
>> guess.
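>> As a starting point, the cache entries in the <query> section of
>> solrconfig.xml might be trimmed to something like the sketch below. The
>> sizes and the autowarm count of 16 are illustrative guesses on my part
>> (I don't know your current values), so measure hit ratios before and
>> after rather than taking these numbers as a recommendation.
>> ```xml
>> <query>
>>   <!-- replay only the 16 most recent "fq" keys into each new searcher -->
>>   <filterCache class="solr.FastLRUCache"
>>                size="512"
>>                initialSize="512"
>>                autowarmCount="16"/>
>>   <!-- replay only the 16 most recent full search requests -->
>>   <queryResultCache class="solr.LRUCache"
>>                     size="512"
>>                     initialSize="512"
>>                     autowarmCount="16"/>
>> </query>
>> ```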
>> As to what's happening such that you serve queries with zero counts,
>> my best guess at this point is that you are rebuilding
>> autosuggesters..... We shouldn't be serving queries from the new
>> searcher during this interval; if confirmed, we need to raise a JIRA.
>> Question 2: see above, autosuggester?
>> Question 3a: documents should become searchable on the slave when 1>
>> all the segments are copied, 2> autowarm is completed. As above, the
>> fact that you get 0-hit responses isn't what _should_ be happening.
>> Autocommit settings are pretty irrelevant on the slave.
>> Question 3b: soft commit on the master shouldn't affect the slave at all.
>> The fact that you have 500 fields shouldn't matter that much in this
>> scenario. Again, the fact that removing your autowarm settings makes
>> such a difference indicates the counts are excessive, and I have a
>> secondary assumption that you probably have your cache settings far
>> higher than you need, but you'll have to test if you try to reduce
>> them.... BTW, I often find the 512 default setting more than ample,
>> monitor via admin UI>>core>>plugins/stats to see the hit ratio...
>> As I told you, I do go on....
>> Best,
>> Erick
>> On Sat, Sep 23, 2017 at 6:40 AM, yasoobhaider <> wrote:
>>> Hi
>>> We have set up a master-slave architecture for our Solr instance.
>>> Number of docs: 2 million
>>> Collection size: ~12GB when optimized
>>> Heap size: 40G
>>> Machine specs: 60G, 8 cores
>>> We are using Solr 6.2.1.
>>> Autocommit Configuration:
>>> <autoCommit>
>>>      <maxDocs>40000</maxDocs>
>>>      <maxTime>900000</maxTime>
>>>      <openSearcher>false</openSearcher>
>>> </autoCommit>
>>> <autoSoftCommit>
>>>      <maxTime>${solr.autoSoftCommit.maxTime:3600000}</maxTime>
>>> </autoSoftCommit>
>>> I have set maxDocs to 40k because we do a heavy weekly indexing, and I
>>> didn't want a lot of commits happening too fast.
>>> Indexing runs smoothly on master. But when I add a new slave pointing to the
>>> master, it takes about 20 minutes for the slave to become queryable.
>>> There are two parts to this latency. First, it takes approximately 13
>>> minutes for the generation of the slave to be the same as the master's. Then it takes
>>> another 7 minutes for the instance to become queryable (it returns 0 hits in
>>> these 7 minutes).
>>> I checked the logs and the collection is downloaded within two minutes.
>>> After that, there is nothing in the logs for next few minutes, even with
>>> LoggingInfoStream set to 'ALL'.
>>> Question 1. What happens after all the files have been downloaded on slave
>>> from master? What is Solr doing internally that the generation sync up with
>>> master takes so long? Whatever it is doing, should it take that long? (~5
>>> minutes).
>>> After the generation sync up happens, it takes another 7 minutes to start
>>> giving results. I set the autowarm count in all caches to 0, which brought
>>> it down to 3 minutes.
>>> Question 2. What is happening here in the 3 minutes? Can this also be
>>> optimized?
>>> And I wanted to ask another unrelated question regarding when a slave becomes
>>> searchable. I understand that documents on master become searchable if a
>>> hard commit happens with openSearcher set to true, or when a soft commit
>>> happens. But when do documents become searchable on a slave?
>>> Question 3a. When do documents become searchable on a slave? As soon as a
>>> segment is copied over from master? Does softcommit make any sense on a
>>> slave, as we are not indexing anything? Does autocommit with opensearcher
>>> true affect slave in any way?
>>> Question 3b. Does a softcommit on master affect slave in any way? (I only
>>> have commit and startup options in my replicateAfter field in solrconfig)
>>> Would appreciate any help.
>>> PS: One of my colleagues said that the latency may be because our schema.xml
>>> is huge (~500 fields). Question 4. Could that be a reason?
>>> Thanks
>>> Yasoob Haider
