lucene-solr-user mailing list archives

From Jeremy Hoy <>
Subject Partial replication blocks subsequent requests when using solrcloud and master/slave replication
Date Tue, 22 Nov 2016 18:02:30 GMT
Hi All,

We're running a fairly non-standard Solr configuration.  We ingest into named shards in master
cores and then replicate out to slaves running SolrCloud.  In effect we are using SolrCloud
only to manage the config files and, more importantly, to look after the cluster state.  Our
corpus and search workload are such that this makes sense: it reduces the need to query every
shard for each search, since the majority of queries contain values that allow us to target
the search at the shards holding the appropriate documents.  It also isolates the searching
slaves from the costs of indexing (we index fairly infrequently, but in fairly large volumes).
I'm happy to expand on this if anyone is interested, or to take suggestions as to how we
might do things better.

We've been running 4.6.0 for the past 3 years or so, but have recently upgraded to 5.5.2;
we'll likely be upgrading to 6.3.0 shortly.  However, we hit a problem when running 5.5.2,
which we also reproduced in 6.2.1 and 6.3.0.  When a partial replication starts, it will
usually block all subsequent requests to Solr while replication continues in the background.
While in this blocked state we took thread dumps using VisualVM; we see this when running

"explicit-fetchindex-cmd" - Thread t@71
   java.lang.Thread.State: RUNNABLE
                at Method)
                at org.apache.solr.handler.IndexFetcher$FileFetcher.fetchPackets(
                at org.apache.solr.handler.IndexFetcher$FileFetcher.fetchFile(
                at org.apache.solr.handler.IndexFetcher.downloadIndexFiles(
                at org.apache.solr.handler.IndexFetcher.fetchLatestIndex(
                at org.apache.solr.handler.IndexFetcher.fetchLatestIndex(
                at org.apache.solr.handler.ReplicationHandler.doFetch(
                at org.apache.solr.handler.ReplicationHandler.lambda$handleRequestBody$0(
                at org.apache.solr.handler.ReplicationHandler$$Lambda$82/

   Locked ownable synchronizers:
                - locked <4c18799d> (a java.util.concurrent.locks.ReentrantLock$NonfairSync)

                - locked <64a00f> (a java.util.concurrent.locks.ReentrantReadWriteLock$NonfairSync)


"qtp1873653341-61" - Thread t@61
   java.lang.Thread.State: TIMED_WAITING
                at sun.misc.Unsafe.park(Native Method)
                - waiting to lock <64a00f> (a java.util.concurrent.locks.ReentrantReadWriteLock$NonfairSync) owned by "explicit-fetchindex-cmd" t@71
                at java.util.concurrent.locks.LockSupport.parkNanos(
                at java.util.concurrent.locks.AbstractQueuedSynchronizer.doAcquireSharedNanos(
                at java.util.concurrent.locks.AbstractQueuedSynchronizer.tryAcquireSharedNanos(
                at java.util.concurrent.locks.ReentrantReadWriteLock$ReadLock.tryLock(
                at org.apache.solr.update.DefaultSolrCoreState.lock(
                at org.apache.solr.update.DefaultSolrCoreState.getIndexWriter(
                at org.apache.solr.core.SolrCore.openNewSearcher(
                at org.apache.solr.core.SolrCore.getSearcher(
                at org.apache.solr.core.SolrCore.getSearcher(
                at org.apache.solr.core.SolrCore.getSearcher(

The cause of the problem seems to be that in IndexFetcher.fetchLatestIndex, when running
as SolrCloud, the searcher is shut down prior to cleaning up the existing segment files and
downloading the new ones.

6.3.0 - Lines(407-409)
                if (solrCore.getCoreDescriptor().getCoreContainer().isZooKeeperAware()) {

Subsequently, solrCore.getUpdateHandler().newIndexWriter(true) takes a write lock on the index
writer, which is not released until the openIndexWriter call after the new files have been
copied.  Because openNewSearcher needs to take a read lock on the index writer, and cannot take
it while the write lock is in place, all subsequent requests are blocked.
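
The locking interaction described above can be illustrated outside Solr with plain
java.util.concurrent classes.  The following is only a minimal sketch, not Solr code: the class
and method names are hypothetical, and the writer thread merely stands in for the replication
thread holding the write lock while files are copied.  It shows a timed read-lock attempt
failing while the write lock is held, and succeeding once it is released.

```java
import java.util.concurrent.CountDownLatch;
import java.util.concurrent.TimeUnit;
import java.util.concurrent.locks.ReentrantReadWriteLock;

// Hypothetical stand-alone illustration; none of these names come from Solr.
class WriteLockBlocksReaders {

    /**
     * Returns {acquiredWhileWriteLockHeld, acquiredAfterRelease} for a timed
     * read-lock attempt made from the calling thread.
     */
    static boolean[] demo(long timeoutMs) {
        ReentrantReadWriteLock lock = new ReentrantReadWriteLock();
        CountDownLatch writeLockHeld = new CountDownLatch(1);
        CountDownLatch releaseWriter = new CountDownLatch(1);

        // Stand-in for the replication thread: takes the write lock and keeps
        // it until told to release (in Solr, until the index writer is reopened).
        Thread writer = new Thread(() -> {
            lock.writeLock().lock();
            try {
                writeLockHeld.countDown();
                releaseWriter.await();
            } catch (InterruptedException e) {
                Thread.currentThread().interrupt();
            } finally {
                lock.writeLock().unlock();
            }
        }, "writer-stand-in");

        try {
            writer.start();
            writeLockHeld.await();

            // Stand-in for a request thread: timed attempt on the read lock.
            boolean whileHeld = lock.readLock().tryLock(timeoutMs, TimeUnit.MILLISECONDS);
            if (whileHeld) {
                lock.readLock().unlock();
            }

            // Let the writer finish, then try again.
            releaseWriter.countDown();
            writer.join();
            boolean afterRelease = lock.readLock().tryLock(timeoutMs, TimeUnit.MILLISECONDS);
            if (afterRelease) {
                lock.readLock().unlock();
            }
            return new boolean[] {whileHeld, afterRelease};
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
            throw new IllegalStateException("interrupted during demo", e);
        }
    }

    public static void main(String[] args) {
        boolean[] result = demo(200);
        System.out.println("read lock acquired while write lock held: " + result[0]);
        System.out.println("read lock acquired after release: " + result[1]);
    }
}
```

While the reader is parked in the timed tryLock, a thread dump would show it as TIMED_WAITING
on the ReentrantReadWriteLock sync object, much like the qtp thread in the dump above.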

To test this we queued up a load of search requests, then manually triggered replication,
reasoning that a new searcher might be created before the write lock is taken.  On a test
instance, manually triggering replication on its own would almost always result in all
subsequent requests being blocked, but when we queued up search requests and ran them while
triggering replication, we never saw the blocking behaviour.
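
For anyone wanting to repeat the manual trigger, replication can be started through the
replication handler's fetchindex command.  The sketch below only builds that request URL and,
if a base URL and core name are passed on the command line, sends it.  The class name, host,
port and core name are assumptions for illustration, not values from our setup.

```java
import java.io.IOException;
import java.net.HttpURLConnection;
import java.net.URL;

// Hypothetical helper; host, port and core name below are placeholders.
class FetchIndexTrigger {

    // Builds the replication handler URL for the fetchindex command.
    static String fetchIndexUrl(String solrBaseUrl, String coreName) {
        String base = solrBaseUrl.endsWith("/")
                ? solrBaseUrl.substring(0, solrBaseUrl.length() - 1)
                : solrBaseUrl;
        return base + "/" + coreName + "/replication?command=fetchindex";
    }

    // Sends the request and returns the HTTP status code.
    static int trigger(String url) throws IOException {
        HttpURLConnection conn = (HttpURLConnection) new URL(url).openConnection();
        conn.setRequestMethod("GET");
        conn.setConnectTimeout(5000);
        conn.setReadTimeout(5000);
        try {
            return conn.getResponseCode();
        } finally {
            conn.disconnect();
        }
    }

    public static void main(String[] args) throws IOException {
        if (args.length == 2) {
            // Usage: java FetchIndexTrigger <solr base URL> <core name>
            System.out.println("HTTP " + trigger(fetchIndexUrl(args[0], args[1])));
        } else {
            // Without arguments, just show the URL that would be requested.
            System.out.println(fetchIndexUrl("http://localhost:8983/solr", "mycore"));
        }
    }
}
```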

We then patched Solr locally to comment out the closeSearcher call, on the basis that while
we are running SolrCloud, if the core is also running as a slave there is no need to close
the searcher.  This seems to work fine; replication works, nothing hangs.

This seems like a bug to me, but we could find no other reports of the problem.

So my questions are:  Is it worth raising an issue in JIRA and working up a proper patch?
Or is our setup so unusual that there is little value in this?  Or am I missing something else?


