lucene-issues mailing list archives

From "Andrzej Bialecki (Jira)" <>
Subject [jira] [Assigned] (SOLR-13896) Paused a non-leader node can cause recovery on other nodes
Date Wed, 04 Dec 2019 11:20:00 GMT


Andrzej Bialecki reassigned SOLR-13896:

    Assignee: Andrzej Bialecki  (was: Cao Manh Dat)

> Paused a non-leader node can cause recovery on other nodes
> ----------------------------------------------------------
>                 Key: SOLR-13896
>                 URL:
>             Project: Solr
>          Issue Type: Bug
>      Security Level: Public(Default Security Level. Issues are Public) 
>            Reporter: Cao Manh Dat
>            Assignee: Andrzej Bialecki
>            Priority: Major
>         Attachments: SOLR-13896.patch
> All stack traces below are based on the 7.5 branch. The problem still exists on the 8.x branches.
Here is the scenario: we have 3 replicas
>  * L: the leader replica
>  * R: the normal replica
>  * P: the poor one which was paused then resumed
> L is sending data to R and P. During that time P gets paused. Here is what happens on L:
>  * Thread 1 is stuck at this line of StreamingSolrClients
> {code:java}
> public synchronized void blockUntilFinished() {
>   for (ConcurrentUpdateSolrClient client : solrClients.values()) {
>     client.blockUntilFinished();
>   }
> } {code}
> Basically, this thread waits for the other sender threads to finish. Let's assume
that this is the content of *solrClients.values : [clientToP, clientToR]*
>  * Thread 2 corresponds to *clientToP*. Since P is paused, it does not close the connection;
it just keeps the connection open and never returns any data to L. So this thread is stuck with
this stack trace, waiting for response data from *P* (with timeout=600000ms). Therefore
Thread 1 is stuck at *clientToP.blockUntilFinished()*
> {code:java}
>     java.lang.Thread.State: RUNNABLE
>         at Method)
>         at
>         at
>         at
>         at
>         at org.apache.http.impl.conn.DefaultHttpResponseParser.parseHead(
>         at org.apache.http.impl.conn.DefaultHttpResponseParser.parseHead(
>         at
>         at
>         at org.apache.http.impl.conn.CPoolProxy.receiveResponseHeader(
>         at org.apache.http.protocol.HttpRequestExecutor.doReceiveResponse(
>         at org.apache.http.protocol.HttpRequestExecutor.execute(
>         at org.apache.solr.util.stats.InstrumentedHttpRequestExecutor.execute(
>         at org.apache.http.impl.execchain.MainClientExec.execute(
>         at org.apache.http.impl.execchain.ProtocolExec.execute(
>         at org.apache.http.impl.execchain.RetryExec.execute(
>         at org.apache.http.impl.execchain.RedirectExec.execute(
>         at org.apache.http.impl.client.InternalHttpClient.doExecute(
>         at org.apache.http.impl.client.CloseableHttpClient.execute(
>         at org.apache.http.impl.client.CloseableHttpClient.execute(
>         at org.apache.solr.client.solrj.impl.ConcurrentUpdateSolrClient$Runner.sendUpdateStream(
> {code}
>  * Since *clientToR* is the second element of the array, *clientToR.blockUntilFinished()* is never called (or
at least not until after the timeout). This causes Thread 3 to be stuck at this line
> {code:java}
> upd = queue.poll(pollQueueTime, TimeUnit.MILLISECONDS); {code}
> Note that pollQueueTime == Integer.MAX_VALUE (this is set by StreamingSolrClients). Therefore,
unless clientToR.blockUntilFinished() is called (which interrupts Thread 3), Thread 3 will
be stuck at the above line forever.
>  * Because *clientToR* is sending data to R but never closes the output stream, R
just keeps waiting (until its idle timeout expires 120000ms later), which then leads to this exception
> {code:java}
> o.a.s.h.RequestHandlerBase java.util.concurrent.TimeoutException: Idle timeout expired: 120003/120000 ms
>     at org.eclipse.jetty.server.HttpInput$ErrorState.noContent(
>     at
>     at
>     at
>     at org.apache.solr.common.util.FastInputStream.readWrappedStream(
>     at org.apache.solr.common.util.FastInputStream.refill(
>     at org.apache.solr.common.util.FastInputStream.peek(
>     at org.apache.solr.handler.loader.JavabinLoader.parseAndLoadDocs(
>     at org.apache.solr.handler.loader.JavabinLoader.load(
> {code}
>  * After that, the leader puts all replicas, including the non-paused one, into recovery
> This is a very bad outcome, and it is not just a theoretical problem, since some cloud platforms
can freeze a node while doing maintenance.
> Thanks to [~ab] and [~shalin] for helping me debug this problem.
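
The blocking chain described above can be reproduced in miniature. The following is only a sketch under stated assumptions, not Solr code: `PausedReplicaDemo`, `FakeClient`, and `runSequential` are hypothetical names, the "paused" replica is simulated with a short sleep instead of the 600000ms socket timeout, and "closing the output stream" is modelled as recording a timestamp. It shows that when the blockUntilFinished() calls are made one after another, the sender for the healthy replica is not interrupted (and so cannot finish its request) until the sender for the paused replica has returned.

```java
import java.util.Arrays;
import java.util.concurrent.BlockingQueue;
import java.util.concurrent.LinkedBlockingQueue;
import java.util.concurrent.TimeUnit;

class PausedReplicaDemo {

    /** Hypothetical stand-in for one per-replica sender; NOT Solr's ConcurrentUpdateSolrClient. */
    static class FakeClient {
        final long responseDelayMs;                       // how long the replica takes to answer
        final BlockingQueue<String> queue = new LinkedBlockingQueue<>();
        volatile long streamClosedAtNanos = -1;           // when the request body was finished
        private final Thread runner;

        FakeClient(String name, long responseDelayMs) {
            this.responseDelayMs = responseDelayMs;
            this.runner = new Thread(this::run, "runner-" + name);
            this.runner.start();
        }

        private void run() {
            try {
                // Same shape as the poll quoted in the report: an effectively
                // unbounded wait that only an interrupt ends once the queue is empty.
                while (true) {
                    queue.poll(Integer.MAX_VALUE, TimeUnit.MILLISECONDS);
                }
            } catch (InterruptedException e) {
                // Interrupted by blockUntilFinished(): no more updates are coming.
            }
            streamClosedAtNanos = System.nanoTime();      // models closing the output stream
            try {
                Thread.sleep(responseDelayMs);            // models waiting for the response
            } catch (InterruptedException ignored) {
                Thread.currentThread().interrupt();
            }
        }

        void blockUntilFinished() throws InterruptedException {
            runner.interrupt();                           // wake the runner out of poll()
            runner.join();                                // wait until the response "arrives"
        }
    }

    /** Returns {P, R}: ms after the start at which each client's stream was closed. */
    static long[] runSequential(long pausedDelayMs) {
        FakeClient clientToP = new FakeClient("P", pausedDelayMs); // paused replica
        FakeClient clientToR = new FakeClient("R", 0);             // healthy replica
        long start = System.nanoTime();
        try {
            // Mirrors the loop in blockUntilFinished(): P first, then R.
            for (FakeClient client : Arrays.asList(clientToP, clientToR)) {
                client.blockUntilFinished();
            }
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
            throw new IllegalStateException(e);
        }
        return new long[] {
            TimeUnit.NANOSECONDS.toMillis(clientToP.streamClosedAtNanos - start),
            TimeUnit.NANOSECONDS.toMillis(clientToR.streamClosedAtNanos - start)
        };
    }

    public static void main(String[] args) {
        long[] closedAt = runSequential(300);
        System.out.println("P stream closed after " + closedAt[0] + " ms");
        System.out.println("R stream closed after " + closedAt[1] + " ms");
    }
}
```

In this model the stream to R is closed only after the simulated wait on P has elapsed. With the real timeouts from the report (600000ms on L's side versus a 120000ms idle timeout on R), R would give up long before L gets around to finishing its request.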

This message was sent by Atlassian Jira
