lucene-issues mailing list archives

From "Andrzej Bialecki (Jira)" <>
Subject [jira] [Updated] (SOLR-13975) ConcurrentUpdateSolrClient connection stall prevention
Date Wed, 04 Dec 2019 20:53:00 GMT


Andrzej Bialecki updated SOLR-13975:
    Attachment: SOLR-13975.patch

> ConcurrentUpdateSolrClient connection stall prevention
> ------------------------------------------------------
>                 Key: SOLR-13975
>                 URL:
>             Project: Solr
>          Issue Type: Bug
>      Security Level: Public(Default Security Level. Issues are Public) 
>    Affects Versions: 8.3, 8.4
>            Reporter: Andrzej Bialecki
>            Assignee: Andrzej Bialecki
>            Priority: Major
>             Fix For: 8.4
>         Attachments: SOLR-13975.patch, SOLR-13975.patch
> When a Solr process that hosts replicas of a collection is suspended - that is, the
OS process is suspended using e.g. {{kill -STOP <pid>}} - a long stall may occur in CUSC
({{ConcurrentUpdateSolrClient}}) until a socket timeout is reached.
> During this stall, updates from the leader are not forwarded to any replica, even though
other replicas are still active and can receive updates. If the sender uses CUSC (e.g. via
{{CloudSolrClient}}) then it also becomes stalled, because the leader stops processing updates.
> This situation is caused by several issues:
> * when a process is suspended its sockets remain open - so there is no immediate disconnect,
as there would be if the process had died, but the process becomes unresponsive. Eventually a
socket timeout will be reached (distribUpdateSoTimeout) - but in the default version of
{{solr.xml}} this is set to 10 min. During this time all indexing to that shard will be stuck.
> * there are several infinite {{for}} loops in CUSC (e.g. in {{blockUntilFinished}}, {{waitForEmptyQueue}}
and even in {{request}}), which rely on the call either succeeding relatively quickly or
throwing an exception. However, in this situation neither happens quickly - the call is
stuck waiting for the remote end until soTimeout expires.
> This issue proposes adding stall prevention logic that, based on the progress of the
queue processing, would break out of these infinite loops long before the socket timeout occurs.
> This is a follow-up to SOLR-13896.
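> The progress-based check described above could be sketched roughly as follows. This is
only an illustration and not the code from the attached patch: the class name {{StallGuard}},
its {{check}} method, the injectable nanosecond clock and the choice of
{{IllegalStateException}} are all assumptions made for the example. The idea is that each
iteration of a waiting loop reports how many queue items have been processed so far, and the
guard throws once that number has not changed for longer than a configured limit while the
queue is still non-empty.

```java
import java.util.concurrent.TimeUnit;
import java.util.function.LongSupplier;

/** Hypothetical helper for illustration; not part of Solr or of the attached patch. */
public class StallGuard {
    private final long maxStallNanos;
    private final LongSupplier nanoClock; // injectable time source, e.g. System::nanoTime
    private long lastProcessed = -1;      // processed count seen on the previous check
    private long lastProgressNanos;       // time at which progress was last observed

    public StallGuard(long maxStallMillis, LongSupplier nanoClock) {
        this.maxStallNanos = TimeUnit.MILLISECONDS.toNanos(maxStallMillis);
        this.nanoClock = nanoClock;
        this.lastProgressNanos = nanoClock.getAsLong();
    }

    /**
     * Call once per iteration of a waiting loop.
     *
     * @param processedCount total number of items taken from the queue so far
     * @param queueEmpty     true if nothing is currently waiting in the queue
     * @throws IllegalStateException if the queue is non-empty and the processed
     *         count has not changed for longer than the configured limit
     */
    public void check(long processedCount, boolean queueEmpty) {
        long now = nanoClock.getAsLong();
        if (queueEmpty || processedCount != lastProcessed) {
            // Either there is nothing to do or something was processed: not a stall.
            lastProcessed = processedCount;
            lastProgressNanos = now;
            return;
        }
        long stalledNanos = now - lastProgressNanos;
        if (stalledNanos > maxStallNanos) {
            throw new IllegalStateException("No queue progress for "
                + TimeUnit.NANOSECONDS.toMillis(stalledNanos) + " ms");
        }
    }
}
```

A waiting loop would call {{check}} on every pass, so that a suspended remote process causes
an exception after the configured limit instead of blocking until soTimeout expires. The
limit would have to be chosen well below the socket timeout, but high enough that a slow
yet healthy replica is not mistaken for a stalled one.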
