lucene-dev mailing list archives

From "Shalin Shekhar Mangar (JIRA)" <j...@apache.org>
Subject [jira] [Updated] (SOLR-10914) RecoveryStrategy's sendPrepRecoveryCmd can get stuck for 5 minutes if leader is unloaded
Date Mon, 03 Jul 2017 13:38:00 GMT

     [ https://issues.apache.org/jira/browse/SOLR-10914?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Shalin Shekhar Mangar updated SOLR-10914:
-----------------------------------------
    Attachment: SOLR-10914.patch

Fixed an incorrect code comment in the test from the previous patch.

> RecoveryStrategy's sendPrepRecoveryCmd can get stuck for 5 minutes if leader is unloaded
> ----------------------------------------------------------------------------------------
>
>                 Key: SOLR-10914
>                 URL: https://issues.apache.org/jira/browse/SOLR-10914
>             Project: Solr
>          Issue Type: Bug
>      Security Level: Public(Default Security Level. Issues are Public) 
>          Components: SolrCloud
>    Affects Versions: 6.4, 6.5, 6.6
>            Reporter: Shalin Shekhar Mangar
>            Assignee: Shalin Shekhar Mangar
>             Fix For: master (7.0)
>
>         Attachments: SOLR-10914.patch, SOLR-10914.patch, SOLR-10914.patch
>
>
> tl;dr: a recovering replica is stuck for 5 minutes in the prep recovery request if the
> leader core is unloaded before the prep recovery request is made.
> SOLR-9716 changed sendPrepRecoveryCmd to retry on read timeouts (earlier it had no
> connection/read timeout at all), but the fix has caused another problem. Say:
> # A replica starts up (or is newly created) and goes into recovery.
> # The replica finds that leader=X.
> # The core X is unloaded, but the node that used to host X is still running and taking
> requests.
> # The replica calls sendPrepRecoveryCmd to X.
> At this point, the node that used to host X receives the prep recovery command, finds that
> the core X does not exist, and keeps checking again in a sleep-loop until a timeout happens.
> I am not sure why the prep recovery core admin command needs to continue waiting if the
> local core does not exist. The default timeout here is usually longer than 10 seconds.
> On the recovering replica's side, the prep recovery request has a connection/read timeout of
> only 10s, so the request always times out and is retried for up to 5 minutes. Only then does
> the recovery attempt fail, after which it may be restarted with the right leader URL.
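
The timing described above can be illustrated with a small toy model. This is only a sketch and not Solr code: the class and method names are hypothetical, the 30-second leader-side wait is an assumed value (the description only says it is usually longer than 10 seconds), and the fail-fast branch models one possible remedy rather than what the attached patch actually does. Only the 10-second read timeout and the 5-minute retry window come from the description.

```java
/**
 * Toy model of the retry timing described in SOLR-10914.
 * Not Solr code: every name here is hypothetical and used only for illustration.
 */
class PrepRecoveryRetrySketch {

    /** Result of one simulated recovery attempt. */
    static final class Outcome {
        final int requestsSent;        // prep recovery requests the replica issued
        final long secondsElapsed;     // time spent inside the retry loop
        final boolean leaderResponded; // true if the leader node replied before the read timeout

        Outcome(int requestsSent, long secondsElapsed, boolean leaderResponded) {
            this.requestsSent = requestsSent;
            this.secondsElapsed = secondsElapsed;
            this.leaderResponded = leaderResponded;
        }
    }

    /**
     * Simulates a replica sending prep recovery requests to a node whose core was unloaded.
     *
     * @param readTimeoutSec replica-side read timeout per request (10s in the description)
     * @param leaderWaitSec  how long the leader node keeps polling for the missing core
     *                       (assumed value; 0 models a hypothetical fail-fast response)
     * @param retryWindowSec how long the replica keeps retrying (5 minutes in the description)
     */
    static Outcome simulate(long readTimeoutSec, long leaderWaitSec, long retryWindowSec) {
        int requests = 0;
        long elapsed = 0;
        while (elapsed < retryWindowSec) {
            requests++;
            if (leaderWaitSec < readTimeoutSec) {
                // The leader node gives up on the missing core and replies (with an error)
                // before the replica's read timeout fires, so the replica stops retrying.
                elapsed += leaderWaitSec;
                return new Outcome(requests, elapsed, true);
            }
            // The leader node is still in its sleep-loop when the read timeout fires;
            // the replica treats this as a read timeout and retries.
            elapsed += readTimeoutSec;
        }
        return new Outcome(requests, elapsed, false);
    }

    public static void main(String[] args) {
        Outcome stuck = simulate(10, 30, 300);
        System.out.println("stuck: requests=" + stuck.requestsSent
                + " seconds=" + stuck.secondsElapsed);
        Outcome failFast = simulate(10, 0, 300);
        System.out.println("fail-fast: requests=" + failFast.requestsSent
                + " seconds=" + failFast.secondsElapsed);
    }
}
```

Under these assumptions, every request hits the 10-second read timeout, so the replica sends 30 requests and spends the full 300 seconds before the attempt fails. If the leader node answered immediately when the core is missing, the loop would end after a single request.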



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)

---------------------------------------------------------------------
To unsubscribe, e-mail: dev-unsubscribe@lucene.apache.org
For additional commands, e-mail: dev-help@lucene.apache.org

