lucene-dev mailing list archives

From "Jessica Cheng Mallet (JIRA)" <j...@apache.org>
Subject [jira] [Comment Edited] (SOLR-6405) ZooKeeper calls can easily not be retried enough on ConnectionLoss.
Date Fri, 22 Aug 2014 19:02:11 GMT

    [ https://issues.apache.org/jira/browse/SOLR-6405?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14107289#comment-14107289 ]

Jessica Cheng Mallet edited comment on SOLR-6405 at 8/22/14 7:01 PM:
---------------------------------------------------------------------

Right, most likely the first time it hits the ConnectionLoss is not time=0 of the connection
loss, so by loop i=4 it would've slept for 15s since i=0 and would therefore hit a SessionExpired.
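
A quick sketch of the arithmetic behind that 15s figure. The linear back-off of i * 1500 ms
per loop is an assumption, picked only because it reproduces a 15s total at i=4; it is not
quoted from the patch. BackoffMath and cumulativeSleepMs are made-up names for illustration.

```java
// Sketch only: the per-loop delay of i * stepMs is an ASSUMPTION chosen to
// match the 15s figure in the comment; it is not taken from the patch.
final class BackoffMath {

    /** Total milliseconds slept once the sleeps for loops 0..lastLoop have all run. */
    static long cumulativeSleepMs(int lastLoop, long stepMs) {
        long total = 0;
        for (int i = 0; i <= lastLoop; i++) {
            total += i * stepMs; // loop i sleeps i * stepMs (loop 0 does not sleep)
        }
        return total;
    }

    public static void main(String[] args) {
        // With stepMs = 1500 the running totals are 0, 1500, 4500, 9000, 15000 ms.
        for (int i = 0; i <= 4; i++) {
            System.out.println("i=" + i + " cumulative sleep = "
                    + cumulativeSleepMs(i, 1500) + " ms");
        }
    }
}
```

If the connection was already lost some time before loop i=0, the time since the actual loss
is that total plus the head start, which is why a session timeout of around 15s would already
have passed by i=4.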

But then, thinking about it again, why be clever at all about the padding or back-off?

Not to propose that we change this now, but let's pretend we don't do back-off and just sleep
1s between each loop. If we get ConnectionLoss back on the next attempt, there was no harm in
trying, because if we're disconnected the attempt wouldn't have hit ZooKeeper anyway. If we get
SessionExpired back, great, we can break out right away and throw the exception. If we've
reconnected, then yay, we succeeded. Because with each call we expect to get either success,
failure (SessionExpired), or "in progress" (ConnectionLoss), we can really just retry "forever"
without limiting the loop count. (The exception would be if we're worried that we'll somehow
keep getting ConnectionLoss even though the session has expired, but that'd be a pretty serious
ZooKeeper client bug. And if we're really worried about that, we can always, say, do 10 more
loops after we've already slept for a total of timeout.) The advantage of this approach is that
we never sleep for too long before finding out the definitive answer of success or
SessionExpired, while if the answer is ConnectionLoss, the retry isn't really incurring any
extra load on ZooKeeper anyway.
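
Roughly, the loop described above could look like the sketch below. This is not the patch and
not the existing method: Outcome, Sleeper, and retry are hypothetical stand-ins so the sketch
compiles without the ZooKeeper client, and the fixed 1s sleep and the "10 more loops" safety
valve are just the numbers used in this comment.

```java
import java.util.function.Supplier;

// Sketch only: every name here is hypothetical and stands in for the real
// ZooKeeper call and its ConnectionLoss / SessionExpired exceptions.
final class RetryUntilDefinitive {

    /** The three results each attempt is expected to produce. */
    enum Outcome { SUCCESS, CONNECTION_LOSS, SESSION_EXPIRED }

    /** Injectable sleep so the loop can be exercised without real waiting. */
    interface Sleeper { void sleep(long ms); }

    /**
     * Retries with a fixed sleep until the attempt returns a definitive answer
     * (SUCCESS or SESSION_EXPIRED). As a guard against a client that never
     * reports expiry, it gives up after extraLoops further attempts once the
     * total time slept has reached timeoutMs.
     */
    static Outcome retry(Supplier<Outcome> attempt, long sleepMs, long timeoutMs,
                         int extraLoops, Sleeper sleeper) {
        long slept = 0;
        int extra = 0;
        while (true) {
            Outcome outcome = attempt.get();
            if (outcome != Outcome.CONNECTION_LOSS) {
                return outcome; // definitive: success or session expired
            }
            if (slept >= timeoutMs && extra++ >= extraLoops) {
                return outcome; // safety valve: still "in progress" after the extra loops
            }
            sleeper.sleep(sleepMs); // fixed sleep, no back-off
            slept += sleepMs;
        }
    }

    public static void main(String[] args) {
        int[] calls = {0};
        // Simulated call: two ConnectionLoss results, then a successful reconnect.
        Outcome result = retry(
                () -> ++calls[0] < 3 ? Outcome.CONNECTION_LOSS : Outcome.SUCCESS,
                1000, 15000, 10, ms -> { /* no real sleeping in the demo */ });
        System.out.println(result + " after " + calls[0] + " attempts");
    }
}
```

A real caller would translate a SESSION_EXPIRED result (and a ConnectionLoss that survives
the safety valve) back into the corresponding exception.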

In the end, it's really weird that this method can ever throw a ConnectionLoss exception just
because we got the math wrong, since the intent is to retry until we get a SessionExpired,
isn't it?



> ZooKeeper calls can easily not be retried enough on ConnectionLoss.
> -------------------------------------------------------------------
>
>                 Key: SOLR-6405
>                 URL: https://issues.apache.org/jira/browse/SOLR-6405
>             Project: Solr
>          Issue Type: Bug
>          Components: SolrCloud
>            Reporter: Mark Miller
>            Assignee: Mark Miller
>            Priority: Critical
>             Fix For: 5.0, 4.10
>
>         Attachments: SOLR-6405.patch
>
>
> The current design requires that we are sure we retry on connection loss until session
> expiration.



--
This message was sent by Atlassian JIRA
(v6.2#6252)

