curator-user mailing list archives

From "Szekeres, Zoltan" <Zoltan.Szeke...@morganstanley.com>
Subject Curator retry policies
Date Fri, 22 Jul 2016 09:56:19 GMT
Hi Curator team,



We have three retry-related questions.


1. We're trying to decide which retry policy to use. Our desired behavior is to retry until success, with an exponential back-off capped at a maximum sleep time. However, the current ExponentialBackoffRetry implementation doesn't allow an unbounded number of retries. I've found the change[1] that added a maximum number of retries to ExponentialBackoffRetry, but it suggests the reason was an integer overflow. I'm happy to write my own policy (a sketch of what we have in mind follows below), but do you know of any reason not to allow an unbounded number of retries?
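
For reference, here is a minimal sketch of the kind of policy we have in mind. It implements Curator's RetryPolicy interface; the class name and the overflow cap are our own, so please treat it only as an illustration:

import java.util.Random;
import java.util.concurrent.TimeUnit;

import org.apache.curator.RetryPolicy;
import org.apache.curator.RetrySleeper;

// Hypothetical policy: retries forever, with an exponential back-off
// whose sleep time is capped at maxSleepMs.
public class UnboundedExponentialBackoffRetry implements RetryPolicy
{
    private final Random random = new Random();
    private final int baseSleepTimeMs;
    private final int maxSleepMs;

    public UnboundedExponentialBackoffRetry(int baseSleepTimeMs, int maxSleepMs)
    {
        this.baseSleepTimeMs = baseSleepTimeMs;
        this.maxSleepMs = maxSleepMs;
    }

    @Override
    public boolean allowRetry(int retryCount, long elapsedTimeMs, RetrySleeper sleeper)
    {
        try
        {
            sleeper.sleepFor(getSleepTimeMs(retryCount), TimeUnit.MILLISECONDS);
        }
        catch ( InterruptedException e )
        {
            Thread.currentThread().interrupt();
            return false;
        }
        return true;    // never give up
    }

    private long getSleepTimeMs(int retryCount)
    {
        // cap the exponent so the shift cannot overflow (the issue fixed in [1]),
        // and cap the resulting sleep at maxSleepMs
        int exponent = Math.min(retryCount, 29);
        long sleepMs = (long) baseSleepTimeMs * Math.max(1, random.nextInt(1 << (exponent + 1)));
        return Math.min(sleepMs, maxSleepMs);
    }
}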


2. Another issue we've faced is that our users don't always set the ACL entries on the nodes correctly, and because of this they receive NOAUTH errors. We're using the PersistentEphemeralNode and PathChildrenCache recipes, and the behavior we'd like is to retry (with an exponential back-off) until the ACLs are corrected. However, neither of these recipes retries on a NOAUTH error.



A possible solution would be to configure the CuratorFramework to retry on the NOAUTH code, but the retriable result codes are hard coded in RetryLoop. As a feature request, could the retriable result codes be made configurable via the CuratorFramework?



The solution we've tried is to add a new field to CuratorFrameworkImpl, a Set of KeeperException.Code, and to initialize it through the builder. In the retry condition in CuratorFrameworkImpl#processBackgroundOperation we also test whether the result code is in the Set. This way we're able to retry with an exponential back-off on NOAUTH result codes (a sketch of the check follows below).
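
To illustrate, this is roughly what the extra check in our patch looks like; the class, field and helper names are ours, and the Set would come from the builder in the real patch. Only the RetryLoop.shouldRetry() and KeeperException.Code calls are existing API:

import java.util.EnumSet;
import java.util.Set;

import org.apache.curator.RetryLoop;
import org.apache.zookeeper.KeeperException;

// Sketch of the check we patched into our fork of CuratorFrameworkImpl.
public class RetriableCodes
{
    // in our patch this Set is configured through the builder; here it is
    // hard coded to NOAUTH just for illustration
    private final Set<KeeperException.Code> additionalRetriableCodes =
            EnumSet.of(KeeperException.Code.NOAUTH);

    public boolean shouldRetry(int resultCode)
    {
        // RetryLoop.shouldRetry() covers the built-in transient errors
        // (connection loss, session expiry, ...); on top of that we retry
        // on the codes configured by the user
        return RetryLoop.shouldRetry(resultCode)
                || additionalRetriableCodes.contains(KeeperException.Code.get(resultCode));
    }
}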


3. During my investigation of the retry policy I noticed that the SharedValue recipe reads the value of the node synchronously when a watch event is triggered. However, it doesn't check the Keeper state and sends the request even when the state is "Disconnected". This blocks the ZooKeeper event thread until the request's retries are exhausted, which could be quite long depending on the retry policy in use, and it delays the delivery of the disconnect event to other listeners. I think in this case the read should either not be sent while disconnected and instead be sent when a reconnect event arrives, or it should be performed asynchronously (a rough sketch follows below).
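
As a rough illustration of the asynchronous variant (simplified; this is not how SharedValue is structured internally, and the class and method names are ours):

import org.apache.curator.framework.CuratorFramework;
import org.apache.curator.framework.api.BackgroundCallback;
import org.apache.curator.framework.api.CuratorEvent;
import org.apache.zookeeper.KeeperException;

public class AsyncValueReader
{
    // Read the node in the background instead of synchronously on the
    // ZooKeeper event thread, and skip the read entirely while disconnected.
    public void readValue(CuratorFramework client, String path) throws Exception
    {
        if ( !client.getZookeeperClient().isConnected() )
        {
            // defer until a RECONNECTED state change arrives instead of
            // blocking the event thread on retries
            return;
        }

        client.getData().inBackground(new BackgroundCallback()
        {
            @Override
            public void processResult(CuratorFramework framework, CuratorEvent event)
            {
                if ( event.getResultCode() == KeeperException.Code.OK.intValue() )
                {
                    byte[] value = event.getData();
                    // update the cached value / notify listeners here
                }
            }
        }).forPath(path);
    }
}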



Any advice is appreciated.



Kind regards,

Zoltan Szekeres



[1] https://github.com/Netflix/curator/commit/3c1b1b4dbf256e318b803e7bbcc2a3dcd2b88619



