zookeeper-user mailing list archives

Site index · List index
Message view « Date » · « Thread »
Top « Date » · « Thread »
From "Shelley, Ryan" <Ryan.Shel...@disney.com>
Subject Re: Input on a change
Date Fri, 13 Apr 2012 18:37:07 GMT
And to go just a bit further with this, would this be considered a
non-backward compatible change to the core expectations of the system
(even if it's patching a "bug")? It changes the "expected" behavior of the
application from an undefined possible zombie state to fail-fast.
Theoretically, that might constitute a version bump (http://semver.org),
unless you parameterize the behavior so that a user, who encounters this
behavior outside of their existing expectations, could turn off the
fail-fast behavior by setting an appropriate flag. I'm not an expert in ZK
whatsoever, but maybe there's some runtime operation that occurs and
generates an unchecked exception and the system is fine with that, but
it's only really experienced at runtime in exceptional cases. If the
system fails-fast, without a way to disable fail-fast, you may have to
release another patch promptly (instead of simply saying "disable
fail-fast for now, we'll fix the larger issue in the next release"). Maybe
that's excessive, but just putting it out there anyway.

On 4/13/12 11:22 AM, "Shelley, Ryan" <Ryan.Shelley@disney.com> wrote:

>Just my 2 centsÅ  is the error code 1 the correct error code to return to
>the OS? I'm just curious if anywhere else in ZooKeeper a System.exit(1)
>may be called. It may make sense to either re-use that error code, or use
>a different one (if 1 is already used elsewhere for a different type of
>error, like "Invalid arguments" during start-up, for example).
>
>If the error isn't an OOME, is there any clean-up ZK needs to do to maybe
>inform a cluster it's going down abruptly (maybe to gracefully begin a
>leader re-election if necessary, for example)?
>
>I'm +1 to fail-fast behavior.
>
>Ryan
>
>On 4/13/12 8:15 AM, "Scott Fines" <scottfines@gmail.com> wrote:
>
>>On some JVMs (the HotSpot for sure, but maybe others too?) there's a JVM
>>for performing actions on OutOfMemoryErrors (-XX:OnOutOfMemoryError="<cmd
>>args>, -XX:+HeapDumpOnOutOfMemoryError and maybe some others that I can't
>>remember off the top of my head). Will these triggers still be fired, or
>>will the catch-all prevent them?
>>
>>I'm still +1 for the change no matter what, but it's probably a good idea
>>to mention that in the docs if they don't work.
>>
>>Scott
>>
>>On Fri, Apr 13, 2012 at 10:09 AM, Camille Fournier
>><camille@apache.org>wrote:
>>
>>> Hi everyone,
>>>
>>> I'm trying to evaluate a patch that Jeremy Stribling has submitted, and
>>>I'd
>>> like some feedback from the user base on it.
>>> https://issues.apache.org/jira/browse/ZOOKEEPER-1442
>>>
>>> The current behavior of ZK when we get an uncaught exception is to log
>>>it
>>> and try to move on. This is arguably not the right thing to do, and
>>>will
>>> possibly cause ZK to limp along with a bad VM (say, in an OOM state)
>>>for
>>> longer than it should.
>>> The patch proposes that when we get an instance of java.lang.Error, we
>>> should do a system.exit to fast-fail the process. With the possible
>>> exception of ThreadDeath (which may or may not be an unrecoverable
>>>system
>>> state depending on the thread), I think this makes sense, but I would
>>>like
>>> to hear from others if they have an opinion. I think it's better to
>>>kill
>>> the process and let your monitoring services detect process death (and
>>>thus
>>> restart) than possibly linger unresponsive for a while, are there
>>>scenarios
>>> that we're missing where this error can occur and you wouldn't want the
>>> process killed?
>>>
>>> Thanks for your feedback,
>>>
>>> Camille
>>>
>


Mime
View raw message