zookeeper-user mailing list archives

From "Shelley, Ryan" <Ryan.Shel...@disney.com>
Subject Re: Input on a change
Date Fri, 13 Apr 2012 18:22:13 GMT
Just my 2 cents... is the error code 1 the correct error code to return to
the OS? I'm just curious if anywhere else in ZooKeeper a System.exit(1)
may be called. It may make sense to either re-use that error code, or use
a different one (if 1 is already used elsewhere for a different type of
error, like "Invalid arguments" during start-up, for example).
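To illustrate the point about distinct exit codes, here is a minimal sketch. The class name, the constant names, and the numeric values are all hypothetical; they are not ZooKeeper's actual exit codes, only an example of keeping one meaning per code so that a wrapper script or monitor can tell failure types apart.

```java
// Hypothetical exit-code table; the names and values are illustrative only
// and are not taken from the ZooKeeper code base.
final class ExitCodes {
    static final int GENERIC_FAILURE = 1;
    static final int INVALID_ARGUMENTS = 2;
    static final int OUT_OF_MEMORY = 10;
    static final int UNEXPECTED_ERROR = 11;

    private ExitCodes() {}

    // Map a throwable to the exit code the process should report to the OS.
    static int exitCodeFor(Throwable t) {
        if (t instanceof OutOfMemoryError) {
            return OUT_OF_MEMORY;
        }
        if (t instanceof Error) {
            return UNEXPECTED_ERROR;
        }
        if (t instanceof IllegalArgumentException) {
            return INVALID_ARGUMENTS;
        }
        return GENERIC_FAILURE;
    }
}
```

A caller would then do something like System.exit(ExitCodes.exitCodeFor(t)) rather than hard-coding 1 everywhere.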

If the error isn't an OOME, is there any clean-up ZK needs to do to maybe
inform a cluster it's going down abruptly (maybe to gracefully begin a
leader re-election if necessary, for example)?
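One way such clean-up could be attempted is a JVM shutdown hook, since System.exit runs registered shutdown hooks before the process terminates (Runtime.halt does not). The sketch below is only an assumption about how this might be wired; ShutdownCleanup and its methods are hypothetical names, and the clean-up action itself is left to the caller.

```java
// Hypothetical helper: wraps a best-effort clean-up action in a shutdown hook.
final class ShutdownCleanup {
    private ShutdownCleanup() {}

    // Build the hook thread without registering it, so it can be exercised directly.
    static Thread newCleanupHook(Runnable cleanup) {
        return new Thread(() -> {
            try {
                cleanup.run();
            } catch (Throwable t) {
                // Best effort only: the JVM is already going down, so just report it.
                System.err.println("Clean-up failed during shutdown: " + t);
            }
        }, "shutdown-cleanup");
    }

    // Register the hook; it runs on System.exit or normal JVM termination.
    static void register(Runnable cleanup) {
        Runtime.getRuntime().addShutdownHook(newCleanupHook(cleanup));
    }
}
```

In an out-of-memory state the hook itself may fail to run cleanly, so anything done here should be short and optional.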

I'm +1 to fail-fast behavior.


On 4/13/12 8:15 AM, "Scott Fines" <scottfines@gmail.com> wrote:

>On some JVMs (HotSpot for sure, but maybe others too?) there are JVM options
>for performing actions on OutOfMemoryErrors (-XX:OnOutOfMemoryError="<cmd
>args>", -XX:+HeapDumpOnOutOfMemoryError and maybe some others that I can't
>remember off the top of my head). Will these triggers still be fired, or
>will the catch-all prevent them?
>I'm still +1 for the change no matter what, but it's probably a good idea
>to mention that in the docs if they don't work.
>On Fri, Apr 13, 2012 at 10:09 AM, Camille Fournier
>> Hi everyone,
>> I'm trying to evaluate a patch that Jeremy Stribling has submitted, and
>> would like some feedback from the user base on it.
>> https://issues.apache.org/jira/browse/ZOOKEEPER-1442
>> The current behavior of ZK when we get an uncaught exception is to log
>> and try to move on. This is arguably not the right thing to do, and will
>> possibly cause ZK to limp along with a bad VM (say, in an OOM state) for
>> longer than it should.
>> The patch proposes that when we get an instance of java.lang.Error, we
>> should do a System.exit to fast-fail the process. With the possible
>> exception of ThreadDeath (which may or may not be an unrecoverable
>> state depending on the thread), I think this makes sense, but I would
>> like to hear from others if they have an opinion. I think it's better to kill
>> the process and let your monitoring services detect process death (and
>> restart) than possibly linger unresponsive for a while. Are there cases
>> that we're missing where this error can occur and you wouldn't want the
>> process killed?
>> Thanks for your feedback,
>> Camille
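The fail-fast behavior described in the quoted proposal can be sketched roughly as below. This is not the ZOOKEEPER-1442 patch itself, only a minimal illustration under stated assumptions: FailFastHandler, isFatal, and EXIT_CODE are hypothetical names, the exit action is injected so the logic can be exercised without killing the JVM, and ThreadDeath is matched by class name (exact class only) because that class is deprecated for removal in recent JDKs.

```java
import java.util.function.IntConsumer;

// Hypothetical sketch of a fail-fast uncaught-exception handler.
final class FailFastHandler implements Thread.UncaughtExceptionHandler {
    // Assumed exit code; the thread above questions whether 1 is the right value.
    static final int EXIT_CODE = 1;

    private final IntConsumer exiter;

    FailFastHandler(IntConsumer exiter) {
        this.exiter = exiter;
    }

    // Treat any java.lang.Error as fatal, with ThreadDeath as the one exception.
    static boolean isFatal(Throwable t) {
        return t instanceof Error
                && !"java.lang.ThreadDeath".equals(t.getClass().getName());
    }

    @Override
    public void uncaughtException(Thread thread, Throwable t) {
        System.err.println("Uncaught throwable in thread " + thread.getName() + ": " + t);
        if (isFatal(t)) {
            exiter.accept(EXIT_CODE);
        }
        // Non-fatal throwables are only logged, matching the older log-and-move-on behavior.
    }

    // Install process-wide, exiting via System.exit on fatal errors.
    static void install() {
        Thread.setDefaultUncaughtExceptionHandler(new FailFastHandler(System::exit));
    }
}
```

Calling FailFastHandler.install() once at start-up would apply this to every thread that does not set its own handler.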
