Date: Wed, 15 Feb 2012 11:16:40 -0500
Subject: Re: Suspension
From: John Vines
To: accumulo-user@incubator.apache.org
In-Reply-To: <158413449.68371.1329322279137.JavaMail.root@linzimmb04o.imo.intelink.gov>
References: <292814158.68322.1329321393860.JavaMail.root@linzimmb04o.imo.intelink.gov> <158413449.68371.1329322279137.JavaMail.root@linzimmb04o.imo.intelink.gov>

There are too many cases where a node legitimately died and we do not want
it constantly coming back and bogging things down. How do you design it to
restart after the accidental deaths but not the deserved ones?

On Feb 15, 2012 11:11 AM, "Adam Fuchs" <adam.p.fuchs@ugov.gov> wrote:

> This isn't really just a laptop problem. We also see hiccups in clusters
> (admins accidentally the whole network, etc.) that we would want to
> automatically recover from. I think having self-restarting processes could
> be generally useful.
>
> I think that an option of not using zookeeper timeouts might lead to
> abuse, and could be very bad for stability under rare failure modes. We
> make a lot of assumptions throughout the code about these timeouts, and we
> would have to reconsider a large part of that model.
>
> Adam
>
>
> On Wed, Feb 15, 2012 at 10:56 AM, Billie J Rinaldi
> <billie.j.rinaldi@ugov.gov> wrote:
>
>> On Wednesday, February 15, 2012 10:38:41 AM, "Aaron Cordova"
>> <aaron@cordovas.org> wrote:
>> > Such an option would have to be very conspicuous so that users don't
>> > accidentally enable it and then wonder why bad tablet servers aren't
>> > removed automatically from the cluster.
>>
>> We could call it laptop.mode.
>>
>> Billie
>>