From: Frans Lawaetz
Date: Mon, 10 Mar 2014 11:03:04 -0400
Subject: Re: Is it safe / advisable to increase Zookeeper timeout?
To: user@accumulo.apache.org

On Fri, Mar 7, 2014 at 12:15 PM, Josh Elser wrote:

> On 3/7/14, 12:01 PM, Terry P. wrote:
>
>> Greetings folks,
>> It seems network woes will never go away for this Accumulo 1.4.2 project
>> :-(
>>
>> They rebooted one of the two "redundant switches" last night, but of
>> course zero redundancy actually took place and the Master lost his
>> zookeeper lock as did one of the Datanodes after 60 seconds and shut
>> itself down.
>
> By datanode you mean tserver? Hadoop datanodes don't communicate with
> ZooKeeper.
>
>> The 60 second period is odd, because I see that
>> instance.zookeeper.timeout is actually set to 30s, but I do recall that
>> often by default zookeeper clients retry 2 times before bailing so maybe
>> that's why.
>
> It won't always be 30s before it's seen; I've seen it much quicker too.
> I'm not sure about the retries off the top of my head.

Most likely you were seeing the effects of ACCUMULO-1572 in which a
ZooKeeper disconnect causes Accumulo failure before the expiration of the
session. Fixed in 1.5.1 and to-be-released 1.4.5. If you think you're
seeing something else it would be good to hear about it.
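
For reference, the timeout discussed in this thread is an ordinary Accumulo site property. A minimal sketch of how it might appear, assuming the usual layout where site overrides live in conf/accumulo-site.xml, and using 30s only because that is the value quoted above (not a recommendation):

```xml
<!-- conf/accumulo-site.xml (fragment; file location is an assumption) -->
<property>
  <name>instance.zookeeper.timeout</name>
  <!-- Value quoted in the thread; shown for illustration only -->
  <value>30s</value>
</property>
```

If this value is raised, keep in mind that the ZooKeeper server caps negotiated session timeouts at its maxSessionTimeout (20 times tickTime by default), so a larger setting here may not take effect unless that limit is raised as well.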


