From: Anand Parthasarathy
Date: Thu, 13 Oct 2016 10:16:57 -0700
Subject: Re: Zookeeper leader election takes a long time.
To: user@zookeeper.apache.org
Reply-To: user@zookeeper.apache.org
References: <1A9BB7DA-9D76-40CF-92C8-743A3B418743@apache.org>
MIME-Version: 1.0
Content-Type: multipart/alternative; boundary=001a11415444e68e5d053ec24669

--001a11415444e68e5d053ec24669
Content-Type: text/plain; charset=UTF-8

Just wanted to let you know that at this time, one of the nodes is powered
off and the other two nodes took more than 10 minutes to converge. Our
script exits, so we don't know exactly when it converged. Normally, it
takes < 100 seconds to converge.

Thanks,
Anand.

On Thu, Oct 13, 2016 at 10:09 AM, Anand Parthasarathy <
anpartha@avinetworks.com> wrote:

> Hi Michael,
>
> We have reproduced this issue on a private AWS setup that has public IP
> access. I will send you the details of the instance IP and the credentials
> separately. If it needs to be shared with more people, I am happy to share
> with them as well.
>
> Thanks
> Anand.
>
> On Tue, Oct 11, 2016 at 3:46 PM, Michael Han wrote:
>
>> Hi Anand,
>>
>> >> We have isolated it to a test setup, where we are able
>> >> to reproduce this somewhat consistently if we keep a node powered off.
>>
>> Do you mind sharing your setup / steps to reproduce if the setup only
>> involves ZooKeeper without other dependencies?
>>
>>
>> On Tue, Oct 11, 2016 at 2:56 PM, Anand Parthasarathy <
>> anpartha@avinetworks.com> wrote:
>>
>> > Folks,
>> >
>> > Sending a quick note again to find out if there is any insight the
>> > community can offer in terms of a solution or workaround. We use zookeeper
>> > for service discovery in our product, and this issue has surfaced at a large
>> > customer site a couple of times, so we need to figure out a solution soon.
>> >
>> > Thanks,
>> > Anand.
>> >
>> > On Mon, Oct 10, 2016 at 10:15 AM, Anand Parthasarathy <
>> > anpartha@avinetworks.com> wrote:
>> >
>> > > Folks,
>> > >
>> > > Any insight into this or any workarounds that you can think of to
>> > > mitigate this issue? We have isolated it to a test setup, where we are
>> > > able to reproduce this somewhat consistently if we keep a node powered off.
>> > >
>> > > Thanks,
>> > > Anand.
>> > >
>> > > On Sat, Oct 8, 2016 at 10:05 AM, Anand Parthasarathy <
>> > > anpartha@avinetworks.com> wrote:
>> > >
>> > >> Hi Flavio,
>> > >>
>> > >> I have attached the logs from node 1 and node 3. Node 2 was powered off
>> > >> around 10-03 12:36. Leader election kept going until 10-03 15:57:16, when
>> > >> it finally converged.
>> > >>
>> > >> Thanks,
>> > >> Anand.
>> > >>
>> > >> On Sat, Oct 8, 2016 at 7:55 AM, Flavio Junqueira wrote:
>> > >>
>> > >>> Hi Anand,
>> > >>>
>> > >>> I don't understand whether 1 and 3 were able or even trying to connect
>> > >>> to each other. They should be able to elect a leader between them and
>> > >>> make progress. You might want to upload logs and let us know.
>> > >>>
>> > >>> -Flavio
>> > >>>
>> > >>> > On 08 Oct 2016, at 02:11, Anand Parthasarathy <
>> > >>> anpartha@avinetworks.com> wrote:
>> > >>> >
>> > >>> > Hi,
>> > >>> >
>> > >>> > We are currently using zookeeper version 3.4.6 with a 3-node setup in
>> > >>> > our system. We see that occasionally, when a node is powered off (in
>> > >>> > this instance, it was actually the leader node), the remaining two
>> > >>> > nodes do not form a quorum for a really long time. Looking at the
>> > >>> > logs, it appears the sequence is as follows:
>> > >>> > - Node 2 is the zookeeper leader
>> > >>> > - Node 2 is powered off
>> > >>> > - Node 1 and Node 3 recognize this and start the election
>> > >>> > - Node 3 times out after initLimit * tickTime with "Timeout while
>> > >>> >   waiting for quorum" for Round N
>> > >>> > - Node 1 times out after initLimit * tickTime with "Exception while
>> > >>> >   trying to follow leader" for Round N+1 at the same time.
>> > >>> > - The process continues, with N incrementing sequentially.
>> > >>> > - This happens for a long time.
>> > >>> > - In one instance, we used tickTime=5000 and initLimit=20 and it took
>> > >>> >   around 3.5 hours to converge.
>> > >>> > - In a given round, Node 1 tries connecting to Node 2, gets connection
>> > >>> >   refused, and waits for the notification timeout, which increases by 2
>> > >>> >   every iteration until it hits the initLimit. The connection is
>> > >>> >   refused because Node 2 comes up after the reboot, but the zookeeper
>> > >>> >   process is not started (due to a different failure).
>> > >>> >
>> > >>> > It looks similar to ZOOKEEPER-2164, but there it is a connection
>> > >>> > timeout where Node 2 is not reachable.
>> > >>> >
>> > >>> > Could you please share whether you have seen this issue and, if so,
>> > >>> > what workaround can be employed in 3.4.6?
>> > >>> >
>> > >>> > Thanks,
>> > >>> > Anand.
>> > >>>
>> > >>>
>> > >>
>> > >
>> >
>>
>>
>>
>> --
>> Cheers
>> Michael.
>>
>
>

--001a11415444e68e5d053ec24669--
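
For reference, the round timing in the original report can be worked through numerically. This is only a rough sketch: it assumes every failed election round runs for the full initLimit * tickTime before timing out, as the report describes, and that the "around 3.5 hours" figure applies to the tickTime=5000, initLimit=20 run. The function and variable names are made up for illustration and are not part of ZooKeeper.

```python
# Rough arithmetic for the election rounds described in the report.
# Assumption: each failed round lasts the full initLimit * tickTime.

def round_timeout_seconds(tick_time_ms: int, init_limit: int) -> float:
    """Length of one timed-out round, in seconds."""
    return tick_time_ms * init_limit / 1000.0

def failed_rounds(elapsed_seconds: float, round_seconds: float) -> float:
    """Approximate number of rounds that fit in the elapsed time."""
    return elapsed_seconds / round_seconds

TICK_TIME_MS = 5000      # tickTime from the report
INIT_LIMIT = 20          # initLimit from the report
ELAPSED_HOURS = 3.5      # "around 3.5 hours" from the report

round_s = round_timeout_seconds(TICK_TIME_MS, INIT_LIMIT)
rounds = failed_rounds(ELAPSED_HOURS * 3600, round_s)

print(f"one round: {round_s:.0f} s")
print(f"rounds before convergence: about {rounds:.0f}")
```

With these values a single round is 100 seconds, so 3.5 hours corresponds to roughly 126 consecutive timed-out rounds, if the assumption above holds.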