Date: Tue, 12 Nov 2013 17:19:59 -0800
From: "Bae, Jae Hyeon"
To: user@zookeeper.apache.org
Subject: Re: How to join quorum without restarting existing servers

Thanks a lot, German. Now I understand the strange behavior, so we decided to use the IP addresses themselves in the server list instead of hostnames. The problem went away.

On Wed, Nov 6, 2013 at 8:34 PM, German Blanco <german.blanco.blanco@gmail.com> wrote:

> Hello again,
>
> I don't think it is a good idea to start a new thread with the same
> issue. Please continue in the latest thread.
>
> Could this be a DNS resolution caching problem?
> See https://issues.apache.org/jira/browse/ZOOKEEPER-1506
>
> The new server has the lowest sid. It is able to connect to all other
> servers, but the rest of the servers don't seem able to connect to it.
> Connections from this server to the rest are useless, since they are
> dropped because of the sid comparison that you see in the log.
>
> You could try changing the server addresses in the configuration to the
> AWS public IP addresses of the peers, just to test whether that works.
> Or try replacing the server with the highest sid; that should also work.
> Otherwise (assuming the problem is DNS resolution), the only current
> workaround that I can think of is the rolling restart, as you have noticed.
>
>
> On Wed, Nov 6, 2013 at 6:39 PM, Diego Oliveira wrote:
>
> > Bae,
> >
> > Just a note: when using ZooKeeper in Amazon AWS, the instance IP
> > relocation at restart is a nightmare.
> > One solution is to do as you said,
> > using an elastic IP, but the maximum of 5 is limiting. Another option is
> > to configure a VPC. I ran into these problems last year.
> >
> > Att,
> > Diego.
> >
> >
> > On Tue, Nov 5, 2013 at 4:18 PM, Bae, Jae Hyeon wrote:
> >
> > > I am attaching the log file. Could you take a look at why the new
> > > instance cannot join the quorum?
> > >
> > >
> > > On Tue, Nov 5, 2013 at 9:52 AM, Bae, Jae Hyeon wrote:
> > >
> > >> Thanks a lot Ben
> > >>
> > >> We are also using ZooKeeper in AWS with elastic IPs. The reason I asked
> > >> this question is that when the bad ZooKeeper EC2 instance is terminated
> > >> and a new instance is launched with the previous elastic IP, it cannot
> > >> join the quorum, and there are no specific error messages. But when I
> > >> did a rolling restart, the new instance started normally, synchronized,
> > >> and joined the quorum.
> > >>
> > >> As I understand German's response, the new instance should start,
> > >> synchronize, and join the quorum successfully without any impact on the
> > >> existing instances, but it didn't. I will investigate further.
> > >>
> > >> Thank you
> > >> Best, Jae
> > >>
> > >>
> > >> On Tue, Nov 5, 2013 at 8:24 AM, Ben Hall wrote:
> > >>
> > >>> Hi Jae,
> > >>>
> > >>> I wrote that article several years ago (tbh, I hope it is not totally
> > >>> out of date by now). I agree with German's points.
> > >>>
> > >>> The issue it was solving was how to replace a bad server without having
> > >>> to shut down the ensemble and without having to update the config files
> > >>> on each server. I would also add that this only works as long as the
> > >>> server names and ports are the same. IIRC, at the time the article was
> > >>> written we were using servers in AWS and referencing them either by
> > >>> assigned hostnames such as zookeeper-[01|11] or by elastic IPs that
> > >>> could be moved from server to server.
> > >>>
> > >>> If I understand your question correctly, if you are "adding a new
> > >>> server", such as going from 7 to 9 servers, then this approach won't
> > >>> benefit you.
> > >>>
> > >>> We also used this approach when we upgraded the servers, but like
> > >>> German said, we did it one server at a time so that the leader
> > >>> election could happen naturally. This allowed us to upgrade a pool of
> > >>> 11 servers that were responsible for many thousands of client
> > >>> connections without any downtime.
> > >>>
> > >>> Thanks
> > >>> Ben
> > >>>
> > >>>
> > >>> On 11/5/13 6:51 AM, "German Blanco" wrote:
> > >>>
> > >>> >... and make sure that there is no rubbish in the data dir of the new
> > >>> >server.
> > >>> >
> > >>> >
> > >>> >On Tue, Nov 5, 2013 at 3:49 PM, German Blanco <german.blanco.blanco@gmail.com> wrote:
> > >>> >
> > >>> >> Hello Jae,
> > >>> >>
> > >>> >> I think that the answer to your question is "no, there is no benefit
> > >>> >> in a rolling restart in that case".
> > >>> >> If you remove a machine that was hosting a ZooKeeper server that was
> > >>> >> part of a cluster, and replace it with a new machine with a ZooKeeper
> > >>> >> server running the same software version and listening on the same
> > >>> >> IP and ports, then this new server will join the cluster,
> > >>> >> synchronize, and start working normally.
> > >>> >> I wouldn't recommend replacing more than one server at a time, and I
> > >>> >> think that it is better if the new server joins while the existing
> > >>> >> quorum is stable (avoid leader elections while the new server joins,
> > >>> >> i.e. avoid restarts or disconnections of the existing servers).
> > >>> >>
> > >>> >> Best regards,
> > >>> >>
> > >>> >> Germán.
> > >>> >>
> > >>> >>
> > >>> >> On Tue, Nov 5, 2013 at 6:42 AM, Bae, Jae Hyeon <metacret@gmail.com> wrote:
> > >>> >>
> > >>> >>> Hi
> > >>> >>>
> > >>> >>> I read this article:
> > >>> >>>
> > >>> >>> http://www.benhallbenhall.com/2011/07/rolling-restart-in-apache-zookeeper-to-dynamically-add-servers-to-the-ensemble/
> > >>> >>>
> > >>> >>> My question is: even though the failed hardware is replaced with the
> > >>> >>> same IP address, do I need to do a rolling restart to add the
> > >>> >>> replaced hardware to the quorum?
> > >>> >>>
> > >>> >>> I am using ZooKeeper version 3.4.5.
> > >>> >>>
> > >>> >>> Thank you
> > >>> >>> Best, Jae
> >
> >
> > --
> > Att.
> > Diego de Oliveira
> > System Architect
> > diego@diegooliveira.com
> > www.diegooliveira.com
> > Never argue with a fool -- people might not be able to tell the
> > difference
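German's diagnosis in this thread rests on how ZooKeeper peers decide which of two election connections to keep. The following is a hypothetical, simplified Java model, not ZooKeeper code. Its rule is an assumption based on German's description and on the "Have smaller server identifier, so dropping the connection" behavior of ZooKeeper 3.4's QuorumCnxManager, as the author of this note understands it: between two peers, a connection opened by the smaller sid is dropped, and the larger sid is expected to dial back. `SidTieBreakSketch`, `linked`, and `linkedPeers` are invented names, and a stale DNS entry is modeled as a single boolean flag.

```java
import java.util.ArrayList;
import java.util.List;

/** Hypothetical model of the sid tie-break; names are illustrative only. */
public class SidTieBreakSketch {

    /** Assumed rule: only a connection opened by the larger sid is kept. */
    static boolean keptWhenInitiatedBy(long initiatorSid, long receiverSid) {
        return initiatorSid > receiverSid;
    }

    /**
     * Whether the new server ends up with a usable connection to a peer.
     * peersReachNewIp is false when peers still dial the old (cached) IP.
     * Assumes the new server itself can always reach every peer.
     */
    static boolean linked(long newSid, long peerSid, boolean peersReachNewIp) {
        if (keptWhenInitiatedBy(newSid, peerSid)) {
            return true;            // new server dials out and wins the tie-break
        }
        return peersReachNewIp;     // otherwise the peer must dial back successfully
    }

    /** Returns the sids of existing peers the new server is linked to. */
    static List<Long> linkedPeers(long newSid, long[] peerSids, boolean peersReachNewIp) {
        List<Long> result = new ArrayList<>();
        for (long peer : peerSids) {
            if (linked(newSid, peer, peersReachNewIp)) {
                result.add(peer);
            }
        }
        return result;
    }

    public static void main(String[] args) {
        // Stale DNS: peers still resolve the replaced server's hostname to its old IP.
        boolean peersReachNewIp = false;

        // 3-node ensemble {1, 2, 3}; the replaced server has the lowest sid.
        System.out.println("lowest sid replaced: " + linkedPeers(1, new long[] {2, 3}, peersReachNewIp));
        // Same ensemble; the replaced server has the highest sid.
        System.out.println("highest sid replaced: " + linkedPeers(3, new long[] {1, 2}, peersReachNewIp));
    }
}
```

Under these assumptions, `main` prints `lowest sid replaced: []` and `highest sid replaced: [1, 2]`: a lowest-sid replacement is isolated when peers cannot reach its new address, while a highest-sid replacement still links to everyone, which is consistent with German's suggestion that replacing the server with the highest sid should work. Using IP addresses in the server list, as Jae finally did, corresponds to making the flag true.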