Date: Thu, 23 Jan 2014 11:42:57 -0800
Subject: Re: zk server falling apart from quorum due to connection loss and couldn't connect back
From: Deepak Jagtap
To: user@zookeeper.apache.org

Hi,

zoo.cfg is:

maxClientCnxns=50
# The number of milliseconds of each tick
tickTime=2000
# The number of ticks that the initial
# synchronization phase can take
initLimit=10
# The number of ticks that can pass between
# sending a request and getting an acknowledgement
syncLimit=5
# the directory where the snapshot is stored.
dataDir=/var/lib/zookeeper
# the port at which the clients will connect
clientPort=2181
autopurge.snapRetainCount=3
autopurge.purgeInterval=1
dynamicConfigFile=/etc/maxta/zookeeper/conf/zoo.cfg.dynamic

zoo.cfg.dynamic is:

server.1=169.254.1.1:2888:3888:participant;0.0.0.0:2181
server.2=169.254.1.2:2888:3888:participant;0.0.0.0:2181
server.3=169.254.1.3:2888:3888:participant;0.0.0.0:2181
version=1

Thanks & Regards,
Deepak

On Thu, Jan 23, 2014 at 11:30 AM, German Blanco <german.blanco.blanco@gmail.com> wrote:

> Sorry but the attachment didn't make it through.
> It might be safer to put the files somewhere in the web and send a link.
>
> On Thu, Jan 23, 2014 at 8:00 PM, Deepak Jagtap wrote:
>
> > Hi German,
> >
> > Please find zookeeper config files attached.
> >
> > Thanks & Regards,
> > Deepak
> >
> > On Thu, Jan 23, 2014 at 12:59 AM, German Blanco <german.blanco.blanco@gmail.com> wrote:
> >
> >> Hello!
> >>
> >> Could you please post your configuration files?
> >>
> >> Regards,
> >>
> >> German.
> >>
> >> On Thu, Jan 23, 2014 at 2:28 AM, Deepak Jagtap wrote:
> >>
> >> > Hi All,
> >> >
> >> > We have deployed zookeeper version 3.5.0.1515976, with 3 zk servers in the quorum.
> >> > The problem we are facing is that one zookeeper server in the quorum falls apart
> >> > and never becomes part of the cluster again until we restart the zookeeper server
> >> > on that node.
> >> >
> >> > Our interpretation of the zookeeper logs on all nodes is as follows
> >> > (for simplicity, assume S1 => zk server 1, S2 => zk server 2, S3 => zk server 3):
> >> >
> >> > Initially S3 is the leader while S1 and S2 are followers.
> >> >
> >> > S2 hits a 46 sec latency while fsyncing the write-ahead log, which results in loss
> >> > of connection with S3.
> >> > S3 in turn prints the following error message:
> >> >
> >> > Unexpected exception causing shutdown while sock still open
> >> > java.net.SocketTimeoutException: Read timed out
> >> > Stack trace
> >> > ******* GOODBYE /169.254.1.2:47647(S2) ********
> >> >
> >> > S2 in this case closes its connection with S3 (the leader) and shuts down the
> >> > follower with the following log messages:
> >> >
> >> > Closing connection to leader, exception during packet send
> >> > java.net.SocketException: Socket close
> >> > Follower@194] - shutdown called
> >> > java.lang.Exception: shutdown Follower
> >> >
> >> > After this point S3 could never reestablish the connection with S2, and the leader
> >> > election mechanism keeps failing. S3 now keeps printing the following message
> >> > repeatedly:
> >> >
> >> > Cannot open channel to 2 at election address /169.254.1.2:3888
> >> > java.net.ConnectException: Connection refused.
> >> >
> >> > While S3 is in this state, S2 keeps printing the following messages repeatedly:
> >> >
> >> > INFO [NIOServerCxnFactory.AcceptThread:/0.0.0.0:2181:NIOServerCnxnFactory$AcceptThread@296] - Accepted socket connection from /127.0.0.1:60667
> >> > Exception causing close of session 0x0: ZooKeeperServer not running
> >> > Closed socket connection for client /127.0.0.1:60667 (no session established for client)
> >> >
> >> > Leader election never completes successfully, causing S2 to fall apart from the
> >> > quorum.
> >> > S2 was out of the quorum for almost 1 week.
> >> >
> >> > While debugging this issue, we found that neither the election port nor the peer
> >> > connection port on S2 could be reached via telnet from any of the nodes (S1, S2,
> >> > S3). Network connectivity is not the issue. Later, we restarted the ZK server on
> >> > S2 (service zookeeper-server restart); after that we could telnet to both ports,
> >> > and S2 joined the ensemble after a leader election attempt.
> >> > Any idea what might be forcing S2 into a situation where it won't accept any
> >> > connections on the leader election and peer connection ports?
> >> >
> >> > Should I file a jira on this and upload all the log files while submitting the
> >> > jira, given that the log files are close to 250MB each?
> >> >
> >> > Thanks & Regards,
> >> > Deepak
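
For anyone trying to reproduce the checks described in this thread, here is a minimal Python sketch, not a diagnosis. It does two things: it turns the tick-based limits from the posted zoo.cfg into milliseconds, going only by the comments in that file (tickTime is milliseconds per tick, initLimit and syncLimit are counts of ticks), and it repeats the telnet-style probe of the peer and election ports. The helper names (parse_cfg, timeouts_ms, port_accepts_connections) are made up for this example and are not part of ZooKeeper; the host and ports are the S2 values from the posted zoo.cfg.dynamic.

```python
import socket

# Tick-related settings copied from the zoo.cfg posted in this thread.
ZOO_CFG = """\
# The number of milliseconds of each tick
tickTime=2000
# The number of ticks that the initial synchronization phase can take
initLimit=10
# The number of ticks that can pass between
# sending a request and getting an acknowledgement
syncLimit=5
"""


def parse_cfg(text):
    """Parse key=value lines, skipping blank lines and # comments."""
    settings = {}
    for line in text.splitlines():
        line = line.strip()
        if not line or line.startswith("#"):
            continue
        key, _, value = line.partition("=")
        settings[key.strip()] = value.strip()
    return settings


def timeouts_ms(settings):
    """Convert the tick-based limits into milliseconds."""
    tick = int(settings["tickTime"])
    return {
        "init": tick * int(settings["initLimit"]),
        "sync": tick * int(settings["syncLimit"]),
    }


def port_accepts_connections(host, port, timeout=2.0):
    """Return True if a TCP connection to host:port succeeds (like telnet)."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        # Covers connection refused, timeouts, and unreachable networks.
        return False


if __name__ == "__main__":
    limits = timeouts_ms(parse_cfg(ZOO_CFG))
    print("limits in ms:", limits)

    # The 46 sec fsync stall reported for S2, compared with the sync window.
    fsync_stall_ms = 46_000
    print("stall exceeds sync window:", fsync_stall_ms > limits["sync"])

    # Probe S2's peer (2888) and election (3888) ports from zoo.cfg.dynamic.
    for port in (2888, 3888):
        print(port, port_accepts_connections("169.254.1.2", port))
```

With the posted values the sync window works out to 5 x 2000 ms = 10 s, so a 46 sec fsync stall is well beyond it. Running the probe from each of S1, S2 and S3 gives the same information as the manual telnet test, in a form that can be scripted or run periodically.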