Date: Thu, 23 Jan 2014 20:30:24 +0100
Subject: Re: zk server falling apart from quorum due to connection loss and couldn't connect back
From: German Blanco
To: user@zookeeper.apache.org

Sorry, but the attachment didn't make it through. It might be safer to put
the files somewhere on the web and send a link.

On Thu, Jan 23, 2014 at 8:00 PM, Deepak Jagtap wrote:

> Hi German,
>
> Please find zookeeper config files attached.
>
> Thanks & Regards,
> Deepak
>
>
> On Thu, Jan 23, 2014 at 12:59 AM, German Blanco <
> german.blanco.blanco@gmail.com> wrote:
>
>> Hello!
>>
>> Could you please post your configuration files?
>>
>> Regards,
>>
>> German.
>>
>>
>> On Thu, Jan 23, 2014 at 2:28 AM, Deepak Jagtap wrote:
>>
>> > Hi All,
>> >
>> > We have deployed zookeeper version 3.5.0.1515976, with 3 zk servers in the
>> > quorum.
>> > The problem we are facing is that one zookeeper server in the quorum falls
>> > apart, and never becomes part of the cluster until we restart zookeeper
>> > server on that node.
>> >
>> > Our interpretation from zookeeper logs on all nodes is as follows:
>> > (For simplicity assume S1 => zk server 1, S2 => zk server 2, S3 => zk server 3)
>> > Initially S3 is the leader while S1 and S2 are followers.
>> >
>> > S2 hits 46 sec latency while fsyncing the write-ahead log, which results in
>> > loss of connection with S3.
>> > S3 in turn prints following error message:
>> >
>> > Unexpected exception causing shutdown while sock still open
>> > java.net.SocketTimeoutException: Read timed out
>> > Stack trace
>> > ******* GOODBYE /169.254.1.2:47647(S2) ********
>> >
>> > S2 in this case closes connection with S3(leader) and shuts down follower
>> > with following log messages:
>> > Closing connection to leader, exception during packet send
>> > java.net.SocketException: Socket close
>> > Follower@194] - shutdown called
>> > java.lang.Exception: shutdown Follower
>> >
>> > After this point S3 could never reestablish connection with S2 and leader
>> > election mechanism keeps failing. S3 now keeps printing following message
>> > repeatedly:
>> > Cannot open channel to 2 at election address /169.254.1.2:3888
>> > java.net.ConnectException: Connection refused.
>> >
>> > While S3 is in this state, S2 repeatedly keeps printing following message:
>> > INFO [NIOServerCxnFactory.AcceptThread:/0.0.0.0:2181:NIOServerCnxnFactory$AcceptThread@296] - Accepted socket connection from /127.0.0.1:60667
>> > Exception causing close of session 0x0: ZooKeeperServer not running
>> > Closed socket connection for client /127.0.0.1:60667 (no session
>> > established for client)
>> >
>> > Leader election never completes successfully and causing S2 to fall apart
>> > from the quorum.
>> > S2 was out of quorum for almost 1 week.
>> >
>> > While debugging this issue, we found out that both election and peer
>> > connection ports on S2 can't be telneted from any of the node (S1, S2,
>> > S3). Network connectivity is not the issue. Later, we restarted the ZK
>> > server S2 (service zookeeper-server restart) -- now we could telnet to both
>> > the ports and S2 joined the ensemble after a leader election attempt.
>> > Any idea what might be forcing S2 to get into a situation where it won't
>> > accept any connections on the leader election and peer connection ports?
>> >
>> > Should I file a jira on this and upload all log files while submitting the
>> > jira as log files are close to 250MB each?
>> >
>> > Thanks & Regards,
>> > Deepak
>> >
>> >
>
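
The telnet check described in the quoted report (whether the election and peer connection ports on S2 accept connections) could also be scripted, so it can be repeated from each node. This is only a minimal sketch under assumptions: the function names are made up for illustration, and which ports to probe depends on the configuration files that did not reach the list. Port 3888 is the election port seen in the quoted log; any other port number would have to come from the server entries in zoo.cfg.

```python
import socket


def port_accepts_connections(host, port, timeout=3.0):
    """Return True if a TCP connection to host:port can be opened.

    Hypothetical helper (not part of ZooKeeper): it only checks that
    something is listening, roughly what a manual telnet test shows.
    """
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        # Covers connection refused, timeouts and unreachable hosts.
        return False


def probe_ports(hosts, ports, timeout=3.0):
    """Probe every host/port pair and return {(host, port): bool}."""
    return {
        (host, port): port_accepts_connections(host, port, timeout)
        for host in hosts
        for port in ports
    }
```

For example, probe_ports(["169.254.1.2"], [3888]) run from S1, S2 and S3 would show whether S2's election port is reachable from each node. A False result from every node, including S2 itself, would match the "Connection refused" messages that S3 keeps printing.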