Date: Thu, 23 Jan 2014 17:19:57 -0800
Subject: Re: Problems with running ZK on a shared disk
From: Neha Narkhede
To: "user@zookeeper.apache.org"

The timeout to increase would be the ZooKeeper "session timeout". For Kafka,
the appropriate config is "zookeeper.session.timeout.ms".

Thanks,
Neha

On Thu, Jan 23, 2014 at 2:05 PM, Ahmed H. wrote:

> Thanks for the response Nikhil.
>
> What about timeouts? I have been reading about increasing timeouts to
> alleviate some of those symptoms but I am unsure of which timeouts they are
> referring to. Can you provide some insight?
>
> I currently have one ZooKeeper instance, so disabling forceSync shouldn't
> have any major downsides in this case. I will certainly give it a try when
> I get the chance.
>
> Thanks
>
> On Thu, Jan 23, 2014 at 3:17 PM, Nikhil wrote:
>
> > Try forceSync=no
> >
> > forceSync
> >
> > (Java system property: *zookeeper.forceSync*)
> >
> > Requires updates to be synced to media of the transaction log before
> > finishing processing the update. If this option is set to no, ZooKeeper
> > will not require updates to be synced to the media.
> >
> > This is a risk unless your zookeeper nodes are in the same rack.
> >
> > Check also this
> >
> > http://www.edwardcapriolo.com/roller/edwardcapriolo/entry/zookeeper_psuedo_scalability_and_absolute
> >
> > On Thu, Jan 23, 2014 at 10:53 AM, Ahmed H.
> > wrote:
> >
> > > Hello,
> > >
> > > I am running ZK on a shared disk (I know, I shouldn't be, but I am
> > > constrained right now) alongside Kafka 0.8 beta. We are seeing really
> > > long fsync times (according to the logs), followed by our Kafka clients
> > > losing their connection. Kafka attempts to reconnect a few times and
> > > eventually dies because it hits the maximum number of retry attempts.
> > >
> > > The fsync warnings are shown below:
> > >
> > > 2014-01-23 13:18:38,746 [myid:] - WARN [SyncThread:0:FileTxnLog@321] - fsync-ing the write ahead log in SyncThread:0 took 12762ms which will adversely effect operation latency. See the ZooKeeper troubleshooting guide
> > > 2014-01-23 13:23:41,332 [myid:] - WARN [SyncThread:0:FileTxnLog@321] - fsync-ing the write ahead log in SyncThread:0 took 7552ms which will adversely effect operation latency. See the ZooKeeper troubleshooting guide
> > > 2014-01-23 13:28:49,656 [myid:] - WARN [SyncThread:0:FileTxnLog@321] - fsync-ing the write ahead log in SyncThread:0 took 6350ms which will adversely effect operation latency. See the ZooKeeper troubleshooting guide
> > > 2014-01-23 13:33:45,063 [myid:] - WARN [SyncThread:0:FileTxnLog@321] - fsync-ing the write ahead log in SyncThread:0 took 1039ms which will adversely effect operation latency. See the ZooKeeper troubleshooting guide
> > > 2014-01-23 13:34:00,024 [myid:] - WARN [SyncThread:0:FileTxnLog@321] - fsync-ing the write ahead log in SyncThread:0 took 9490ms which will adversely effect operation latency. See the ZooKeeper troubleshooting guide
> > > 2014-01-23 13:44:09,003 [myid:] - WARN [SyncThread:0:FileTxnLog@321] - fsync-ing the write ahead log in SyncThread:0 took 8747ms which will adversely effect operation latency. See the ZooKeeper troubleshooting guide
> > >
> > > This is also followed by some of these for good measure:
> > >
> > > 2014-01-23 13:49:19,427 [myid:] - ERROR [SyncThread:0:NIOServerCnxn@180] - Unexpected Exception:
> > > java.nio.channels.CancelledKeyException
> > >     at sun.nio.ch.SelectionKeyImpl.ensureValid(SelectionKeyImpl.java:73)
> > >     at sun.nio.ch.SelectionKeyImpl.interestOps(SelectionKeyImpl.java:77)
> > >     at org.apache.zookeeper.server.NIOServerCnxn.sendBuffer(NIOServerCnxn.java:153)
> > >     at org.apache.zookeeper.server.NIOServerCnxn.sendResponse(NIOServerCnxn.java:1076)
> > >     at org.apache.zookeeper.server.FinalRequestProcessor.processRequest(FinalRequestProcessor.java:170)
> > >     at org.apache.zookeeper.server.SyncRequestProcessor.flush(SyncRequestProcessor.java:167)
> > >     at org.apache.zookeeper.server.SyncRequestProcessor.run(SyncRequestProcessor.java:101)
> > >
> > > The way I see it, I currently have two problems: 1) the ZK setup is an
> > > issue because of the shared disk, and 2) the Kafka clients do not
> > > automatically recover once they hit the maximum number of retries. I am
> > > looking for a way to at least mitigate the ZooKeeper issue. Perhaps if I
> > > modify the timeouts in such a way that the Kafka clients don't fail like
> > > they do...
> > >
> > > What are the best ways to mitigate the issue for now, as I am limited to
> > > a single disk? Increasing tickTime? My current ZK config is the default
> > > that comes with version 3.4.5, so the tickTime is 2000. My Kafka clients
> > > have defined the zktimeout variable to be 30000.
> > >
> > > I realize that this is a ZooKeeper mailing list and I cannot yet pinpoint
> > > the exact cause of my problems, but ZK appears to be the culprit.
> > >
> > > Thanks
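
Since the thread asks how tickTime relates to the session timeout, here is a rough Python sketch of how a ZooKeeper server bounds the timeout a client requests. It assumes the documented defaults (minSessionTimeout = 2 x tickTime, maxSessionTimeout = 20 x tickTime) and assumes that the "zktimeout" value of 30000 mentioned above is what the Kafka client passes as its session timeout. The function name and parameters are made up for illustration and are not part of any ZooKeeper or Kafka API.

```python
def negotiated_session_timeout(requested_ms, tick_time_ms=2000,
                               min_ms=None, max_ms=None):
    """Clamp a requested session timeout to the server's allowed range.

    Hypothetical helper: mirrors the documented default bounds of
    2 * tickTime (minimum) and 20 * tickTime (maximum).
    """
    if min_ms is None:
        min_ms = 2 * tick_time_ms   # default minSessionTimeout
    if max_ms is None:
        max_ms = 20 * tick_time_ms  # default maxSessionTimeout
    return max(min_ms, min(requested_ms, max_ms))


if __name__ == "__main__":
    # tickTime=2000 (the 3.4.5 default config) allows 4000..40000 ms.
    print(negotiated_session_timeout(30000))
    # A request above the upper bound is capped at 20 * tickTime.
    print(negotiated_session_timeout(60000))
    # Raising tickTime to 3000 widens the range to 6000..60000 ms.
    print(negotiated_session_timeout(60000, tick_time_ms=3000))
```

Under these assumptions, a 30000 ms request is accepted unchanged with tickTime=2000, while anything above 40000 ms would be capped unless tickTime or maxSessionTimeout is raised on the server as well.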