From: Filip Deleersnijder
Subject: Re: Leader election problems
Date: Thu, 25 Jun 2015 16:34:27 +0200
To: user@zookeeper.apache.org

Hi,

I can see that all of our logs contain the following log-statements pretty often.

2015-06-22 12:02:00,752 [myid:2] - DEBUG [main:DataTree@949][] - Ignoring processTxn failure hdr: -1 : error: -2
2015-06-22 12:02:00,753 [myid:2] - DEBUG [main:DataTree@949][] - Ignoring processTxn failure hdr: 14 : error: -101
2015-06-25 14:02:39,505 [myid:3] - DEBUG [QuorumPeer[myid=3]/0:0:0:0:0:0:0:0:2181:FileTxnLog$FileTxnIterator@636] - EOF excepton java.io.EOFException: Failed to read c:\motum\config\MASS\ZK\version-2\log.1aa00000001

Since we don’t shut the ZK process down properly ( we just shut down Windows ), this can probably cause corruption of files.

Is there somebody who has a clear idea about whether the “EOF” or the “Ignoring processTxn” problems could cause frequent and long-lasting leader elections ?

Any help is greatly appreciated,

Filip

> On 25 Jun 2015, at 11:51, Guy Moshkowich wrote:
>
> Are you using ZK client on your vehicles or ZK servers?
> You mentioned below 8 vehicles and I see 8 servers defined in the config.
> I would expect you have 8 clients (running on your vehicles) communicating
> against 1 or 3 ZK servers, as this will be more than enough for 8 clients.
> Guy
>
> On Thursday, June 25, 2015, Filip Deleersnijder wrote:
>
>> Hi,
>>
>> Thanks for your response.
>>
>> Our application consists of 8 automatic vehicles in a warehouse setting.
>> Those vehicles need some consensus decisions, and that is what we use
>> Zookeeper for.
>> Because vehicles can come and go at random, we installed a ZK participant
>> on every vehicle. The ZK client is some other piece of software that is
>> also running on the vehicles.
>>
>> Therefore :
>> - We cannot choose the number of ZK participants because it just
>> depends on the number of vehicles.
>> - The participants communicate over Wifi
>> - The client is running on the same machine, so it communicates
>> over the local network
>>
>> We are running Zookeeper version 3.4.6
>>
>> Our zoo.cfg can be found below this e-mail.
>>
>> Thanks in advance !
>>
>> Filip
>>
>> # The number of milliseconds of each tick
>> tickTime=2000
>> # The number of ticks that the initial
>> # synchronization phase can take
>> initLimit=10
>> # The number of ticks that can pass between
>> # sending a request and getting an acknowledgement
>> syncLimit=5
>> # the directory where the snapshot is stored.
>> # do not use /tmp for storage, /tmp here is just
>> # example sakes.
>> dataDir=c:/motum/config/MASS/ZK
>> # the port at which the clients will connect
>> clientPort=2181
>>
>> server.1=172.17.35.11:2888:3888
>> server.2=172.17.35.12:2888:3888
>> server.3=172.17.35.13:2888:3888
>> server.4=172.17.35.14:2888:3888
>> server.5=172.17.35.15:2888:3888
>> server.6=172.17.35.16:2888:3888
>> server.7=172.17.35.17:2888:3888
>> server.8=172.17.35.18:2888:3888
>>
>> # The number of snapshots to retain in dataDir
>> # Purge task interval in hours
>> # Set to "0" to disable auto purge feature
>> autopurge.snapRetainCount=3
>> autopurge.purgeInterval=1
>>
>>
>>
>>> On 24 Jun 2015, at 18:54, Raúl Gutiérrez Segalés wrote:
>>>
>>> Hi,
>>>
>>> On 24 June 2015 at 06:05, Filip Deleersnijder wrote:
>>>
>>>> Hi,
>>>>
>>>> Let’s start with some description of our system :
>>>>
>>>> - We are using a Zookeeper cluster with 8 participants for an application
>>>> with mobile nodes ( connected over Wifi ).
>>>>
>>>
>>> You mean the participants talk over wifi or the clients?
>>>
>>>
>>>> ( IPs of the different nodes follow this structure :
>>>> Node X has IP 172.17.35.1X )
>>>>
>>>
>>> Why 8 and not an odd number of machines (i.e.:
>>> http://zookeeper.apache.org/doc/trunk/zookeeperAdmin.html#sc_zkMulitServerSetup
>>> )?
>>>
>>>> - It is not that unusual to have a node being shut down or restarted
>>>> - We haven’t benchmarked the number of write operations yet, but I would
>>>> estimate that it would be less than 10 writes / second
>>>>
>>>
>>> What version of ZK are you using?
>>>
>>>
>>>>
>>>> The problem we are having however is that sometimes(*), some instances
>>>> seem to be having problems with leader election.
>>>> Under the header “Attachment 1” below, you can find the leader election
>>>> times that were needed over 24h ( from 1 node ). On average it took more
>>>> than 1 minute !
>>>> I assume that this is not normal behaviour ? ( If somebody could confirm
>>>> that in an 8-node cluster, these are not normal leader election times, that
>>>> would be nice )
>>>>
>>>> In Attachment 2 : I included an extract from the logging during a leader
>>>> election that took 101874ms for 1 node ( server 2 ).
>>>>
>>>> Any help is greatly appreciated.
>>>> If further or more specific logging is required, please ask !
>>>>
>>>>
>>> Do you mind sharing a copy of your config file (zoo.cfg)? Thanks!
>>>
>>>
>>> -rgs
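
On the question of 8 participants versus an odd number, the majority-quorum arithmetic can be sketched roughly as below. This is only an illustration, not ZooKeeper code: the function names are hypothetical, and the only inputs taken from the thread are the ensemble size (8) and the tickTime, initLimit and syncLimit values from the quoted zoo.cfg. It assumes all servers are voting participants under the default majority quorum.

```python
# Illustrative helpers only; these names are not part of any ZooKeeper API.
# Assumption: every server is a voting participant and the default
# majority quorum is in use.


def quorum_size(voters: int) -> int:
    """Smallest strict majority of `voters` voting servers."""
    return voters // 2 + 1


def tolerated_failures(voters: int) -> int:
    """How many voting servers can be down while a majority remains."""
    return voters - quorum_size(voters)


def limit_ms(tick_time_ms: int, ticks: int) -> int:
    """Convert a limit expressed in ticks (initLimit, syncLimit) to milliseconds."""
    return tick_time_ms * ticks


if __name__ == "__main__":
    # Compare the 8-server ensemble from the thread with its odd neighbours.
    for n in (7, 8, 9):
        print(f"{n} voters: quorum={quorum_size(n)}, "
              f"tolerated failures={tolerated_failures(n)}")
    # Values from the quoted zoo.cfg: tickTime=2000, initLimit=10, syncLimit=5.
    print("initLimit:", limit_ms(2000, 10), "ms")
    print("syncLimit:", limit_ms(2000, 5), "ms")
```

With 8 voters a quorum needs 5 servers, so the ensemble tolerates 3 failures, the same as with 7 voters. The quoted settings correspond to an initLimit of 20000 ms and a syncLimit of 10000 ms.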