From: Mahadev Konar
Date: Tue, 20 Apr 2010 17:37:28 -0700
Subject: Re: odd error message
Reply-To: zookeeper-user@hadoop.apache.org

Ok, I think this is possible. So here is what happens currently. This has
been a long-standing bug and should be fixed in 3.4!

https://issues.apache.org/jira/browse/ZOOKEEPER-335

A newly elected leader currently doesn't log the new leader transaction to
its database. In your case, the follower (the 3rd server) did log it, but the
leader never did. So when you brought up the 3rd server, it had the
transaction in its log but the leader did not. In that case the 3rd server
cried foul and shut down.

Removing the DB is totally fine. For now, we should update our docs on 3.3
to mention that this problem might occur during an upgrade, and fix it in 3.4.

Thanks for bringing it up, Ted.

Thanks
mahadev

On 4/20/10 2:14 PM, "Ted Dunning" wrote:

> We have just done an upgrade of ZK to 3.3.0. Previous to this, ZK has been
> up for about a year with no problems.
>
> On two nodes, we killed the previous instance and started the 3.3.0
> instance. The first node was a follower and the second a leader.
>
> All went according to plan and no clients seemed to notice anything. The
> stat command showed connections moving around as expected and all other
> indicators were normal.
>
> When we did the third node, we saw this in the log:
>
> 2010-04-20 14:07:49,010 - FATAL [QuorumPeer:/0.0.0.0:2181:Follower@71] -
> Leader epoch 18 is less than our epoch 19
>
> The third node refused all connections.
>
> We brought down the third node, wiped away its snapshot, restarted and it
> joined without complaint. Note that the third node
> was originally a follower and had never been a leader during the upgrade
> process.
>
> Does anybody know why this happened?
>
> We are fully upgraded and there was no interruption to normal service, but
> this seems strange.
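
For anyone reading this thread later, here is a rough sketch of the kind of epoch comparison behind the FATAL line quoted above. It is not ZooKeeper's actual code. It only relies on the documented zxid layout (a 64-bit id with the epoch in the high 32 bits and a counter in the low 32 bits). The function names make_zxid, epoch_of and check_leader_epoch are made up for illustration, and the epochs 18 and 19 are taken from the log message and treated as plain integers.

```python
# Illustrative sketch only; names here are hypothetical, not ZooKeeper APIs.
EPOCH_SHIFT = 32                       # epoch lives in the high 32 bits of a zxid
COUNTER_MASK = (1 << EPOCH_SHIFT) - 1  # counter lives in the low 32 bits


def make_zxid(epoch, counter):
    """Pack an epoch and a per-epoch counter into one 64-bit zxid."""
    return (epoch << EPOCH_SHIFT) | (counter & COUNTER_MASK)


def epoch_of(zxid):
    """Extract the epoch (high 32 bits) from a zxid."""
    return zxid >> EPOCH_SHIFT


def check_leader_epoch(leader_zxid, follower_last_zxid):
    """Return an error string if the leader's epoch is behind the follower's,
    otherwise None (the follower may go on to sync with the leader)."""
    leader_epoch = epoch_of(leader_zxid)
    our_epoch = epoch_of(follower_last_zxid)
    if leader_epoch < our_epoch:
        return f"Leader epoch {leader_epoch} is less than our epoch {our_epoch}"
    return None


# The follower logged a transaction in epoch 19, but the leader never logged
# its own new-leader transaction, so the leader still reports epoch 18.
print(check_leader_epoch(make_zxid(18, 5), make_zxid(19, 0)))
```

Under these assumptions, wiping the follower's DB gets it past the check because it no longer has any logged epoch ahead of the leader's.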