Message-ID: <344623992.1255062331311.JavaMail.jira@brutus>
Date: Thu, 8 Oct 2009 21:25:31 -0700 (PDT)
From: "Patrick Hunt (JIRA)"
To: zookeeper-dev@hadoop.apache.org
Subject: [jira] Commented: (ZOOKEEPER-512) FLE election fails to elect leader
In-Reply-To: <2078989139.1250727134900.JavaMail.jira@brutus>

    [ https://issues.apache.org/jira/browse/ZOOKEEPER-512?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12763829#action_12763829 ]

Patrick Hunt commented on ZOOKEEPER-512:
----------------------------------------

I tried your latest patch with the latest trunk code and I'm not able to reproduce the problem. Looks like this is addressing the problem.

> FLE election fails to elect leader
> ----------------------------------
>
>                 Key: ZOOKEEPER-512
>                 URL: https://issues.apache.org/jira/browse/ZOOKEEPER-512
>             Project: Zookeeper
>          Issue Type: Bug
>          Components: quorum, server
>    Affects Versions: 3.2.0
>            Reporter: Patrick Hunt
>            Assignee: Flavio Paiva Junqueira
>            Priority: Blocker
>             Fix For: 3.3.0
>
>         Attachments: jst.txt, log3_debug.tar.gz, logs.tar.gz, logs2.tar.gz, t5_aj.tar.gz, ZOOKEEPER-512.patch, ZOOKEEPER-512.patch, ZOOKEEPER-512.patch, ZOOKEEPER-512.patch
>
>
> I was doing some fault injection testing of 3.2.1 with ZOOKEEPER-508 patch applied and noticed that after some time the ensemble failed to re-elect a leader.
> See the attached log files - 5 member ensemble. typically 5 is the leader
> Notice that after 16:23:50,525 no quorum is formed, even after 20 minutes elapses w/no quorum
> environment:
> I was doing fault injection testing using aspectj. The faults are injected into socketchannel read/write, I throw exceptions randomly at a 1/200 ratio (rand.nextFloat() <= .005 => throw IOException)
> You can see when a fault is injected in the log via:
> 2009-08-19 16:57:09,568 - INFO  [Thread-74:ReadRequestFailsIntermittently@38] - READPACKET FORCED FAIL
> vs a read/write that didn't force fail:
> 2009-08-19 16:57:09,568 - INFO  [Thread-74:ReadRequestFailsIntermittently@41] - READPACKET OK
> otw standard code/config (straight fle quorum with 5 members)
> also see the attached jstack trace. this is for one of the servers. Notice in particular that the number of sendworkers != the number of recv workers.

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.
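
For readers who want to reproduce the kind of fault injection the issue description mentions (an IOException thrown when rand.nextFloat() <= .005, roughly 1 in 200 calls), here is a minimal sketch in plain Java. It is not the AspectJ aspect used in the original test, which is not included in this message. The class name FaultInjector, its constructor parameters, and the maybeFail method are hypothetical names chosen for illustration only.

```java
import java.io.IOException;
import java.util.Random;

/**
 * Hypothetical helper that mimics the random failure rule quoted in the
 * issue: throw an IOException when rand.nextFloat() <= threshold.
 * This is an illustrative sketch, not code from ZooKeeper or the test rig.
 */
class FaultInjector {
    private final Random rand;
    private final float failureThreshold;

    FaultInjector(Random rand, float failureThreshold) {
        this.rand = rand;
        this.failureThreshold = failureThreshold;
    }

    /**
     * Call this before a socket read or write. It throws with probability
     * of roughly failureThreshold (0.005f gives about 1 failure in 200 calls)
     * and otherwise returns normally.
     */
    void maybeFail(String operation) throws IOException {
        if (rand.nextFloat() <= failureThreshold) {
            throw new IOException(operation + " FORCED FAIL");
        }
    }
}
```

In the reported setup the check was woven into SocketChannel read/write through AspectJ; with this sketch a caller would instead invoke something like `new FaultInjector(new Random(), 0.005f).maybeFail("READPACKET")` explicitly before each I/O operation.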