Date: Fri, 28 Feb 2014 20:10:32 +0000 (UTC)
From: "ASF subversion and git services (JIRA)"
To: notifications@accumulo.apache.org
Reply-To: jira@apache.org
Subject: [jira] [Commented] (ACCUMULO-2422) Backup master can miss acquiring lock when primary exits

    [ https://issues.apache.org/jira/browse/ACCUMULO-2422?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13916305#comment-13916305 ]

ASF subversion and git services commented on ACCUMULO-2422:
-----------------------------------------------------------

Commit 7eeff02c7cf883765a33575a19d208be30e1e17c in accumulo's branch refs/heads/master from [~bhavanki]
[ https://git-wip-us.apache.org/repos/asf?p=accumulo.git;h=7eeff02 ]

ACCUMULO-2422 Refine renewal of master lock watcher

The first commit for ACCUMULO-2422 succeeds in renewing the watch on another master's lock node when needed. This commit refines the solution:

- The renewal was happening even after the master was able to acquire the lock. This led to a spurious log error message. This commit skips renewing the watch in that case.
- If the renewal returns a null status, meaning the other master's lock node disappeared, the master now immediately tries again to acquire the lock. This matches watch establishment in other areas.

A lot of logging at the trace level was added to ZooLock to assist future troubleshooting.

> Backup master can miss acquiring lock when primary exits
> --------------------------------------------------------
>
>                 Key: ACCUMULO-2422
>                 URL: https://issues.apache.org/jira/browse/ACCUMULO-2422
>             Project: Accumulo
>          Issue Type: Bug
>          Components: fate, master
>   Affects Versions: 1.5.1
>            Reporter: Bill Havanki
>            Assignee: Bill Havanki
>            Priority: Critical
>              Labels: failover, locking
>             Fix For: 1.6.0, 1.5.2
>
>
> While running randomwalk tests with agitation for the 1.5.1 release, I've seen situations where a backup master that is eligible to grab the master lock continues to wait. When this condition arises and the other master restarts, both wait for the lock without success.
> I cannot reproduce the problem reliably, and I think more investigation is needed to see what circumstances could be causing the problem.
> h3. Diagnosis and Work Around
> This failure condition can occur on start up and on backup/active failover of the Master role. If the following log entry is the final entry in all Master logs, you should restart all Master roles, staggering the restarts by a few seconds.
> {noformat}
> [master.Master] INFO : trying to get master lock
> {noformat}
> If starting a cluster with multiple Master roles, you should stagger Master role starts by a few seconds.

--
This message was sent by Atlassian JIRA
(v6.1.5#6160)
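
The two refinements described in the commit message above can be read as a small decision rule. The following is a minimal sketch of that rule only, written in Python rather than Accumulo's Java, and it is not the actual ZooLock code. The function name `on_watch_fired`, the `renew_watch` callable, and the three action constants are all hypothetical. `renew_watch` stands in for whatever call re-establishes the watch on the other master's lock node; it is assumed to return a status object, or `None` when that node no longer exists (ZooKeeper's `exists` call behaves this way, returning null for a missing node).

```python
from typing import Callable, Optional

# Hypothetical action names for this sketch; they are not Accumulo identifiers.
SKIP_RENEWAL = "skip_renewal"
RETRY_LOCK = "retry_lock"
KEEP_WAITING = "keep_waiting"


def on_watch_fired(holds_lock: bool,
                   renew_watch: Callable[[], Optional[dict]]) -> str:
    """Decide what a master does when its watch on another master's lock node fires."""
    if holds_lock:
        # First refinement: the lock is already ours, so the watch is not
        # renewed (renewing here was the source of the spurious log error).
        return SKIP_RENEWAL

    # Assumed stand-in for re-establishing the watch; None means the node is gone.
    status = renew_watch()
    if status is None:
        # Second refinement: the other master's lock node disappeared,
        # so try to acquire the lock again right away.
        return RETRY_LOCK

    # The other master's node still exists; the renewed watch stays in place.
    return KEEP_WAITING
```

In this sketch the renewal callable is never invoked once the lock is held, which corresponds to the first bullet; the `None` branch corresponds to the second.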