accumulo-notifications mailing list archives

From "Bill Havanki (JIRA)" <j...@apache.org>
Subject [jira] [Commented] (ACCUMULO-2422) Backup master can miss acquiring lock when primary exits
Date Fri, 28 Feb 2014 14:02:21 GMT

    [ https://issues.apache.org/jira/browse/ACCUMULO-2422?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13915804#comment-13915804 ]

Bill Havanki commented on ACCUMULO-2422:
----------------------------------------

The timeframe I observed is indefinite, or at least until the agitator bounces both masters.

Stack dumps haven't been terribly helpful so far, although I haven't done a lot of testing
yet. Looking at the ZK data is more informative. I added a bunch of logging in the master for
what it sees in ZK and what it decides to do. So far it appears, at least in one case, that
the backup master just didn't notice the active master's node getting deleted. I have even
more logging in there now, which I'm checking this morning, to see whether it gets no event
at all or gets one and fails to process it.
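
For illustration only, the suspected failure mode can be modeled with a toy in-memory lock directory. This is a sketch under stated assumptions, not the Accumulo master code and not the ZooKeeper client API: it assumes a lock scheme in which each waiting master sets a one-shot watch on the node ahead of it, and every name in it (LockWatchSketch, join, tryAcquire, delete, deliverEvent) is hypothetical. It shows how a waiter that never processes the deletion event can remain eligible for the lock without ever acquiring it, and how logging each observation and decision makes that visible.

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.HashSet;
import java.util.List;
import java.util.Map;
import java.util.Set;
import java.util.TreeMap;

// Toy in-memory stand-in for a ZooKeeper lock directory.
// All names are hypothetical; this is not the Accumulo or ZooKeeper API.
class LockWatchSketch {
    // Candidate nodes ordered by sequence number; the lowest is entitled to the lock.
    private final TreeMap<Integer, String> candidates = new TreeMap<>();
    // One-shot watches, keyed by the sequence number of the watched node.
    private final Map<Integer, Runnable> watches = new HashMap<>();
    // Masters that have concluded they hold the lock.
    private final Set<String> believers = new HashSet<>();
    // Record of what each master saw and what it decided to do.
    final List<String> log = new ArrayList<>();
    private int nextSeq = 0;

    // Register a master as a lock candidate and return its sequence number.
    int join(String name) {
        int seq = nextSeq++;
        candidates.put(seq, name);
        return seq;
    }

    // Name of the master whose node is currently lowest, or null if none.
    String eligible() {
        return candidates.isEmpty() ? null : candidates.firstEntry().getValue();
    }

    boolean believesItHoldsLock(String name) {
        return believers.contains(name);
    }

    // Take the lock if this node is lowest; otherwise watch the node ahead of it.
    boolean tryAcquire(int seq) {
        String name = candidates.get(seq);
        int lowest = candidates.firstKey();
        log.add(name + " sees lowest node " + lowest);
        if (lowest == seq) {
            believers.add(name);
            log.add(name + " acquires the lock");
            return true;
        }
        int predecessor = candidates.lowerKey(seq);
        watches.put(predecessor, () -> {
            log.add(name + " notified: node " + predecessor + " deleted");
            tryAcquire(seq);
        });
        log.add(name + " waits, watching node " + predecessor);
        return false;
    }

    // Delete a node. deliverEvent=false models a watch event that is never processed.
    void delete(int seq, boolean deliverEvent) {
        String name = candidates.remove(seq);
        believers.remove(name);
        Runnable watch = watches.remove(seq);
        if (watch != null && deliverEvent) {
            watch.run();
        }
    }

    public static void main(String[] args) {
        LockWatchSketch zk = new LockWatchSketch();
        int primary = zk.join("primary");
        int backup = zk.join("backup");
        zk.tryAcquire(primary);
        zk.tryAcquire(backup);
        zk.delete(primary, false); // the event is lost or ignored
        zk.log.forEach(System.out::println);
        System.out.println("eligible: " + zk.eligible()
            + ", believes it holds lock: " + zk.believesItHoldsLock("backup"));
    }
}
```

In this model, the log for the missed-event case stops at the backup's "waits" entry, with no "notified" line after it. That distinguishes "no event arrived" from "an event arrived but was not acted on".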

> Backup master can miss acquiring lock when primary exits
> --------------------------------------------------------
>
>                 Key: ACCUMULO-2422
>                 URL: https://issues.apache.org/jira/browse/ACCUMULO-2422
>             Project: Accumulo
>          Issue Type: Bug
>          Components: fate, master
>    Affects Versions: 1.5.0
>            Reporter: Bill Havanki
>            Assignee: Bill Havanki
>            Priority: Critical
>              Labels: failover, locking
>
> While running randomwalk tests with agitation for the 1.5.1 release, I've seen situations
> where a backup master that is eligible to grab the master lock continues to wait. When this
> condition arises and the other master restarts, both wait for the lock without success.
> I cannot reproduce the problem reliably, and I think more investigation is needed to
> see what circumstances could be causing the problem.



--
This message was sent by Atlassian JIRA
(v6.1.5#6160)
