accumulo-notifications mailing list archives

From "Bill Havanki (JIRA)" <>
Subject [jira] [Commented] (ACCUMULO-2422) Backup master can miss acquiring lock when primary exits
Date Fri, 28 Feb 2014 15:22:21 GMT


Bill Havanki commented on ACCUMULO-2422:

What prevents it is that the master "gets" the lock if it has the lock node with the lowest
sequential number, as assigned by ZooKeeper. So, extending my example above, the first master
originally had the lock with node 206. The second master got 207, but noticed that 206 existed
already so it set up a watch on it. So far, so good.
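
The rule above can be sketched in a few lines. This is only an illustration, not Accumulo's
lock code: the "zlock-" node names and the lock_decision helper are hypothetical, and the
sketch assumes a waiting master watches the node immediately ahead of its own.

```python
def sequence_of(node_name):
    """Return the numeric suffix that ZooKeeper appends to a sequential node."""
    return int(node_name.rsplit("-", 1)[1])


def lock_decision(children, my_node):
    """Decide what the owner of my_node should do, given all lock nodes.

    Returns ("acquired", None) when my_node has the lowest sequence number,
    otherwise ("watch", predecessor), where predecessor is the node just
    ahead of my_node in sequence order.
    """
    ordered = sorted(children, key=sequence_of)
    index = ordered.index(my_node)
    if index == 0:
        return ("acquired", None)
    return ("watch", ordered[index - 1])


# Hypothetical node names standing in for nodes 206 and 207 from the example.
children = ["zlock-0000000207", "zlock-0000000206"]
first = lock_decision(children, "zlock-0000000206")
second = lock_decision(children, "zlock-0000000207")
print(first)
print(second)
```

Here the first master (206) holds the lock, and the second master (207) is told to watch 206.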

Normally, when the first master exits, the second one gets the deletion event and gets the
lock. But in this scenario, the second master gets a node-change event instead. ZooKeeper
watches fire only once, so that event uses up the watch, and because no new watch is set, the
second master will never be notified again. Now, the first master exits, so all that is left
is node 207. The second master doesn't get the lock; it just waits and waits forever.

The first master restarts and gets node 208. It sees that the second master has 207, so it
sets up a watch on it, assuming that the second master has the lock. So, it doesn't get the
lock either. It waits and waits forever.
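
The whole sequence can be replayed with a small in-memory model. This is a toy sketch, not
ZooKeeper or Accumulo code: LockDir, Master, and run are hypothetical names, and the
rewatch_on_change flag is only an illustrative variation that shows how the outcome changes
if the waiter re-checks after a node-change event. It is not a description of the fix applied
for this issue.

```python
class LockDir:
    """Toy stand-in for the lock directory, with one-shot watches."""

    def __init__(self, next_seq):
        self.next_seq = next_seq
        self.nodes = set()
        self.watches = {}  # sequence number -> list of callbacks

    def create(self):
        seq = self.next_seq
        self.next_seq += 1
        self.nodes.add(seq)
        return seq

    def watch(self, seq, callback):
        self.watches.setdefault(seq, []).append(callback)

    def _fire(self, seq, event):
        # One-shot semantics: the watches are removed before they are called.
        for callback in self.watches.pop(seq, []):
            callback(event)

    def change(self, seq):
        self._fire(seq, "changed")

    def delete(self, seq):
        self.nodes.discard(seq)
        self._fire(seq, "deleted")


class Master:
    def __init__(self, lock_dir, rewatch_on_change):
        self.dir = lock_dir
        self.rewatch_on_change = rewatch_on_change
        self.has_lock = False
        self.seq = lock_dir.create()
        self.check()

    def check(self):
        lower = [s for s in self.dir.nodes if s < self.seq]
        if not lower:
            self.has_lock = True  # lowest sequence number wins
        else:
            self.dir.watch(max(lower), self.on_event)

    def on_event(self, event):
        if event == "deleted" or self.rewatch_on_change:
            self.check()
        # Otherwise the watch is gone and nothing replaces it.


def run(rewatch_on_change):
    lock_dir = LockDir(next_seq=206)
    first = Master(lock_dir, rewatch_on_change)      # node 206, holds the lock
    second = Master(lock_dir, rewatch_on_change)     # node 207, watches 206
    lock_dir.change(first.seq)                       # node-change event uses up the watch
    lock_dir.delete(first.seq)                       # first master exits
    restarted = Master(lock_dir, rewatch_on_change)  # node 208, watches 207
    return second.has_lock, restarted.has_lock


print(run(rewatch_on_change=False))
print(run(rewatch_on_change=True))
```

With rewatch_on_change=False, neither the second master nor the restarted first master ends
up with the lock, which matches the behavior described above.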

> Backup master can miss acquiring lock when primary exits
> --------------------------------------------------------
>                 Key: ACCUMULO-2422
>                 URL:
>             Project: Accumulo
>          Issue Type: Bug
>          Components: fate, master
>    Affects Versions: 1.5.0
>            Reporter: Bill Havanki
>            Assignee: Bill Havanki
>            Priority: Critical
>              Labels: failover, locking
> While running randomwalk tests with agitation for the 1.5.1 release, I've seen situations
> where a backup master that is eligible to grab the master lock continues to wait. When this
> condition arises and the other master restarts, both wait for the lock without success.
> I cannot reproduce the problem reliably, and I think more investigation is needed to
> see what circumstances could be causing the problem.

This message was sent by Atlassian JIRA
