accumulo-notifications mailing list archives

From "Bill Havanki (JIRA)" <>
Subject [jira] [Commented] (ACCUMULO-2422) Backup master can miss acquiring lock when primary exits
Date Fri, 28 Feb 2014 15:06:26 GMT


Bill Havanki commented on ACCUMULO-2422:

I might have figured it out, though I still need to prove it.

The "losing" master server sets a ZK watch on the "winning" server's lock node, so that it
can grab the lock when that node disappears. However, ZK watches are only good for _one event_ ([reference|]).
If something else happens to the node before it is deleted, then an event for that is sent
and the watch is used up, so no event is sent for the deletion.

Once a master gets a lock, it replaces its lock node's data when it determines its port (see
ACCUMULO-1664 and ACCUMULO-1999). This triggers a NodeDataChanged event. Example:

2014-02-27 18:43:15,141 [zookeeper.ZooLock] DEBUG: - type  NodeDataChanged
2014-02-27 18:43:15,141 [zookeeper.ZooLock] DEBUG: - path  /accumulo/cdeab4df-78e3-4c7f-897b-92f4d98f9602/masters/lock/zlock-0000000206
2014-02-27 18:43:15,141 [zookeeper.ZooLock] DEBUG: - state SyncConnected

This event is sent to the other master's watcher, which does nothing with it, and then the
watch dies. So, it won't get a NodeDeleted event later to let it grab the lock. The way to
fix this is to set a new watch.
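
To illustrate the reasoning above, here is a minimal, self-contained sketch of one-shot watch behavior. It is a toy model, not ZooKeeper's API and not the ZooLock code: the ToyStore and BackupMaster classes, the method names, and the shortened lock path are all hypothetical and exist only for this illustration. It shows a backup that ignores NodeDataChanged missing the later NodeDeleted, and a backup that sets a new watch receiving it.

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;
import java.util.function.Consumer;

public class OneShotWatchDemo {
    enum EventType { NodeDataChanged, NodeDeleted }

    /** Hypothetical stand-in for a node store whose watches fire only once. */
    static class ToyStore {
        private final Map<String, String> nodes = new HashMap<>();
        private final Map<String, List<Consumer<EventType>>> watches = new HashMap<>();

        void create(String path, String data) {
            nodes.put(path, data);
        }

        /** Registers a one-shot watch and reports whether the node exists. */
        boolean exists(String path, Consumer<EventType> watcher) {
            watches.computeIfAbsent(path, p -> new ArrayList<>()).add(watcher);
            return nodes.containsKey(path);
        }

        void setData(String path, String data) {
            nodes.put(path, data);
            fire(path, EventType.NodeDataChanged);
        }

        void delete(String path) {
            nodes.remove(path);
            fire(path, EventType.NodeDeleted);
        }

        private void fire(String path, EventType type) {
            // Watches are removed before delivery: each one sees at most one event.
            List<Consumer<EventType>> fired = watches.remove(path);
            if (fired != null) {
                for (Consumer<EventType> w : fired) {
                    w.accept(type);
                }
            }
        }
    }

    /** Hypothetical backup master that waits for the lock node to be deleted. */
    static class BackupMaster {
        private final ToyStore store;
        private final String lockPath;
        private final boolean rewatch;
        boolean sawDeletion = false;

        BackupMaster(ToyStore store, String lockPath, boolean rewatch) {
            this.store = store;
            this.lockPath = lockPath;
            this.rewatch = rewatch;
        }

        void watchLock() {
            store.exists(lockPath, this::process);
        }

        private void process(EventType type) {
            if (type == EventType.NodeDeleted) {
                sawDeletion = true;   // this is where the backup would grab the lock
            } else if (rewatch) {
                watchLock();          // the proposed fix: set a new watch
            }
            // Without rewatch, any other event is ignored and the watch is gone.
        }
    }

    /** Replays the suspected sequence: watch set, data replaced, node deleted. */
    static boolean backupSeesDeletion(boolean rewatch) {
        ToyStore store = new ToyStore();
        String lock = "/masters/lock/zlock-0000000206"; // shortened, illustrative path
        store.create(lock, "");                  // winning master creates its lock node
        BackupMaster backup = new BackupMaster(store, lock, rewatch);
        backup.watchLock();                      // losing master sets its watch
        store.setData(lock, "host:port");        // winner replaces the node data
        store.delete(lock);                      // winner exits
        return backup.sawDeletion;
    }

    public static void main(String[] args) {
        System.out.println("without re-watch: " + backupSeesDeletion(false)); // false
        System.out.println("with re-watch:    " + backupSeesDeletion(true));  // true
    }
}
```

In this model the only difference between the two runs is whether the watcher registers again after an event it does not care about, which matches the proposed fix.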

This scenario is difficult to reproduce because both masters need to be started almost simultaneously,
and the losing master must set its watch after the winning master creates its lock node but
before it replaces the node's data. I'm going to try to trigger this by making the winning master
delay the replacement.

> Backup master can miss acquiring lock when primary exits
> --------------------------------------------------------
>                 Key: ACCUMULO-2422
>                 URL:
>             Project: Accumulo
>          Issue Type: Bug
>          Components: fate, master
>    Affects Versions: 1.5.0
>            Reporter: Bill Havanki
>            Assignee: Bill Havanki
>            Priority: Critical
>              Labels: failover, locking
> While running randomwalk tests with agitation for the 1.5.1 release, I've seen situations
> where a backup master that is eligible to grab the master lock continues to wait. When this
> condition arises and the other master restarts, both wait for the lock without success.
> I cannot reproduce the problem reliably, and I think more investigation is needed to
> see what circumstances could be causing the problem.

This message was sent by Atlassian JIRA
