accumulo-notifications mailing list archives

From "Hudson (JIRA)"
Subject [jira] [Commented] (ACCUMULO-954) ZooLock watcher can stop watching
Date Tue, 05 Feb 2013 00:00:15 GMT


Hudson commented on ACCUMULO-954:

Integrated in Accumulo-Trunk #702
    ACCUMULO-954 Made zoolock rewatch its parent node and added some unit test for zoolock
(Revision 1442429)

     Result = SUCCESS
kturner : 
Files : 
* /accumulo/trunk/fate/src/main/java/org/apache/accumulo/fate/zookeeper/
* /accumulo/trunk/test/src/test/java/org/apache/accumulo/fate
* /accumulo/trunk/test/src/test/java/org/apache/accumulo/fate/zookeeper
* /accumulo/trunk/test/src/test/java/org/apache/accumulo/fate/zookeeper/

> ZooLock watcher can stop watching
> ---------------------------------
>                 Key: ACCUMULO-954
>                 URL:
>             Project: Accumulo
>          Issue Type: Bug
>          Components: tserver
>    Affects Versions: 1.4.2
>            Reporter: Adam Fuchs
>            Assignee: Keith Turner
>            Priority: Minor
>             Fix For: 1.5.0, 1.4.3
> Basically, this will result in tablet servers failing to recognize when they lose their
> locks. I think the worst that can happen with this is a tablet server can fail to die after
> it loses its lock, which could bog down clients and create a bunch of noise in the cluster.
> I believe there could also be useless files generated that wouldn't get garbage collected.
> !METADATA table write protections and logger write protections should prevent any permanent
> damage or data loss. We have seen this result in warnings and errors that look like multiple
> hosting of tablets.
> {code}
> 2013-01-09 19:59:27,742 [tabletserver.TabletServer] INFO : port = 9997
> 2013-01-09 19:59:27,926 [zookeeper.ZooLock] DEBUG: event /accumulo/655f93d8-20fc-451f-a457-458b5717a11e/tservers/ NodeDeleted SyncConnected
> 2013-01-09 19:59:27,931 [tabletserver.TabletServer] INFO : Waiting for tablet server
> 2013-01-09 19:59:32,943 [tabletserver.TabletServer] DEBUG: Obtained tablet server lock
> 2013-01-09 19:59:36,703 [tabletserver.TabletServer] DEBUG: Got loadTablet message from user: !SYSTEM
> {code}
> Here's what happened:
> 1. Tablet server fails to get lock, triggering the watcher on the parent node.
> 2. Watcher doesn't get reset, and doesn't take any action.
> 3. Loop in TabletServer:~2659 retries, but uses the same ZooLock object.
> 4. TabletServer loses its lock, but receives a connection loss message before the NodeDeleted
> 5. TabletServer continues to try to do work instead of killing itself.
> We could probably patch this for 1.4 by creating the ZooLock within the announceExistence
> loop, instead of reusing the same one. Eventually, we ought to have an else branch in both of
> the Watchers that either resets the watch (resilient against ZooKeeper connection hiccups) or
> just kills the server to be safe.
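
The failure sequence and the proposed "else branch" can be illustrated with a small, self-contained sketch. This is not the ZooLock code or the actual patch: `RewatchSketch`, `FakeNode`, `LockWatcher`, and the `OTHER` event type are hypothetical names, and the fake node only models the assumption that a watch fires once and must then be re-registered.

```java
import java.util.function.Consumer;

// Hypothetical in-memory model of a one-shot watch; none of these names are
// the real ZooKeeper or Accumulo API.
class RewatchSketch {

  enum EventType { NODE_DELETED, OTHER }

  /** Fake node: a registered watcher is cleared as soon as it fires once. */
  static class FakeNode {
    private Consumer<EventType> watcher;

    void watch(Consumer<EventType> w) {
      watcher = w;
    }

    void fire(EventType type) {
      Consumer<EventType> w = watcher;
      watcher = null; // one-shot: must be re-registered to see later events
      if (w != null) {
        w.accept(type);
      }
    }
  }

  /** Lock holder that watches its lock node. */
  static class LockWatcher {
    private final FakeNode node;
    private final boolean rewatch;
    boolean believesLockHeld = true;

    LockWatcher(FakeNode node, boolean rewatch) {
      this.node = node;
      this.rewatch = rewatch;
      node.watch(this::process);
    }

    void process(EventType type) {
      if (type == EventType.NODE_DELETED) {
        believesLockHeld = false; // a real server would shut itself down here
      } else if (rewatch) {
        node.watch(this::process); // the "else branch": keep watching
      }
      // Without rewatch, the watch is silently gone after an unhandled event.
    }
  }

  /** Returns true if the holder still believes it has the lock after losing it. */
  static boolean losesTrackOfLock(boolean rewatch) {
    FakeNode node = new FakeNode();
    LockWatcher w = new LockWatcher(node, rewatch);
    node.fire(EventType.OTHER);        // some other event arrives first
    node.fire(EventType.NODE_DELETED); // then the lock node goes away
    return w.believesLockHeld;
  }

  public static void main(String[] args) {
    System.out.println("without rewatch, still believes lock held: " + losesTrackOfLock(false));
    System.out.println("with rewatch, still believes lock held: " + losesTrackOfLock(true));
  }
}
```

In this model the watcher without the else branch never sees the deletion, which corresponds to step 5 above. The other option mentioned in the description, killing the server on any unhandled event, would replace the re-registration with a shutdown call.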
