Return-Path:
X-Original-To: apmail-accumulo-notifications-archive@minotaur.apache.org
Delivered-To: apmail-accumulo-notifications-archive@minotaur.apache.org
Received: from mail.apache.org (hermes.apache.org [140.211.11.3]) by minotaur.apache.org (Postfix) with SMTP id 58646E1FA for ; Thu, 24 Jan 2013 18:45:13 +0000 (UTC)
Received: (qmail 5803 invoked by uid 500); 24 Jan 2013 18:45:13 -0000
Delivered-To: apmail-accumulo-notifications-archive@accumulo.apache.org
Received: (qmail 5763 invoked by uid 500); 24 Jan 2013 18:45:13 -0000
Mailing-List: contact notifications-help@accumulo.apache.org; run by ezmlm
Precedence: bulk
List-Help:
List-Unsubscribe:
List-Post:
List-Id:
Reply-To: jira@apache.org
Delivered-To: mailing list notifications@accumulo.apache.org
Received: (qmail 5615 invoked by uid 99); 24 Jan 2013 18:45:12 -0000
Received: from arcas.apache.org (HELO arcas.apache.org) (140.211.11.28) by apache.org (qpsmtpd/0.29) with ESMTP; Thu, 24 Jan 2013 18:45:12 +0000
Date: Thu, 24 Jan 2013 18:45:12 +0000 (UTC)
From: "Keith Turner (JIRA)"
To: notifications@accumulo.apache.org
Message-ID:
In-Reply-To:
References:
Subject: [jira] [Updated] (ACCUMULO-954) ZooLock watcher can stop watching
MIME-Version: 1.0
Content-Type: text/plain; charset=utf-8
Content-Transfer-Encoding: 7bit
X-JIRA-FingerPrint: 30527f35849b9dde25b450d4833f0394

     [ https://issues.apache.org/jira/browse/ACCUMULO-954?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Keith Turner updated ACCUMULO-954:
----------------------------------

    Fix Version/s: 1.4.3
                   1.5.0

This bug existed before 1.4.2, but was not likely to happen until the master started deleting stuff in zookeeper in ACCUMULO-766. Also see ACCUMULO-799.

> ZooLock watcher can stop watching
> ---------------------------------
>
>                 Key: ACCUMULO-954
>                 URL: https://issues.apache.org/jira/browse/ACCUMULO-954
>             Project: Accumulo
>          Issue Type: Bug
>          Components: tserver
>    Affects Versions: 1.4.2
>            Reporter: Adam Fuchs
>            Assignee: Keith Turner
>            Priority: Minor
>             Fix For: 1.5.0, 1.4.3
>
>
> Basically, this will result in tablet servers failing to recognize when they lose their locks. I think the worst that can happen with this is a tablet server can fail to die after it loses its lock, which could bog down clients and create a bunch of noise in the cluster. I believe there could also be useless files generated that wouldn't get garbage collected. !METADATA table write protections and logger write protections should prevent any permanent damage or data loss. We have seen this result in warnings and errors that look like multiple hosting of tablets.
> {code}
> 2013-01-09 19:59:27,742 [tabletserver.TabletServer] INFO : port = 9997
> 2013-01-09 19:59:27,926 [zookeeper.ZooLock] DEBUG: event /accumulo/655f93d8-20fc-451f-a457-458b5717a11e/tservers/172.16.2.25:9997 NodeDeleted SyncConnected
> 2013-01-09 19:59:27,931 [tabletserver.TabletServer] INFO : Waiting for tablet server lock
> 2013-01-09 19:59:32,943 [tabletserver.TabletServer] DEBUG: Obtained tablet server lock /accumulo/655f93d8-20fc-451f-a457-458b5717a11e/tservers/172.16.2.25:9997/zlock-0000000000
> 2013-01-09 19:59:36,703 [tabletserver.TabletServer] DEBUG: Got loadTablet message from user: !SYSTEM
> {code}
> Here's what happened:
> 1. Tablet server fails to get lock, triggering the watcher on the parent node.
> 2. Watcher doesn't get reset, and doesn't take any action.
> 3. Loop in TabletServer:~2659 retries, but uses the same ZooLock object.
> 4. TabletServer loses its lock, but receives a connection loss message before the NodeDeleted message.
> 5. TabletServer continues to try to do work instead of killing itself.
> We could probably patch this for 1.4 by creating a new ZooLock within the announceExistence loop, instead of reusing the same one. Eventually, we ought to have an else branch in both of the Watchers that either resets the watch (resilient against zookeeper connection hiccups) or just kills the server to be safe.

--
This message is automatically generated by JIRA.
If you think it was sent incorrectly, please contact your JIRA administrators
For more information on JIRA, see: http://www.atlassian.com/software/jira
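
The "else branch" idea from the issue description can be illustrated with a small, self-contained sketch. It does not use the real ZooKeeper client or Accumulo's ZooLock. `OneShotWatches`, `LockHolder`, `EventKind`, and `detectsLockLoss` are hypothetical names invented for this example, and the model assumes, as a simplification, that any delivered event consumes the watch. Under that assumption, a watcher that ignores an unexpected event without re-registering never sees the later NodeDeleted event, while a watcher with a resetting else branch does.

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;
import java.util.function.Consumer;

// Hypothetical stand-in for illustration only; not the ZooKeeper or Accumulo API.
class WatchResetSketch {

    // Simplified event kinds: the deletion we care about, and anything else.
    enum EventKind { OTHER, NODE_DELETED }

    /** In-memory model in which every watch fires at most once and is then dropped. */
    static final class OneShotWatches {
        private final Map<String, List<Consumer<EventKind>>> watches = new HashMap<>();

        void watch(String path, Consumer<EventKind> watcher) {
            watches.computeIfAbsent(path, p -> new ArrayList<>()).add(watcher);
        }

        void fire(String path, EventKind kind) {
            // Remove the watchers first: firing consumes the watch (assumed model).
            List<Consumer<EventKind>> fired = watches.remove(path);
            if (fired != null) {
                for (Consumer<EventKind> w : fired) {
                    w.accept(kind);
                }
            }
        }
    }

    /** Holds a lock node and watches it for deletion. */
    static final class LockHolder {
        private final OneShotWatches zk;
        private final String lockPath;
        private final boolean resetOnOtherEvents;
        boolean lockLost = false;

        LockHolder(OneShotWatches zk, String lockPath, boolean resetOnOtherEvents) {
            this.zk = zk;
            this.lockPath = lockPath;
            this.resetOnOtherEvents = resetOnOtherEvents;
            zk.watch(lockPath, this::process);
        }

        private void process(EventKind kind) {
            if (kind == EventKind.NODE_DELETED) {
                lockLost = true; // a real server would stop doing work here
            } else if (resetOnOtherEvents) {
                // The proposed else branch: re-register so later events are still seen.
                zk.watch(lockPath, this::process);
            }
            // Without the else branch the watch is gone and nothing is watching.
        }
    }

    /** Delivers an unrelated event, then the deletion, and reports whether it was noticed. */
    static boolean detectsLockLoss(boolean resetOnOtherEvents) {
        OneShotWatches zk = new OneShotWatches();
        String path = "/tservers/host:9997/zlock-0000000000"; // made-up path
        LockHolder holder = new LockHolder(zk, path, resetOnOtherEvents);
        zk.fire(path, EventKind.OTHER);
        zk.fire(path, EventKind.NODE_DELETED);
        return holder.lockLost;
    }

    public static void main(String[] args) {
        System.out.println("without reset: " + detectsLockLoss(false)); // false
        System.out.println("with reset:    " + detectsLockLoss(true));  // true
    }
}
```

The alternative mentioned in the issue, killing the server in the else branch, would replace the re-registration with a shutdown. That trades resilience against transient events for a simpler guarantee that a server never keeps working without a live watch.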