Date: Fri, 28 Sep 2012 06:37:07 +1100 (NCT)
From: "Eric Newton (JIRA)"
To: notifications@accumulo.apache.org
Reply-To: jira@apache.org
Message-ID: <1350657762.135310.1348774627738.JavaMail.jiratomcat@arcas>
Subject: [jira] [Created] (ACCUMULO-777) isLockHeld needs better bullet-proofing against transient errors

Eric Newton created ACCUMULO-777:
------------------------------------

             Summary: isLockHeld needs better bullet-proofing against transient errors
                 Key: ACCUMULO-777
                 URL: https://issues.apache.org/jira/browse/ACCUMULO-777
             Project: Accumulo
          Issue Type: Bug
          Components: client
    Affects Versions: 1.3.5, 1.4.0, 1.3.6, 1.4.1
         Environment: medium sized cluster
            Reporter: Eric Newton
            Assignee: Eric Newton
             Fix For: 1.4.2, 1.4.1

During the minor compaction, the ZooKeeper lock for the tablet server is double-checked prior to updating the METADATA table information.
In one unlucky moment, the ZooKeeper connection was lost during this check. The tablet server failed the check even though the lock had not been lost. As a result, the root tablet remained hosted for another 4 weeks but did not flush mutations to disk. When memory filled, the operator noticed a long hold time and killed the server. This caused a log recovery of 98 1G logs, some of which were very old.

--
This message is automatically generated by JIRA.
If you think it was sent incorrectly, please contact your JIRA administrators.
For more information on JIRA, see: http://www.atlassian.com/software/jira
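
The failure described in the report comes down to treating "could not reach ZooKeeper" as if it meant "lock not held". A minimal Java sketch of one way to keep those two outcomes apart is given here. It is not the fix that was committed for ACCUMULO-777, and it does not use the real ZooKeeper or Accumulo APIs: LockProbe, LockStateUnknownException, and the retry parameters are hypothetical names introduced only for illustration, and IOException stands in for whatever transient connection error the real client raises.

```java
import java.io.IOException;

public class LockCheckSketch {

    /**
     * Hypothetical probe for the lock node. Returns true or false when the
     * state is known, and throws when the state could not be determined
     * (for example, because the connection dropped mid-check).
     */
    @FunctionalInterface
    public interface LockProbe {
        boolean lockNodeExists() throws IOException;
    }

    /** Raised when the lock state is still unknown after every attempt. */
    public static class LockStateUnknownException extends Exception {
        public LockStateUnknownException(String msg, Throwable cause) {
            super(msg, cause);
        }
    }

    /**
     * Retries the probe on transient errors instead of reporting "not held".
     * A definitive answer (true or false) is returned immediately.
     */
    public static boolean isLockHeld(LockProbe probe, int maxAttempts, long backoffMillis)
            throws LockStateUnknownException, InterruptedException {
        if (maxAttempts < 1) {
            throw new IllegalArgumentException("maxAttempts must be >= 1");
        }
        IOException last = null;
        for (int attempt = 1; attempt <= maxAttempts; attempt++) {
            try {
                return probe.lockNodeExists();
            } catch (IOException e) {
                // Transient error: the lock state is unknown, not "lost".
                last = e;
                if (attempt < maxAttempts) {
                    Thread.sleep(backoffMillis * attempt); // linear backoff
                }
            }
        }
        throw new LockStateUnknownException(
                "lock state unknown after " + maxAttempts + " attempts", last);
    }

    public static void main(String[] args) throws Exception {
        int[] calls = {0};
        // Simulated probe: the first two checks hit a dropped connection.
        LockProbe flaky = () -> {
            if (calls[0]++ < 2) {
                throw new IOException("connection lost");
            }
            return true;
        };
        System.out.println(isLockHeld(flaky, 5, 10L));
    }
}
```

With the simulated probe in main, the first two attempts fail with a connection error and the third returns a definitive answer, so the program prints true. The caller still has to decide what to do when LockStateUnknownException is raised; the sketch only makes sure that case is not silently reported as a lost lock.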