Date: Thu, 3 Jul 2014 19:36:34 +0000 (UTC)
From: "Sean Busbey (JIRA)"
To: notifications@accumulo.apache.org
Reply-To: jira@apache.org
Subject: [jira] [Commented] (ACCUMULO-2976) blacklist problematic tservers

    [ https://issues.apache.org/jira/browse/ACCUMULO-2976?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14051846#comment-14051846 ]

Sean Busbey commented on ACCUMULO-2976:
---------------------------------------

I think it's more consistent with ops for other projects to handle it ourselves.

> blacklist problematic tservers
> ------------------------------
>
>                 Key: ACCUMULO-2976
>                 URL: https://issues.apache.org/jira/browse/ACCUMULO-2976
>             Project: Accumulo
>          Issue Type: Improvement
>          Components: master
>            Reporter: Sean Busbey
>            Priority: Minor
>
> It would be nice if the master kept track of tservers that misbehave and eventually blacklisted them, similar to how HDFS handles datanodes and MapReduce/YARN handle trackers.
> Right now the closest thing we do is having the master kill the zoolock for tservers that are behaving poorly. This causes them to exit if they're not in a zombie state.
> On deployments with a watchdog that relaunches failed processes, this doesn't help much because the tserver comes back. In the case of, e.g., flaky network failures on the node, this just means repeating the process and impacting cluster performance while the master works out that it should kill the node again.
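
To make the idea in the issue description concrete, here is a minimal sketch of how the master might count repeated forced shutdowns of the same tserver and blacklist it once they pile up. This is not Accumulo code. The class TServerBlacklist, its method names, and the policy itself (N strikes inside a sliding time window) are all assumptions made up for illustration; the issue does not say what policy should be used.

```java
import java.util.ArrayDeque;
import java.util.Deque;
import java.util.HashMap;
import java.util.HashSet;
import java.util.Map;
import java.util.Set;

/** Hypothetical sketch only; none of these names exist in Accumulo. */
final class TServerBlacklist {
    private final int maxStrikes;      // assumed threshold, e.g. 3
    private final long windowMillis;   // assumed sliding window, e.g. one hour
    private final Map<String, Deque<Long>> strikes = new HashMap<>();
    private final Set<String> blacklisted = new HashSet<>();

    TServerBlacklist(int maxStrikes, long windowMillis) {
        this.maxStrikes = maxStrikes;
        this.windowMillis = windowMillis;
    }

    /**
     * Records one strike against a tserver, for example each time the master
     * has to kill its lock. Returns true if the server is now blacklisted.
     */
    synchronized boolean recordStrike(String tserver, long nowMillis) {
        Deque<Long> times = strikes.computeIfAbsent(tserver, k -> new ArrayDeque<>());
        times.addLast(nowMillis);
        // Forget strikes that have aged out of the sliding window.
        while (!times.isEmpty() && nowMillis - times.peekFirst() > windowMillis) {
            times.removeFirst();
        }
        if (times.size() >= maxStrikes) {
            blacklisted.add(tserver);
        }
        return blacklisted.contains(tserver);
    }

    /** True if the server should be refused when it tries to rejoin. */
    synchronized boolean isBlacklisted(String tserver) {
        return blacklisted.contains(tserver);
    }

    /** Lets an operator put a repaired node back into service. */
    synchronized void clear(String tserver) {
        blacklisted.remove(tserver);
        strikes.remove(tserver);
    }
}
```

With something like this, a tserver that a watchdog keeps relaunching would be refused after its third strike in the window instead of being killed again and again, and an operator could call clear once the node is fixed.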