Date: Tue, 31 Mar 2015 19:41:53 +0000 (UTC)
From: "Billie Rinaldi (JIRA)"
To: notifications@accumulo.apache.org
Reply-To: jira@apache.org
Subject: [jira] [Commented] (ACCUMULO-3569) Automatically restart accumulo processes intelligently

    [ https://issues.apache.org/jira/browse/ACCUMULO-3569?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14389258#comment-14389258 ]

Billie Rinaldi commented on ACCUMULO-3569:
------------------------------------------

I'd rather this be turned off by default.

> Automatically restart accumulo processes intelligently
> ------------------------------------------------------
>
>                 Key: ACCUMULO-3569
>                 URL: https://issues.apache.org/jira/browse/ACCUMULO-3569
>             Project: Accumulo
>          Issue Type: Bug
>          Components: scripts
>            Reporter: John Vines
>             Fix For: 1.7.0
>
>         Attachments: 0001-ACCUMULO-3569-initial-pass-at-integrating-auto-resta.patch
>
>
> On occasion a process will die, for a variety of reasons. Some of those reasons are critical, while others are momentary blips, and not all of them warrant keeping the server down and requiring human attention.
>
> With that in mind, I would like to propose a watcher process: an optional component that wraps the calls to the various processes (tserver, master, etc.). This watcher can monitor the processes, collect their exit codes, read their logs, etc., and make intelligent decisions about how to respond. That behavior would include coarse detection of failure types (discussed below) and a configurable policy for how many restart attempts should be made in a given window before giving up entirely.
>
> As for failure types, a few archetypal ones recur regularly and are prime candidates for an initial approach:
>
> ZooKeeper lock lost - this can happen for a variety of reasons, mostly related to network issues or server (tserver or ZooKeeper node) congestion. These are some of the most common errors and are typically transient. However, if they occur with great frequency, that is a sign of a larger issue that needs to be handled by an administrator.
> JVM OOM - there are two situations where these really seem to occur: a system that is simply misconfigured and dies shortly after it starts up, and a system that gets slammed in just the right way that objects in our code and/or the iterator stack push the JVM just over its limits. In the former case the process will fail again quickly each time it is restarted, whereas the latter case occurs rarely and will want attention, but doesn't warrant keeping the node offline in the meantime.
>
> Standard shutdown - this is simply the case where we don't want the watcher to intervene, because we want the process to go down. Just a design consideration.
>
> Unexpected exceptions - this is a catch-all for everything else. We can attempt to enumerate them, but they're less common. This category would be configured with less tolerance, but just because a server goes down due to a random software bug doesn't mean that server should be removed from the cluster unless it happens repeatedly (because then it's a sign of a hardware/system issue). We should provide the ability to keep those resources available.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)
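
To make the quoted proposal concrete, here is a minimal sketch of the kind of watcher loop the description suggests: a wrapper JVM that launches a server command, classifies a failure coarsely by exit code, and applies a restart budget per time window. The class name, exit-code conventions, and thresholds below are illustrative assumptions and are not taken from the attached patch.

import java.time.Duration;
import java.time.Instant;
import java.util.ArrayDeque;
import java.util.Deque;

/**
 * Hypothetical sketch of the proposed watcher: start a server process, inspect its
 * exit code when it dies, and restart it unless failures pile up within a window.
 * Names, exit codes, and thresholds are illustrative only.
 */
public class ProcessWatcher {

  // Assumed policy: at most MAX_RESTARTS restarts within WINDOW before giving up.
  private static final int MAX_RESTARTS = 5;
  private static final Duration WINDOW = Duration.ofMinutes(10);

  // Exit-code conventions are assumptions for this sketch; a real watcher would
  // agree on codes with the wrapped process or read its logs instead.
  private static final int CLEAN_SHUTDOWN = 0;
  private static final int OUT_OF_MEMORY = 3;
  private static final int LOST_ZK_LOCK = 4;

  private final Deque<Instant> recentFailures = new ArrayDeque<>();

  public static void main(String[] args) throws Exception {
    if (args.length == 0) {
      System.err.println("usage: ProcessWatcher <command...>");
      System.exit(1);
    }
    new ProcessWatcher().supervise(args);
  }

  void supervise(String[] command) throws Exception {
    while (true) {
      Process proc = new ProcessBuilder(command).inheritIO().start();
      int exitCode = proc.waitFor();

      if (exitCode == CLEAN_SHUTDOWN) {
        // Standard shutdown: the watcher should not intervene.
        System.out.println("Clean shutdown requested; not restarting.");
        return;
      }

      recordFailure();
      if (failuresInWindow() > MAX_RESTARTS) {
        System.err.println("Too many failures within " + WINDOW + "; giving up.");
        return;
      }

      // Coarse classification of the failure type, as the description proposes.
      if (exitCode == OUT_OF_MEMORY) {
        System.err.println("Process died with OOM; restarting, but heap sizing needs attention.");
      } else if (exitCode == LOST_ZK_LOCK) {
        System.err.println("Process lost its ZooKeeper lock; restarting.");
      } else {
        System.err.println("Process exited unexpectedly with code " + exitCode + "; restarting.");
      }
    }
  }

  private void recordFailure() {
    recentFailures.addLast(Instant.now());
  }

  private int failuresInWindow() {
    // Drop failures older than the window, then count what remains.
    Instant cutoff = Instant.now().minus(WINDOW);
    while (!recentFailures.isEmpty() && recentFailures.peekFirst().isBefore(cutoff)) {
      recentFailures.removeFirst();
    }
    return recentFailures.size();
  }
}

Run as, for example, "java ProcessWatcher accumulo tserver" (command shown only as an illustration); whether such a watcher is enabled by default is exactly the point under discussion in the comment above.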