Return-Path:
X-Original-To: apmail-accumulo-commits-archive@www.apache.org
Delivered-To: apmail-accumulo-commits-archive@www.apache.org
Received: from mail.apache.org (hermes.apache.org [140.211.11.3])
    by minotaur.apache.org (Postfix) with SMTP id 4AF1E11F03
    for ; Tue, 1 Jul 2014 05:30:59 +0000 (UTC)
Received: (qmail 23533 invoked by uid 500); 1 Jul 2014 05:30:59 -0000
Delivered-To: apmail-accumulo-commits-archive@accumulo.apache.org
Received: (qmail 23478 invoked by uid 500); 1 Jul 2014 05:30:58 -0000
Mailing-List: contact commits-help@accumulo.apache.org; run by ezmlm
Precedence: bulk
List-Help:
List-Unsubscribe:
List-Post:
List-Id:
Reply-To: dev@accumulo.apache.org
Delivered-To: mailing list commits@accumulo.apache.org
Received: (qmail 23465 invoked by uid 99); 1 Jul 2014 05:30:58 -0000
Received: from tyr.zones.apache.org (HELO tyr.zones.apache.org) (140.211.11.114)
    by apache.org (qpsmtpd/0.29) with ESMTP; Tue, 01 Jul 2014 05:30:58 +0000
Received: by tyr.zones.apache.org (Postfix, from userid 65534)
    id AF5B89908AB; Tue, 1 Jul 2014 05:30:58 +0000 (UTC)
Content-Type: text/plain; charset="us-ascii"
MIME-Version: 1.0
Content-Transfer-Encoding: 7bit
From: elserj@apache.org
To: commits@accumulo.apache.org
Date: Tue, 01 Jul 2014 05:30:58 -0000
Message-Id:
X-Mailer: ASF-Git Admin Mailer
Subject: [1/4] git commit: ACCUMULO-2963 Update ReplicationDriver to try/catch each step in the main-loop.

Repository: accumulo
Updated Branches:
  refs/heads/master 79bb5c1c1 -> 7c4d620a9


ACCUMULO-2963 Update ReplicationDriver to try/catch each step in the main-loop.

An RTE bubbling up from any step inside the ReplicationDriver, for example
one coming from the BatchScanner on Thrift exception, will inadvertently kill
the entire Daemon thread that runs replication. Try/catch the exception, log it,
and then retry the operation on the next cycle.

Project: http://git-wip-us.apache.org/repos/asf/accumulo/repo
Commit: http://git-wip-us.apache.org/repos/asf/accumulo/commit/73fc496a
Tree: http://git-wip-us.apache.org/repos/asf/accumulo/tree/73fc496a
Diff: http://git-wip-us.apache.org/repos/asf/accumulo/diff/73fc496a

Branch: refs/heads/master
Commit: 73fc496a5474528d9a5a6de0e4027b506473f6e1
Parents: 0faaabc
Author: Josh Elser
Authored: Tue Jul 1 00:48:23 2014 -0400
Committer: Josh Elser
Committed: Tue Jul 1 01:23:14 2014 -0400

----------------------------------------------------------------------
 .../master/replication/ReplicationDriver.java   | 24 ++++++++++++++++----
 1 file changed, 20 insertions(+), 4 deletions(-)
----------------------------------------------------------------------


http://git-wip-us.apache.org/repos/asf/accumulo/blob/73fc496a/server/master/src/main/java/org/apache/accumulo/master/replication/ReplicationDriver.java
----------------------------------------------------------------------
diff --git a/server/master/src/main/java/org/apache/accumulo/master/replication/ReplicationDriver.java b/server/master/src/main/java/org/apache/accumulo/master/replication/ReplicationDriver.java
index e98bc1d..ce6f6dc 100644
--- a/server/master/src/main/java/org/apache/accumulo/master/replication/ReplicationDriver.java
+++ b/server/master/src/main/java/org/apache/accumulo/master/replication/ReplicationDriver.java
@@ -80,18 +80,34 @@ public class ReplicationDriver extends Daemon {
       // Make status markers from replication records in metadata, removing entries in
       // metadata which are no longer needed (closed records)
       // This will end up creating the replication table too
-      statusMaker.run();
+      try {
+        statusMaker.run();
+      } catch (Exception e) {
+        log.error("Caught Exception trying to create Replication status records", e);
+      }
 
       // Tell the work maker to make work
-      workMaker.run();
+      try {
+        workMaker.run();
+      } catch (Exception e) {
+        log.error("Caught Exception trying to create Replication work records", e);
+      }
 
       // Update the status records from the work records
-      finishedWorkUpdater.run();
+      try {
+        finishedWorkUpdater.run();
+      } catch (Exception e) {
+        log.error("Caught Exception trying to update Replication records using finished work records", e);
+      }
 
       // Clean up records we no longer need.
       // It must be running at the same time as the StatusMaker or WorkMaker
       // So it's important that we run these sequentially and not concurrently
-      rcrr.run();
+      try {
+        rcrr.run();
+      } catch (Exception e) {
+        log.error("Caught Exception trying to remove finished Replication records", e);
+      }
 
       Trace.offNoFlush();
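
For illustration only, here is a minimal standalone sketch of the pattern this commit applies: each step of a cycle gets its own try/catch, so a RuntimeException from one step is logged and the remaining steps still run. The class `GuardedSteps`, the method `runCycle`, and the step names are hypothetical and are not part of Accumulo; `System.err` stands in for the real logger.

```java
import java.util.ArrayList;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

// Hypothetical illustration, not Accumulo code.
final class GuardedSteps {

  // Runs every step in insertion order. A step that throws is reported and
  // recorded, and the steps after it still run. Returns the failed step names.
  static List<String> runCycle(Map<String, Runnable> steps) {
    List<String> failed = new ArrayList<>();
    for (Map.Entry<String, Runnable> step : steps.entrySet()) {
      try {
        step.getValue().run();
      } catch (Exception e) {
        // Stand-in for log.error(...): report the failure and move on.
        System.err.println("Caught Exception in step '" + step.getKey() + "': " + e);
        failed.add(step.getKey());
      }
    }
    return failed;
  }

  public static void main(String[] args) {
    // LinkedHashMap keeps the steps sequential in the order they were added.
    Map<String, Runnable> steps = new LinkedHashMap<>();
    steps.put("statusMaker", () -> {
      throw new RuntimeException("simulated scanner failure");
    });
    steps.put("workMaker", () -> System.out.println("workMaker ran"));

    List<String> failed = runCycle(steps);
    System.out.println("failed steps: " + failed);
  }
}
```

In a long-running loop, the caller would invoke something like `runCycle` once per cycle, so a step that failed gets retried on the next pass instead of terminating the thread.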