Return-Path:
X-Original-To: apmail-hadoop-hdfs-issues-archive@minotaur.apache.org
Delivered-To: apmail-hadoop-hdfs-issues-archive@minotaur.apache.org
Received: from mail.apache.org (hermes.apache.org [140.211.11.3])
	by minotaur.apache.org (Postfix) with SMTP id E44E2174B6
	for ; Fri, 10 Oct 2014 21:32:34 +0000 (UTC)
Received: (qmail 88694 invoked by uid 500); 10 Oct 2014 21:32:34 -0000
Delivered-To: apmail-hadoop-hdfs-issues-archive@hadoop.apache.org
Received: (qmail 88635 invoked by uid 500); 10 Oct 2014 21:32:34 -0000
Mailing-List: contact hdfs-issues-help@hadoop.apache.org; run by ezmlm
Precedence: bulk
List-Help:
List-Unsubscribe:
List-Post:
List-Id:
Reply-To: hdfs-issues@hadoop.apache.org
Delivered-To: mailing list hdfs-issues@hadoop.apache.org
Received: (qmail 88623 invoked by uid 99); 10 Oct 2014 21:32:34 -0000
Received: from arcas.apache.org (HELO arcas.apache.org) (140.211.11.28)
	by apache.org (qpsmtpd/0.29) with ESMTP; Fri, 10 Oct 2014 21:32:34 +0000
Date: Fri, 10 Oct 2014 21:32:34 +0000 (UTC)
From: "Chris Nauroth (JIRA)"
To: hdfs-issues@hadoop.apache.org
Message-ID:
In-Reply-To:
References:
Subject: [jira] [Commented] (HDFS-7121) For JournalNode operations that
 must succeed on all nodes, attempt to undo the operation on all nodes if
 it fails on one node.
MIME-Version: 1.0
Content-Type: text/plain; charset=utf-8
Content-Transfer-Encoding: 7bit
X-JIRA-FingerPrint: 30527f35849b9dde25b450d4833f0394


    [ https://issues.apache.org/jira/browse/HDFS-7121?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14167553#comment-14167553 ]

Chris Nauroth commented on HDFS-7121:
-------------------------------------

I agree that a pre-check strategy is likely to be good enough.  Upgrade and rollback are operations that execute infrequently.  Typically they're done during periods of low activity on the cluster with a close watch by an admin.  Clients can't really connect anyway.
It's highly likely that a pre-check would expose potential problems in most realistic scenarios, and the inherent time-of-check/time-of-use race condition is unlikely to happen.  I'm going to draft a prototype patch for a pre-check RPC.

> For JournalNode operations that must succeed on all nodes, attempt to undo the operation on all nodes if it fails on one node.
> ------------------------------------------------------------------------------------------------------------------------------
>
>                 Key: HDFS-7121
>                 URL: https://issues.apache.org/jira/browse/HDFS-7121
>             Project: Hadoop HDFS
>          Issue Type: Sub-task
>          Components: journal-node
>            Reporter: Chris Nauroth
>
> Several JournalNode operations are not satisfied by a quorum.  They must succeed on every JournalNode in the cluster.  If the operation succeeds on some nodes, but fails on others, then this may leave the nodes in an inconsistent state and require operations to do manual recovery steps.  For example, if {{doPreUpgrade}} succeeds on 2 nodes and fails on 1 node, then the operator will need to correct the problem on the failed node and also manually restore the previous.tmp directory to current on the 2 successful nodes before reattempting the upgrade.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)
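
For readers of the archive, the pre-check idea discussed in this comment can be sketched roughly as follows. This is only an illustration in Python, not the prototype patch and not the HDFS API: `FakeJournalNode`, `pre_check`, and `pre_upgrade_all` are hypothetical names, and only `doPreUpgrade` comes from the issue description. The coordinator runs a read-only check on every node first and starts the mutating step only if all of them pass.

```python
from dataclasses import dataclass


@dataclass
class FakeJournalNode:
    """Hypothetical in-memory stand-in for a JournalNode; not the HDFS API."""
    name: str
    healthy: bool = True     # whether the pre-check would pass on this node
    upgraded: bool = False   # set once the mutating step has run

    def pre_check(self) -> bool:
        # Read-only probe: must not change any state on the node.
        return self.healthy

    def do_pre_upgrade(self) -> None:
        # Mutating step. It can still fail if the node's condition changed
        # after the pre-check (the time-of-check/time-of-use race).
        if not self.healthy:
            raise RuntimeError(f"{self.name}: doPreUpgrade failed")
        self.upgraded = True


def pre_upgrade_all(nodes):
    """Run the mutating step only if every node passes the pre-check.

    Returns the names of the nodes that failed the pre-check. An empty list
    means the operation was attempted on all nodes.
    """
    failed = [n.name for n in nodes if not n.pre_check()]
    if failed:
        return failed  # nothing has been modified on any node
    for n in nodes:
        n.do_pre_upgrade()
    return []
```

With three nodes of which one fails the check, `pre_upgrade_all` returns that node's name and leaves all three untouched, so there is no previous.tmp directory to restore by hand. The sketch does not close the race between the check and the operation; it only makes a partial failure less likely, which is the trade-off the comment accepts.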