hadoop-hdfs-issues mailing list archives

From "Chris Nauroth (JIRA)" <j...@apache.org>
Subject [jira] [Commented] (HDFS-7121) For JournalNode operations that must succeed on all nodes, attempt to undo the operation on all nodes if it fails on one node.
Date Tue, 23 Sep 2014 22:36:34 GMT

    [ https://issues.apache.org/jira/browse/HDFS-7121?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14145556#comment-14145556 ]

Chris Nauroth commented on HDFS-7121:
-------------------------------------

bq. I think it's probably good enough to just check if all JournalNodes are present before
sending out the doPreUpgrade message.

Hi Colin.  This came out of a production support issue in which invalid file system
permissions caused the rename from current to previous.tmp to fail on 1 of 3 JournalNodes.
No nodes were down.  A pre-check like the one you suggested wouldn't have protected against
this, because the failure doesn't show up until the rename is actually attempted.
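
To make the distinction concrete, here is a minimal sketch of the attempt-then-undo idea.
It is not the actual JournalNode code: it assumes each node's storage is just a local
directory, and the class name AllOrNothingRename and method preUpgradeAll are hypothetical
names chosen for illustration.  The real operation is an RPC to each JournalNode, so a real
undo would also have to cope with nodes that become unreachable partway through.

```java
import java.io.IOException;
import java.nio.file.Files;
import java.nio.file.Path;
import java.util.ArrayList;
import java.util.List;

// Hypothetical illustration only; not part of HDFS.
public class AllOrNothingRename {

    /**
     * Renames current -> previous.tmp under every storage root.  If any
     * rename fails, renames previous.tmp back to current on the roots that
     * had already succeeded, then rethrows the original failure.
     */
    public static void preUpgradeAll(List<Path> roots) throws IOException {
        List<Path> done = new ArrayList<>();
        try {
            for (Path root : roots) {
                // The failure only surfaces here, when the work is attempted.
                Files.move(root.resolve("current"), root.resolve("previous.tmp"));
                done.add(root);
            }
        } catch (IOException e) {
            // Best-effort undo on the roots that were already renamed.
            for (Path root : done) {
                try {
                    Files.move(root.resolve("previous.tmp"), root.resolve("current"));
                } catch (IOException undoFailure) {
                    // Keep the undo failure attached to the original error.
                    e.addSuppressed(undoFailure);
                }
            }
            throw e;
        }
    }
}
```

In this sketch a check that all three directories are reachable would pass, yet the rename
can still fail on one of them; the undo loop is what returns the other two to their
original state so the operation can simply be retried after the problem is fixed.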

> For JournalNode operations that must succeed on all nodes, attempt to undo the operation
> on all nodes if it fails on one node.
> ------------------------------------------------------------------------------------------------------------------------------
>
>                 Key: HDFS-7121
>                 URL: https://issues.apache.org/jira/browse/HDFS-7121
>             Project: Hadoop HDFS
>          Issue Type: Sub-task
>          Components: journal-node
>            Reporter: Chris Nauroth
>
> Several JournalNode operations are not satisfied by a quorum.  They must succeed on every
> JournalNode in the cluster.  If the operation succeeds on some nodes, but fails on others,
> then this may leave the nodes in an inconsistent state and require operations to do manual
> recovery steps.  For example, if {{doPreUpgrade}} succeeds on 2 nodes and fails on 1 node,
> then the operator will need to correct the problem on the failed node and also manually restore
> the previous.tmp directory to current on the 2 successful nodes before reattempting the upgrade.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)
