hadoop-hdfs-issues mailing list archives

From "Colin Patrick McCabe (JIRA)" <j...@apache.org>
Subject [jira] [Commented] (HDFS-7121) For JournalNode operations that must succeed on all nodes, attempt to undo the operation on all nodes if it fails on one node.
Date Tue, 23 Sep 2014 22:58:34 GMT

    [ https://issues.apache.org/jira/browse/HDFS-7121?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14145576#comment-14145576 ]

Colin Patrick McCabe commented on HDFS-7121:
--------------------------------------------

Good point.  I wasn't thinking of that failure case.

I think a "pre-check" should include checking that we have the ability to write to the target
directory.  POSIX has access() for this... maybe Java never bothered to implement this, but
we could get something similar by creating a directory there with a random UUID and then immediately
deleting it.  If we can do that, then it's almost certain that we can do the rename later,
barring something exotic like ACLs or SELinux.
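
A minimal sketch of that probe, assuming plain java.nio and nothing
Hadoop-specific. The class name WritePreCheck and the method canCreateIn are
hypothetical names used only for illustration; they are not existing
JournalNode code.

```java
import java.io.IOException;
import java.nio.file.Files;
import java.nio.file.Path;
import java.util.UUID;

/** Hypothetical helper for illustration; not part of Hadoop. */
final class WritePreCheck {
    private WritePreCheck() {}

    /**
     * Returns true if a uniquely named probe directory could be created
     * inside targetDir and then deleted again. This only suggests that a
     * later rename into targetDir is likely to succeed; it does not prove it.
     */
    static boolean canCreateIn(Path targetDir) {
        // A random UUID makes a collision with an existing entry very unlikely.
        Path probe = targetDir.resolve(".precheck-" + UUID.randomUUID());
        try {
            Files.createDirectory(probe);
        } catch (IOException | SecurityException e) {
            // Missing directory, no write permission, read-only filesystem, etc.
            return false;
        }
        try {
            Files.delete(probe);
            return true;
        } catch (IOException | SecurityException e) {
            // Created but not removable: treat as a failed check.
            // The probe directory may be left behind in this case.
            return false;
        }
    }
}
```

A caller would run WritePreCheck.canCreateIn(dir) against the storage
directory before attempting the rename, and refuse to proceed when it returns
false.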

Of course, even if we did two-phase commit, we'd still have to do something meaningful in
the "promise" phase.  That would mean doing exactly this check that the filesystem permissions
were sane.  Otherwise the node would be making a promise it couldn't keep.
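
The "check everywhere first, then act" ordering can be sketched as below.
UpgradeParticipant and AllNodesCoordinator are hypothetical names standing in
for a JournalNode and its caller; they are not Hadoop interfaces, and a passing
check is only a strong hint, not a guarantee, that the later rename succeeds.

```java
import java.util.List;

/** Hypothetical stand-in for one JournalNode; not a Hadoop interface. */
interface UpgradeParticipant {
    /** The "promise": can this node write to its target directory? */
    boolean preCheck();

    /** The state-changing step, e.g. the directory rename. */
    void doRename();
}

/** Hypothetical coordinator for illustration only. */
final class AllNodesCoordinator {
    private AllNodesCoordinator() {}

    /**
     * Runs preCheck on every node before any node renames anything.
     * Returns false, with no node modified, if any pre-check fails.
     */
    static boolean renameOnAll(List<? extends UpgradeParticipant> nodes) {
        for (UpgradeParticipant node : nodes) {
            if (!node.preCheck()) {
                return false; // nothing has been renamed yet
            }
        }
        // Only reached when every node passed its pre-check.
        for (UpgradeParticipant node : nodes) {
            node.doRename();
        }
        return true;
    }
}
```

The point of the ordering is that the common failure (bad permissions on one
node) is detected before any node has changed state, so there is nothing to
undo.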

I don't like the "have everyone do the rename and roll back everyone if someone fails" solution
that you mentioned earlier.  I think it's rather complex and has a lot of weird corner cases
(like if rollback fails).  Plus, we already have something called "rollback" and this would
create more terminology confusion.

> For JournalNode operations that must succeed on all nodes, attempt to undo the operation on all nodes if it fails on one node.
> ------------------------------------------------------------------------------------------------------------------------------
>
>                 Key: HDFS-7121
>                 URL: https://issues.apache.org/jira/browse/HDFS-7121
>             Project: Hadoop HDFS
>          Issue Type: Sub-task
>          Components: journal-node
>            Reporter: Chris Nauroth
>
> Several JournalNode operations are not satisfied by a quorum.  They must succeed on every
> JournalNode in the cluster.  If the operation succeeds on some nodes, but fails on others,
> then this may leave the nodes in an inconsistent state and require operations to do manual
> recovery steps.  For example, if {{doPreUpgrade}} succeeds on 2 nodes and fails on 1 node,
> then the operator will need to correct the problem on the failed node and also manually restore
> the previous.tmp directory to current on the 2 successful nodes before reattempting the upgrade.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)
