hadoop-hdfs-issues mailing list archives

From "Todd Lipcon (JIRA)" <j...@apache.org>
Subject [jira] [Commented] (HDFS-4114) Remove the CheckpointNode
Date Thu, 01 Nov 2012 18:41:12 GMT

    [ https://issues.apache.org/jira/browse/HDFS-4114?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13488914#comment-13488914 ]

Todd Lipcon commented on HDFS-4114:
-----------------------------------

Konstantin: could you please elaborate on how you use the BackupNode?

As discussed in the thread, it's difficult to see how it's usable in its current state, and
there has been no work in Apache to move it forward.

Here are the issues I see with the backupnode:

- It doesn't provide a hot standby, since it doesn't get any block information. I've seen
your prototype using a "load duplicator", but that software is not available in Apache, and
I don't think it would correctly handle the majority of the corner cases we had to solve during
HDFS-1623 development.

- Even with the above addressed, there is no functionality to "promote" a backup node to active,
so it doesn't provide HA at all.

- Because it uses RPC to transfer edits, it ties the availability and response time of the
Active to the availability and response time of the Backup. Up until recently (HDFS-3126)
there was no RPC timeout configured at all on the backup stream, so if the backup lost its
network connection or otherwise froze, the active would freeze for several minutes if not
indefinitely. Thus it actually _reduces_ availability in all currently released branches.

After adding the timeout, there is now the possibility that the active and backup are not
synchronized. Without external synchronization there is no way to know whether the two nodes
are synchronized, and thus even if we _had_ a way to promote the backup, there'd be no safe
way to do so automatically without risking rollback of the namespace. So the backup cannot
be used for automatic failover in its current form without substantial design changes.
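
To make the timeout trade-off concrete, here is a minimal Java sketch. It is not HDFS code:
the class BackupStreamSketch, its fields, and the logEdit/lag methods are all hypothetical
names invented for illustration. It assumes the active forwards each edit to the backup
synchronously and waits for an acknowledgement. With an unbounded wait, a frozen backup would
stall the active; with a bounded wait, the active keeps going but the backup silently falls
behind.

```java
import java.util.concurrent.Callable;
import java.util.concurrent.ExecutionException;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.Future;
import java.util.concurrent.TimeUnit;
import java.util.concurrent.TimeoutException;

// Hypothetical illustration only; none of these names exist in HDFS.
class BackupStreamSketch {
    private long activeTxId = 0;       // last edit applied on the active
    private long ackedBackupTxId = 0;  // last edit the backup acknowledged
    private boolean backupDropped = false;

    /**
     * Applies one edit on the active, then forwards it to the backup and
     * waits at most timeoutMs for the acknowledgement.
     * Returns true only if the backup acknowledged in time.
     */
    boolean logEdit(Callable<Void> sendToBackup, long timeoutMs) {
        activeTxId++;
        if (backupDropped) {
            return false; // the backup stream was already abandoned
        }
        ExecutorService pool = Executors.newSingleThreadExecutor();
        Future<Void> ack = pool.submit(sendToBackup);
        try {
            // Calling ack.get() with no timeout here would block the active
            // for as long as the backup stays unresponsive.
            ack.get(timeoutMs, TimeUnit.MILLISECONDS);
            ackedBackupTxId = activeTxId;
            return true;
        } catch (TimeoutException | ExecutionException e) {
            ack.cancel(true);
            backupDropped = true; // the active moves on without the backup
            return false;
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
            ack.cancel(true);
            backupDropped = true;
            return false;
        } finally {
            pool.shutdownNow();
        }
    }

    /** Number of edits the active has applied that the backup never acknowledged. */
    long lag() {
        return activeTxId - ackedBackupTxId;
    }
}
```

In this sketch only the active can compute lag(). A backup that stopped receiving edits has
no local way to tell that it is behind, which is why promoting it without some external
source of truth risks rolling back the namespace.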

- Even if you are using the BN in an older version or a private fork, it is clear that you
aren't maintaining it in current releases. The backupnode tests were failing for many months
earlier this year with no one stepping up to fix them. Other contributors have had to step
in and maintain the code, e.g. with fixes like HDFS-2666, HDFS-2764, HDFS-3625.

So, to summarize, please justify your -1 with an explanation of how you are using the BackupNode
to provide some feature which is not already more mature and production-ready elsewhere in
Hadoop 2.x.
                
> Remove the CheckpointNode
> -------------------------
>
>                 Key: HDFS-4114
>                 URL: https://issues.apache.org/jira/browse/HDFS-4114
>             Project: Hadoop HDFS
>          Issue Type: Bug
>            Reporter: Eli Collins
>            Assignee: Eli Collins
>
> Per the thread on hdfs-dev@ (http://s.apache.org/tMT) let's remove the BackupNode and
> CheckpointNode.

--
This message is automatically generated by JIRA.
If you think it was sent incorrectly, please contact your JIRA administrators
For more information on JIRA, see: http://www.atlassian.com/software/jira
