hadoop-hdfs-issues mailing list archives

From "Andrew Ryan (Commented) (JIRA)" <j...@apache.org>
Subject [jira] [Commented] (HDFS-2601) Proposal to store edits and checkpoints inside HDFS itself for namenode HA
Date Wed, 30 Nov 2011 07:07:39 GMT

    [ https://issues.apache.org/jira/browse/HDFS-2601?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13159871#comment-13159871 ]

Andrew Ryan commented on HDFS-2601:

Following on Todd's comment, it seems to me we now have proposals floating around to store
the image/edits data in one of three different locations: shared storage (NFS), BookKeeper,
and HDFS.

NFS is stable and mature, but good, reliable, redundant NFS hardware is expensive and proprietary.
The NFS filesystem has a lot of features which we don't need for this use case. There are
other shared storage filesystems out there too, but as far as I know none of them are in wide
use for storing image/edits data, so I'm only mentioning NFS.

HDFS is stable, but not yet for this use case, and we'd need a lot of work to get it there.
We'd still end up with a stack that, like NFS, is a gazillion lines of code and handles
a lot of different stuff, most of which we don't need for the image/edits application.

A BookKeeper-based solution doesn't exist yet, but I like that it's special-purpose, written
just to provide a minimal set of features to enable the image/edits scenario. So hopefully it
won't be a gazillion lines of code when it's done. But it will need a lot of time to stabilize
and prove itself.

NFS as shared storage for image/edits works today. It's the basis of our HA namenode strategy
at Facebook for the foreseeable future. It's hard to debate the merits of HDFS vs. BK for
image/edits storage, since neither exists yet; we're comparing unicorns to leprechauns. But my
operational instincts and experience running Hadoop tell me that either BK or a separate HDFS
would be best. I'd need to understand better what the operational characteristics of each
system would be, though, and those are not well-defined yet.

I'm looking forward to more discussion and hearing various viewpoints.
> Proposal to store edits and checkpoints inside HDFS itself for namenode HA
> --------------------------------------------------------------------------
>                 Key: HDFS-2601
>                 URL: https://issues.apache.org/jira/browse/HDFS-2601
>             Project: Hadoop HDFS
>          Issue Type: New Feature
>          Components: name-node
>            Reporter: Karthik Ranganathan
> Would have liked to make this a "brainstorming" JIRA but couldn't find the option for
some reason.
> I have talked to quite a few people about this proposal at Facebook internally (HDFS
folks like Hairong and Dhruba, as well as HBase folks interested in this feature), and wanted
to broaden the audience.
> At the core of the HA feature, we need 2 things:
> A. the secondary NN (or avatar stand-by or whatever we call it) needs to read all the
fsedits and fsimage data written by the primary NN
> B. Once the stand-by has taken over, the old NN should not be allowed to make any edits
> The basic idea is as follows (there are some variants; we can home in on the details
if we like the general approach):
> 1. The write path for fsedits and fsimage: 
> 1.1 The NN uses a DFS client to write fsedits and fsimage. These will be regular HDFS
files written using the write pipeline (see the write/read sketch after the quoted proposal).
> 1.2 Let us say the fsimage and fsedits files are written to a well-known location in
the local HDFS itself (say /.META or some such location)
> 1.3 The file creates and block additions under this path are not themselves recorded in
fsimage or fsedits. The locations of the blocks for the files under this path are known to
all namenodes - primary and standby - somehow (some possibilities: write these block ids to
ZK, use reserved block ids, or write some metadata into the blocks themselves and store the
blocks in a well-known location on all the datanodes)
> 1.4 If the replication factor on the write pipeline decreases, we close the block immediately
and allow the NN to re-replicate to bring the replication factor back up. We continue writing
to a new block.
> 2. The read path on a NN failure
> 2.1 Since the new NN "knows" the locations of the blocks for the fsedits and fsimage (again,
via the same possibilities as mentioned above), there is nothing extra to do to locate them
> 2.2 It can read the files it needs using the HDFS client itself
> 3. Fencing - if a NN is unresponsive and a new NN takes over, the old NN should not be
allowed to perform any action
> 3.1 Use HDFS lease recovery for the fsedits and fsimage files - the new NN will close all
these files being written to by the old NN (and hence all the blocks); see the fencing sketch
after the quoted proposal
> 3.2 The new NN (avatar NN) will write its address into ZK to let everyone know it's the
new master
> 3.3 The new NN now gets the lease for these files and starts writing into the fsedits
and fsimage
> 3.4 The old NN cannot write into the file, as the block it was writing to was closed and
it does not have the lease. If it needs to re-open these files, it needs to check ZK to see
if it is indeed the current master; if not, it should exit.
> 4. Misc considerations:
> 4.1 If needed, we can specify favored nodes to place the blocks for this data on a specific
set of nodes (say we want to use a different set of RAIDed nodes, etc).
> 4.2 Since we won't record the entries for /.META in fsedits and fsimage, a "hadoop dfs
-ls /" command won't show the files. This is probably ok, and can be fixed if not.
> 4.3 If we have 256MB block sizes, then a 20GB fsedits file would need 80 block ids (20GB /
256MB = 80) - the NN would need only these 80 block ids to read all the fsedits data. The
fsimage data is even smaller. This is very tractable using a variety of techniques (the
possibilities mentioned above).
> The advantage is that we are re-using the existing HDFS client (with some enhancements,
of course), and making the solution self-sufficient on top of the existing HDFS. Also, the
operational complexity is greatly reduced.
> Thoughts?
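
A minimal sketch of steps 1.1 and 2.2 above, in Java against the stock Hadoop FileSystem
API. The path /.META/edits is the hypothetical well-known location from step 1.2, the record
format is made up, and a real NN would presumably drive DFSClient directly; this also assumes
a release with working append() and hflush() (older releases spelled the latter sync()):

    import org.apache.hadoop.conf.Configuration;
    import org.apache.hadoop.fs.FSDataInputStream;
    import org.apache.hadoop.fs.FSDataOutputStream;
    import org.apache.hadoop.fs.FileSystem;
    import org.apache.hadoop.fs.Path;

    public class MetaEditsSketch {
      // Hypothetical well-known location from step 1.2.
      private static final Path EDITS = new Path("/.META/edits");

      // Write path (1.1): push an edit record through the normal HDFS
      // write pipeline and flush it to the datanodes.
      static void writeEdit(FileSystem fs, byte[] record) throws Exception {
        FSDataOutputStream out =
            fs.exists(EDITS) ? fs.append(EDITS) : fs.create(EDITS);
        try {
          out.write(record);
          out.hflush(); // make the record visible/durable in the pipeline
        } finally {
          out.close();
        }
      }

      // Read path (2.2): the standby replays the same file with the
      // ordinary HDFS client; no special read machinery is needed.
      static void replayEdits(FileSystem fs) throws Exception {
        FSDataInputStream in = fs.open(EDITS);
        try {
          byte[] buf = new byte[4096];
          for (int n; (n = in.read(buf)) > 0; ) {
            // apply(buf, n): hand each chunk to the edit-log replayer
          }
        } finally {
          in.close();
        }
      }

      public static void main(String[] args) throws Exception {
        FileSystem fs = FileSystem.get(new Configuration());
        writeEdit(fs, "OP_MKDIR /foo".getBytes("UTF-8"));
        replayEdits(fs);
      }
    }

Re-opening the file with append() per record is of course far too slow for a real edit log -
a real implementation would hold the stream open - but the point is that the ordinary write
pipeline and client do all the work.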

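A similar sketch for the fencing in section 3: step 3.1 via
DistributedFileSystem.recoverLease(), which forcibly closes the old writer's last block, plus
a ZooKeeper znode for steps 3.2 and 3.4. The znode path and file names are illustrative, and
whether recoverLease() is publicly exposed in a given release is an assumption:

    import org.apache.hadoop.fs.Path;
    import org.apache.hadoop.hdfs.DistributedFileSystem;
    import org.apache.zookeeper.CreateMode;
    import org.apache.zookeeper.KeeperException;
    import org.apache.zookeeper.ZooDefs;
    import org.apache.zookeeper.ZooKeeper;

    public class FencingSketch {
      // Hypothetical znode where the active NN publishes its address (3.2).
      static final String MASTER_ZNODE = "/hdfs-ha/master";

      // 3.1: force lease recovery so the old writer's last block is
      // closed and its lease revoked before the new NN starts writing.
      static void fenceFiles(DistributedFileSystem dfs) throws Exception {
        dfs.recoverLease(new Path("/.META/edits"));
        dfs.recoverLease(new Path("/.META/fsimage"));
      }

      // 3.2: claim mastership with an ephemeral znode holding our address;
      // the create fails if a live master is already registered.
      static boolean claimMastership(ZooKeeper zk, String myAddress) throws Exception {
        try {
          zk.create(MASTER_ZNODE, myAddress.getBytes("UTF-8"),
              ZooDefs.Ids.OPEN_ACL_UNSAFE, CreateMode.EPHEMERAL);
          return true;
        } catch (KeeperException.NodeExistsException e) {
          return false; // another NN got there first
        }
      }

      // 3.4: before re-opening fsedits, the old NN checks that it is
      // still the registered master; if not, it must exit.
      static void exitIfFenced(ZooKeeper zk, String myAddress) throws Exception {
        byte[] data = zk.getData(MASTER_ZNODE, false, null);
        if (!myAddress.equals(new String(data, "UTF-8"))) {
          System.exit(1); // fenced off: a new master has taken over
        }
      }
    }

Using an ephemeral znode means the registration disappears if the new master itself dies, so
the next failover can claim mastership the same way.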

