hadoop-hdfs-issues mailing list archives

From "Todd Lipcon (JIRA)" <j...@apache.org>
Subject [jira] [Commented] (HDFS-6469) Coordinated replication of the namespace using ConsensusNode
Date Tue, 10 Jun 2014 05:48:05 GMT

    [ https://issues.apache.org/jira/browse/HDFS-6469?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14026135#comment-14026135 ]

Todd Lipcon commented on HDFS-6469:
-----------------------------------

I have a few concerns about this:

h3. Fine-grained locking
As Suresh mentioned above, I'm concerned that the consensus engine must fully serialize all
write operations on the namespace. We already do this today by means of the FSN lock, but
Daryn and others over at Yahoo have ongoing work to make the locking more fine-grained. If
the consensus engine fully serializes everything into a single request stream, won't that
make fine-grained locking considerably harder to achieve?

Based on what I've seen in many production clusters, the single lock and the RPC system _are_
the bottleneck for many workloads, especially when some users perform heavy operations like
removal of a large directory tree or listing a large dir.
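To make the tension concrete, here is a toy sketch (all names invented, not actual HDFS code) contrasting the single FSN-style lock with per-subtree locks. The key point is in {{applyOrdered}}: if ops arrive as a totally ordered stream from a consensus engine, they must be applied in log order, so the per-subtree locks buy no write concurrency even for ops on disjoint directories.

```java
import java.util.List;
import java.util.concurrent.locks.ReentrantReadWriteLock;

// Hypothetical sketch (names invented): contrast a single FSN-style lock
// with per-subtree locks, and show why a consensus-ordered op stream
// defeats the latter for writes.
public class LockingSketch {
    // Coarse model of today's FSNamesystem lock: one lock for everything.
    static final ReentrantReadWriteLock fsnLock = new ReentrantReadWriteLock();

    // Fine-grained alternative: a lock stripe per top-level subtree.
    static final ReentrantReadWriteLock[] subtreeLocks = new ReentrantReadWriteLock[16];
    static {
        for (int i = 0; i < subtreeLocks.length; i++) {
            subtreeLocks[i] = new ReentrantReadWriteLock();
        }
    }

    static ReentrantReadWriteLock lockFor(String path) {
        String top = path.split("/")[1];  // e.g. "/user/a" -> "user"
        return subtreeLocks[Math.floorMod(top.hashCode(), subtreeLocks.length)];
    }

    // Applying a consensus-ordered stream: ops MUST be applied in log
    // order, one at a time, so writes on disjoint subtrees are still
    // serialized despite the fine-grained locks.
    public static void applyOrdered(List<String> opPaths, List<String> applied) {
        for (String p : opPaths) {
            ReentrantReadWriteLock l = lockFor(p);
            l.writeLock().lock();
            try {
                applied.add(p);  // stand-in for the namespace mutation
            } finally {
                l.writeLock().unlock();
            }
        }
    }
}
```

This is only an illustration of the structural constraint, not a claim about how either codebase actually arranges its locks.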

h3. Coordinated reads
{quote}
We assume that only a few special files, like job.xml, should be exposed to coordinated reads. Therefore, an administrator can configure a set of patterns, which is recognized by a CNode, and when it sees a file name matching the configured pattern it initiates a coordinated read for that file.
{quote}

This behavior makes me fairly nervous. It assumes that administrators know, in advance, the
full set of applications that will run on the cluster and which of them demand consistency.
I don't think that's always a fair assumption, and the behavior when the patterns are
misconfigured will be very difficult to debug or diagnose.
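The failure mode is easy to see in a sketch of the pattern check the design doc describes (class and method names are invented here): a file the admin anticipated gets a coordinated read, while any file that falls outside the configured patterns silently gets a possibly-stale read, with no error anywhere.

```java
import java.util.List;
import java.util.regex.Pattern;
import java.util.stream.Collectors;

// Hypothetical sketch of the admin-configured coordinated-read check
// from the design doc (names invented, not actual CNode code).
public class CoordinatedReadPolicy {
    private final List<Pattern> patterns;

    public CoordinatedReadPolicy(List<String> regexes) {
        patterns = regexes.stream()
                          .map(Pattern::compile)
                          .collect(Collectors.toList());
    }

    // True -> coordinate the read through the consensus engine.
    // False -> serve locally, which may silently return stale data.
    public boolean needsCoordinatedRead(String path) {
        return patterns.stream().anyMatch(p -> p.matcher(path).matches());
    }
}
```

With a pattern list of {{.*/job\.xml}}, a read of {{/user/x/job.xml}} is coordinated, but a read of {{/user/x/app.conf}} by an application the admin didn't anticipate is not, and nothing flags the inconsistency.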

h3. Double journaling
Another issue with this design is that it doubles the amount of journaling required. The consensus
engine, to properly implement something like Paxos, needs to keep one journal, while the NN
keeps another. These might be put on separate disks to minimize seeks, but that does imply
that latency will be doubled as well. I'm surprised that you haven't seen this in your benchmarks,
unless you're running on a very fast device like PCIe flash.
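The latency arithmetic is simple (the numbers below are illustrative assumptions, not measurements): if the consensus engine must durably log an op before the NN appends it to its own edit log, the two fsyncs happen back to back, so they add even when the journals live on separate disks.

```java
// Back-of-the-envelope sketch of the double-journaling cost. The two
// durable writes are sequential (consensus commit must precede the NN's
// own edit-log append), so per-op commit latency is their sum.
public class JournalLatency {
    public static double commitLatencyMs(double consensusFsyncMs,
                                         double nnEditLogFsyncMs) {
        return consensusFsyncMs + nnEditLogFsyncMs;  // sequential, not parallel
    }
}
```

At HDD-class fsync times of ~5ms each, that's ~10ms per committed op instead of ~5ms; at flash-class times of ~0.1ms each, the overhead is small enough that a benchmark could easily miss it, which is why I'm asking what hardware you tested on.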

h3. Non-determinism
Section 5 of the design doc talks about the many places in which the NameNode is not currently
fully deterministic. As you've described, there is a fair amount of work necessary to fix
these things (e.g. ensuring that block placement decisions agree even if the heartbeats from
datanodes arrive at slightly different times). Maintaining determinism will become even harder
once we make the locking more fine-grained -- even something as simple as sequential block
ID generation is no longer trivial to make deterministic across nodes, since two operations
on distinct parts of the namespace may grab the next ID in an unspecified order.
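The block-ID example can be made concrete with a toy sketch (names invented): two replicas apply the same two ops on disjoint directories, but if fine-grained locking lets the ops reach the sequential ID generator in different orders on different nodes, the same file ends up with different block IDs on each replica.

```java
import java.util.ArrayList;
import java.util.List;

// Toy model of the divergence: a sequential block-ID generator handing
// out IDs in whatever order ops happen to arrive on a given node.
public class BlockIdDivergence {
    public static List<String> applyOps(List<String> opOrder) {
        long nextBlockId = 1_000_000L;  // sequential generator, per node
        List<String> assignments = new ArrayList<>();
        for (String path : opOrder) {
            assignments.add(path + " -> blk_" + nextBlockId++);
        }
        return assignments;
    }
}
```

Node A applying ({{/user/a/f1}}, {{/tmp/b/f2}}) gives f1 the first ID; node B applying the same ops in the opposite order gives f1 the second ID, and the replicas have silently diverged.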

I'm wary that this determinism requirement will be a significant maintenance burden on HDFS
development in the future.

h3. Questions about the advantages
- Could you explain further why this design makes it any easier to implement a distributed
namespace? It seems fully orthogonal to me, and, given the determinism issues above, may
actually hamper our ability to scale the NameNode implementation up or out.

----
h3. Comparison vs an alternate design?
 
Overall, I'm wondering what advantages this approach might have over an alternate approach
that builds on the design we've already got. For example, consider the following:
- turn on the configuration boolean that allows reads from the standby. This opens the same
can of worms as the ConsensusNode, in that we need some way to ensure consistent reads when
reading from the standbys vs the leader. But the same solutions you're proposing for the
ConsensusNode should work for the JN-based approach.
- add a small amount of code to the JN to allow a reader to "tail" the committed edits stream.
(e.g. a new servlet which uses chunked encoding to provide edit log "subscribers")
- change the SBN EditLogTailer to use the above interface instead of only reading from rolled
segments.
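The subscriber side of such a "tail" interface could be quite small. Here is a sketch (the servlet, its URL, and the wire format are all invented for illustration, none of this exists in the JN today): the JN streams committed edits over chunked HTTP as length-prefixed records, and the tailer consumes them as they arrive instead of waiting for a segment roll.

```java
import java.io.DataInputStream;
import java.io.EOFException;
import java.io.IOException;
import java.nio.charset.StandardCharsets;
import java.util.ArrayList;
import java.util.List;

// Hypothetical subscriber for a JN edit-stream servlet. Records are
// assumed to be framed as [int length][payload bytes]; the real edit
// log op encoding would be used in practice.
public class EditStreamTailer {
    public static List<String> tail(DataInputStream in) throws IOException {
        List<String> edits = new ArrayList<>();
        try {
            while (true) {
                int len = in.readInt();
                byte[] buf = new byte[len];
                in.readFully(buf);
                edits.add(new String(buf, StandardCharsets.UTF_8));
            }
        } catch (EOFException eof) {
            // JN closed the chunked response; return what we have so far.
        }
        return edits;
    }
}
```

The SBN's EditLogTailer would feed the decoded ops to its existing apply path; the only genuinely new piece is the framing and the servlet, which is why I'd expect this to be a much smaller change than a new NameNode subclass.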

I think these changes would be less disruptive than adding a new NameNode subclass, and give
users the same benefits as you're proposing here. Additionally, there would be no double-logging
overhead, and we wouldn't have to worry about non-determinism in the implementation. Lastly,
a fully usable solution would be available to the community at large, whereas the design you're
proposing seems like it will only be usably implemented by a proprietary extension (I don't
consider the ZK "reference implementation" likely to actually work in a usable fashion).


> Coordinated replication of the namespace using ConsensusNode
> ------------------------------------------------------------
>
>                 Key: HDFS-6469
>                 URL: https://issues.apache.org/jira/browse/HDFS-6469
>             Project: Hadoop HDFS
>          Issue Type: New Feature
>          Components: namenode
>    Affects Versions: 3.0.0
>            Reporter: Konstantin Shvachko
>            Assignee: Konstantin Shvachko
>         Attachments: CNodeDesign.pdf
>
>
> This is a proposal to introduce ConsensusNode - an evolution of the NameNode, which enables
replication of the namespace on multiple nodes of an HDFS cluster by means of a Coordination
Engine.



--
This message was sent by Atlassian JIRA
(v6.2#6252)
