Date: Tue, 10 Jun 2014 05:48:05 +0000 (UTC)
From: "Todd Lipcon (JIRA)"
To: hdfs-issues@hadoop.apache.org
Subject: [jira] [Commented] (HDFS-6469) Coordinated replication of the namespace using ConsensusNode

[ https://issues.apache.org/jira/browse/HDFS-6469?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14026135#comment-14026135 ]

Todd Lipcon commented on HDFS-6469:
-----------------------------------

I have a few concerns about this:

h3. Fine grained locking

As Suresh mentioned above, I'm concerned that the consensus engine must fully serialize all write operations on the namespace. We currently do this anyway by means of the FSN lock, but there is already work ongoing by Daryn and others at Yahoo to make the locking more fine-grained.
If the consensus engine fully serializes everything into a single request stream, won't it become somewhat more difficult to get fine-grained locking working? Based on what I've seen in many production clusters, the single lock and the RPC system _are_ the bottleneck for many workloads, especially when some users perform heavy operations like removing a large directory tree or listing a large directory.

h3. Coordinated reads

{quote}
We assume that only a few special files, like job.xml, should be exposed to coordinated reads. Therefore, an administrator can configure a set of patterns, which is recognized by a CNode, and when it sees a file name matching the configured pattern it initiates a coordinated read for that file.
{quote}

This behavior makes me fairly nervous. It assumes that administrators know the full set of applications that will run on the cluster and demand consistency. I don't think this is always a fair assumption, and the behavior when misconfigured will be very difficult to debug or diagnose.

h3. Double journaling

Another issue with this design is that it doubles the amount of journaling required. The consensus engine, to properly implement something like Paxos, needs to keep one journal, while the NN keeps another. These might be put on separate disks to minimize seeks, but that still implies that write latency will be doubled as well. I'm surprised you haven't seen this in your benchmarks, unless you're running on a very fast device like PCIe flash.

h3. Non-determinism

Section 5 of the design doc discusses the many places where the NameNode is not currently fully deterministic. As you've described, a fair amount of work is necessary to fix these (eg ensuring that block placement decisions agree even if the heartbeats from datanodes arrive at slightly different times).
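To make the heartbeat-ordering problem concrete, here's a toy sketch (not actual NameNode code; {{chooseTarget}} and its inputs are invented for illustration) of how a placement policy that ranks datanodes by heartbeat recency diverges across namespace replicas:

```java
import java.util.LinkedHashMap;
import java.util.Map;

// Hypothetical sketch: a placement policy that prefers the most recently
// heard-from datanode. Two replicas of the namespace that receive the
// same heartbeats in a different order will pick different targets, so
// the decision is non-deterministic across nodes.
public class PlacementSketch {
    static String chooseTarget(LinkedHashMap<String, Long> heartbeats) {
        // Prefer the datanode with the latest heartbeat timestamp; ties
        // are broken by map iteration order, i.e. by heartbeat *arrival*
        // order, which differs per node.
        String best = null;
        long bestTime = Long.MIN_VALUE;
        for (Map.Entry<String, Long> e : heartbeats.entrySet()) {
            if (e.getValue() > bestTime) {
                best = e.getKey();
                bestTime = e.getValue();
            }
        }
        return best;
    }

    public static void main(String[] args) {
        // Identical heartbeat timestamps, different arrival order.
        LinkedHashMap<String, Long> replicaA = new LinkedHashMap<>();
        replicaA.put("dn1", 100L);
        replicaA.put("dn2", 100L);

        LinkedHashMap<String, Long> replicaB = new LinkedHashMap<>();
        replicaB.put("dn2", 100L);
        replicaB.put("dn1", 100L);

        // The two namespace replicas place the same block on different
        // datanodes.
        System.out.println(chooseTarget(replicaA)); // dn1
        System.out.println(chooseTarget(replicaB)); // dn2
    }
}
```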
This determinism will become even more complex once we make the locking more fine-grained -- even something as simple as sequential block ID generation is no longer trivial to make deterministic across nodes, since two operations on distinct parts of the namespace may grab the next ID in an unspecified order. I'm wary that this determinism requirement will be a significant maintenance burden on HDFS development in the future.

h3. Questions about the advantages

- Could you explain further why this design makes it any easier to implement a distributed namespace? It seems fully orthogonal to me, and in fact, due to the determinism issues above, it may hamper the ease with which we could scale the namenode up or out.

----

h3. Comparison vs an alternate design

Overall, I'm wondering what advantages this approach might have over an alternate approach that builds on the design we've already got. For example, consider the following:

- Enable the configuration boolean that allows reads from the standby. This opens the same can of worms as the ConsensusNode, in that we need some way to ensure consistent reads when reading from slaves vs the leader. But, in the same way that you're proposing solving it, the same solutions should work for the JN-based approach.
- Add a small amount of code to the JN to allow a reader to "tail" the committed edits stream (eg a new servlet which uses chunked encoding to provide edit log "subscribers").
- Change the SBN EditLogTailer to use the above interface instead of only reading from rolled segments.

I think these changes would be less disruptive than adding a new NameNode subclass, and would give users the same benefits you're proposing here. Additionally, there would be no double-logging overhead, and we wouldn't have to worry about non-determinism in the implementation.
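The "tail the committed edits" interface could look something like the following toy sketch. All names here are made up for illustration, and it returns a list in-process rather than streaming over a chunked-encoding HTTP response as the real servlet would:

```java
import java.util.ArrayList;
import java.util.List;
import java.util.concurrent.ConcurrentSkipListMap;

// Hypothetical sketch of a JournalNode-side edit "subscriber" interface:
// committed transactions are kept indexed by txid, and a standby asks
// for everything past the last txid it has seen, instead of waiting for
// a segment to be rolled.
public class EditTailSketch {
    private final ConcurrentSkipListMap<Long, String> committed =
        new ConcurrentSkipListMap<>();

    // Called when the quorum commits a transaction.
    public void commit(long txid, String op) {
        committed.put(txid, op);
    }

    // Subscriber call: return all committed ops strictly after
    // lastSeenTxid, in txid order. A real implementation would stream
    // these continuously rather than return a snapshot.
    public List<String> tail(long lastSeenTxid) {
        return new ArrayList<>(committed.tailMap(lastSeenTxid, false).values());
    }

    public static void main(String[] args) {
        EditTailSketch jn = new EditTailSketch();
        jn.commit(1, "OP_MKDIR /a");
        jn.commit(2, "OP_ADD /a/f");
        jn.commit(3, "OP_CLOSE /a/f");

        // A standby that last saw txid 1 receives ops 2 and 3.
        System.out.println(jn.tail(1)); // [OP_ADD /a/f, OP_CLOSE /a/f]
    }
}
```

The SBN's EditLogTailer would then loop on calls like {{tail(lastSeenTxid)}} instead of opening rolled segment files.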
Lastly, a fully usable solution would be available to the community at large, whereas the design you're proposing seems like it will only be usably implemented by a proprietary extension (I don't consider the ZK "reference implementation" likely to actually work in a usable fashion).

> Coordinated replication of the namespace using ConsensusNode
> ------------------------------------------------------------
>
>                 Key: HDFS-6469
>                 URL: https://issues.apache.org/jira/browse/HDFS-6469
>             Project: Hadoop HDFS
>          Issue Type: New Feature
>          Components: namenode
>    Affects Versions: 3.0.0
>            Reporter: Konstantin Shvachko
>            Assignee: Konstantin Shvachko
>        Attachments: CNodeDesign.pdf
>
>
> This is a proposal to introduce ConsensusNode - an evolution of the NameNode, which enables replication of the namespace on multiple nodes of an HDFS cluster by means of a Coordination Engine.

--
This message was sent by Atlassian JIRA
(v6.2#6252)