hadoop-hdfs-dev mailing list archives

From Sanjay Radia <sra...@yahoo-inc.com>
Subject Re: [Discuss] Merge federation branch HDFS-1052 into trunk
Date Wed, 27 Apr 2011 00:26:45 GMT
On Apr 25, 2011, at 2:36 PM, Doug Cutting wrote:
> A couple of questions:
> 1. Can you please describe the significant advantages this approach  
> has
> over a symlink-based approach?
> It seems to me that one could run multiple namenodes on separate boxes
> and run multiple datanode processes per storage box configured with
> something like:
> .......


There are two separate issues; your email seems to suggest that these are joined:
(1) creating (or not) a unified namespace
(2) sharing the storage and the block storage layer across NameNodes - the architecture document covers this layering in great detail.
This separation reflects the architecture of HDFS (derived from GFS), where the namespace layer is separate from the block storage layer (although the HDFS implementation violates the layering in many places).
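To make the layering concrete, here is a minimal sketch of the split: the namespace layer maps paths to block ids, while the block layer stores bytes by id and knows nothing about paths. The class and method names are illustrative only, not HDFS's actual classes.

```java
import java.util.HashMap;
import java.util.Map;

// Hypothetical sketch of the two-layer split described above.
// Names are illustrative; HDFS's real implementation differs.
public class Layers {
    // Block storage layer: block id -> block data, no path knowledge.
    static final Map<Long, byte[]> blockStore = new HashMap<>();
    // Namespace layer: file path -> the block ids that make up the file.
    static final Map<String, long[]> namespace = new HashMap<>();

    static void writeFile(String path, long blockId, byte[] data) {
        blockStore.put(blockId, data);            // block-layer operation
        namespace.put(path, new long[]{blockId}); // namespace operation
    }

    static byte[] readFile(String path) {
        long[] ids = namespace.get(path);  // resolve path to block ids
        return blockStore.get(ids[0]);     // fetch data from block layer
    }
}
```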

HDFS-1052 deals with (2) - allowing multiple NameNodes to share the  
block storage layer.

As far as (1), creating a unified namespace,  federation does NOT  
dictate how you create a unified namespace or whether you even create  
a unified namespace in the first place. Indeed you may want to share  
the physical storage but want independent namespaces. For example, you  
may want to run a private namespace for HBase files within the same  
Hadoop cluster. Two different tenants sharing a cluster may choose to  
have their independent namespaces for isolation.

Of course in many situations one wants to create a unified namespace. One could create a unified namespace using symbolic links as you suggest. The federation work has also added client-side mount tables (HDFS-1053), an implementation of FileSystem and AbstractFileSystem. It offers advantages over symbolic links, but this is separable - you can use symbolic links if you like. HDFS-1053 (client-side mount tables) makes no changes to any existing file system.
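The idea behind a client-side mount table can be sketched as a longest-prefix lookup that rewrites a client path to a per-namenode URI. This is a toy illustration of the concept only, not the HDFS-1053 API; the mount-point names and URIs below are invented.

```java
import java.util.Map;
import java.util.TreeMap;

// Hypothetical client-side mount table: path prefixes map to namenode
// URIs; a path is resolved against the longest matching mount point.
// (Assumes at least one mount point matches the given path.)
public class MountTable {
    static final TreeMap<String, String> mounts = new TreeMap<>();

    static String resolve(String path) {
        String best = null;
        for (Map.Entry<String, String> e : mounts.entrySet()) {
            // keep the longest mount prefix that matches the path
            if (path.startsWith(e.getKey())
                    && (best == null || e.getKey().length() > best.length())) {
                best = e.getKey();
            }
        }
        // rewrite /prefix/rest -> targetUri + /rest
        return mounts.get(best) + path.substring(best.length());
    }
}
```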

Now getting to (2),  sharing the physical storage and the block  
storage layer.
The approach you describe (running multiple DNs on the same machine, which is essentially multiple super-imposed HDFS clusters) is the most common reaction to this work, and one which we also explored. Unfortunately it runs into several issues; when you explore the details you realize that it is essentially a hack.
- Extra DN processes on the same machine take precious memory away from MR tasks.
- Independent pools of threads for each DN.
- Not being able to schedule disk operations across multiple DNs.
- Not being able to provide a unified view of balancing or decommissioning. For example, one could run multiple balancers, but this gives you less control over the bandwidth used for balancing.
- The disk-fail-in-place work and the balance-disks-on-introducing-a-new-disk work would become more complicated to coordinate across DNs.
- Federation allows the cluster to be managed as a unit rather than as a bunch of overlapping HDFS clusters. Overlapping HDFS clusters would be operationally taxing.

On the other hand, the new architecture generalizes the block storage layer and allows us to evolve it to address new needs. For example, it will allow us to address issues like offering tmp storage for intermediate MR output - one can allocate a block pool for MR tmp storage on each DN. HBase could also use the block storage layer directly without going through a NameNode.

> 2. ....  The patch modifies much
> of the logic of Hadoop's central component, upon which the performance
> and reliability of most other components of the ecosystem depend.

Changes to the code base
- The fundamental code change is to extend the notion of block id to include a block pool id.
- The NN had little change; the protocols did change to include the block pool id.
- The DN code did change. Each data structure is now indexed by the block pool id -- while this is a code change, it is architecturally very simple and low risk.
- We also did a fair amount of cleanup of the threads used to send block reports - while the cleanup was not strictly necessary, we took the extra effort to pay down the technical debt. As Dhruba recently noted, adding support to send block reports to primary and secondary NNs for HA will now be much easier to do.
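The fundamental change - extending the block id with a block pool id - can be sketched as a compound key, so that the same numeric block id in two different pools names two different blocks. HDFS's actual class for this is ExtendedBlock; the minimal version below is our own illustration.

```java
import java.util.Objects;

// Hypothetical minimal version of an extended block id: a block is now
// identified by (blockPoolId, blockId), so DN data structures keyed by
// this type are naturally partitioned per block pool.
public class PoolBlock {
    final String blockPoolId;
    final long blockId;

    PoolBlock(String blockPoolId, long blockId) {
        this.blockPoolId = blockPoolId;
        this.blockId = blockId;
    }

    @Override public boolean equals(Object o) {
        if (!(o instanceof PoolBlock)) return false;
        PoolBlock b = (PoolBlock) o;
        // equal only when both the pool id and the block id match
        return blockId == b.blockId && blockPoolId.equals(b.blockPoolId);
    }

    @Override public int hashCode() {
        return Objects.hash(blockPoolId, blockId);
    }
}
```

With such a key, the same block id 7 in pools "poolA" and "poolB" occupies two distinct map entries, which is the per-pool indexing described above.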

The write and read pipelines - which are performance critical - have NOT changed.

>  It seems to me that such an invasive change should be well tested  
> before it
> is merged to trunk.  Can you please tell me how this has been tested
> beyond unit tests?

Risk, Quality & Testing
Besides the amount of code change, one has to ask the fundamental question: how good is the design, and how is the project managed?
Conceptually, federation is very simple: pools of blocks are owned by a service (a NN in this case) and the block id is extended by an identifier called the block-pool id.
First and foremost, we wrote a very extensive architecture document - more comprehensive than any other document in Hadoop in the past. This was published very early: version 1 in March 2010 and version 5 in April 2010, based on feedback we received from the community. We sought and incorporated feedback from HDFS developers outside of Yahoo.

The project was managed as a separate branch rather than introducing the code to trunk incrementally. The branch has also been tested as a separate unit by us - this ensures that it does not destabilize trunk.

More details on testing.
The same QA process that drove and tested key stable Apache Hadoop   
releases (16, 17, 18, 20, 20-security) is being used for testing the  
federation feature. We have been running integrated tests with  
federation for a few months and continue to do so.
We will not deploy a Hadoop release with the federation feature in  
Yahoo clusters until we are confident that it is stable and reliable  
for our clusters. Indeed the level of testing is significantly more  
than in previous releases.

Hopefully the above addresses your concerns.

