hadoop-common-dev mailing list archives

From "Milind Bhandarkar (JIRA)" <j...@apache.org>
Subject [jira] Commented: (HADOOP-64) DataNode should be capable of managing multiple volumes
Date Mon, 07 Aug 2006 21:40:16 GMT
    [ http://issues.apache.org/jira/browse/HADOOP-64?page=comments#action_12426338 ] 
            
Milind Bhandarkar commented on HADOOP-64:
-----------------------------------------

Proposal:

In the configuration (e.g. hadoop-site.xml), the site admin can specify a comma-separated list
of volumes as the value of the key "dfs.data.dir". These volumes are assumed to be
mounted on different disks. The total disk capacity of the datanode is therefore taken to be
the sum of the capacities of these volumes, taking into account the /dev/sda*
or /dev/hda* mapping of these volumes (i.e. not counting the same /dev/* twice).
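
To make the capacity rule concrete, here is a minimal sketch, not the proposed implementation. The class and method names (VolumeCapacitySketch, parseVolumes, totalCapacity) are hypothetical, and the device name and capacity of each volume are assumed to be supplied by the caller; how the datanode would actually look up the /dev/* mapping is not specified in this proposal.

```java
import java.util.ArrayList;
import java.util.HashSet;
import java.util.List;
import java.util.Set;

// Hypothetical helper, for illustration only.
class VolumeCapacitySketch {

    /** Splits a comma-separated "dfs.data.dir" value into trimmed, non-empty paths. */
    static List<String> parseVolumes(String dfsDataDir) {
        List<String> volumes = new ArrayList<>();
        for (String part : dfsDataDir.split(",")) {
            String path = part.trim();
            if (!path.isEmpty()) {
                volumes.add(path);
            }
        }
        return volumes;
    }

    /**
     * Sums volume capacities, counting each backing device only once.
     * devices[i] and capacities[i] describe volume i (assumed inputs).
     */
    static long totalCapacity(String[] devices, long[] capacities) {
        Set<String> seen = new HashSet<>();
        long total = 0;
        for (int i = 0; i < devices.length; i++) {
            // Set.add returns false if the device was already counted.
            if (seen.add(devices[i])) {
                total += capacities[i];
            }
        }
        return total;
    }
}
```

With two volumes on /dev/sda1 and one on /dev/sdb1, only one of the /dev/sda1 capacities would contribute to the total.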

New blocks are created round-robin across these volumes. The block-allocation policy is controlled
by a separable piece of code, so that different policies can be substituted at runtime later.
The mapping of data blocks to volume-id is kept in the datanode's memory. When the datanode comes
up again, it rediscovers this mapping by reading the specified volumes. Later, when the datanode is
also periodically checkpointed, this mapping will be stored in the checkpoint as well.
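
One possible shape for the pluggable policy and the in-memory mapping is sketched below. All names here (BlockAllocationPolicy, RoundRobinPolicy, BlockVolumeMap) are hypothetical and chosen for illustration; volumes are identified simply by their index in the configured list, which is an assumption rather than part of the proposal.

```java
import java.util.HashMap;
import java.util.Map;

// Hypothetical policy interface: other policies could be substituted later.
interface BlockAllocationPolicy {
    /** Returns the index of the volume that should receive the next new block. */
    int chooseVolume(int volumeCount);
}

// Round-robin: cycles through volume indices 0, 1, ..., volumeCount - 1.
class RoundRobinPolicy implements BlockAllocationPolicy {
    private int next = 0;

    @Override
    public synchronized int chooseVolume(int volumeCount) {
        if (volumeCount <= 0) {
            throw new IllegalArgumentException("need at least one volume");
        }
        int chosen = next % volumeCount;
        next = (chosen + 1) % volumeCount;
        return chosen;
    }
}

// In-memory mapping of block-id to volume index (assumed representation).
class BlockVolumeMap {
    private final Map<Long, Integer> blockToVolume = new HashMap<>();
    private final BlockAllocationPolicy policy;
    private final int volumeCount;

    BlockVolumeMap(BlockAllocationPolicy policy, int volumeCount) {
        this.policy = policy;
        this.volumeCount = volumeCount;
    }

    /** Picks a volume for a new block and records the choice. */
    synchronized int allocate(long blockId) {
        int volume = policy.chooseVolume(volumeCount);
        blockToVolume.put(blockId, volume);
        return volume;
    }

    /** Returns the recorded volume index, or null if the block is unknown. */
    synchronized Integer volumeOf(long blockId) {
        return blockToVolume.get(blockId);
    }
}
```

Keeping the policy behind an interface means the map does not need to change if a different allocation strategy (for example, one based on free space) is tried later.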

Each volume is further automatically split into multiple subdirectories. The number of these
directories is configurable and should be a power of 2, so that the last x bits of a block-id
determine which subdirectory the block is stored in. This is the scheme used in
Mike's patch for HADOOP-50.
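
The "last x bits" rule can be sketched as a bit mask. This is an illustration under the assumption that block-ids are 64-bit longs; the class and method names are hypothetical and are not taken from the HADOOP-50 patch.

```java
// Hypothetical helper, for illustration only.
class SubdirMapper {

    /**
     * Maps a block-id to a subdirectory index using its low-order bits.
     * subdirCount must be a power of 2, so (subdirCount - 1) is a bit mask
     * selecting the last x bits, where subdirCount == 2^x.
     */
    static int subdirIndex(long blockId, int subdirCount) {
        if (subdirCount <= 0 || (subdirCount & (subdirCount - 1)) != 0) {
            throw new IllegalArgumentException("subdirCount must be a power of 2");
        }
        return (int) (blockId & (subdirCount - 1));
    }
}
```

For example, with 16 subdirectories a block-id of 0x1234 falls into subdirectory 4, since only the last 4 bits are used. Masking also works for negative block-ids, which a modulo operation would not handle without extra care.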

If a datanode is re-configured with a different number (or different locations) of volumes for dfs.data.dir,
the blocks stored in the earlier locations are considered by the datanode to be lost (when, in
future, the datanode is checkpointed, it will try to recover those "lost" blocks). If one
of the volumes is read-only, the datanode will currently be considered dead only with respect to
that volume, i.e. it will continue to store blocks in the read-write volumes, but blocks
in the read-only volumes will be considered lost, since they cannot be deleted.
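
As a rough sketch of the rule above: a block counts as lost when the volume it was recorded on is no longer among the currently configured read-write volumes. The names below are hypothetical, and identifying a volume by its path string is an assumption made only for this example.

```java
import java.util.Map;
import java.util.Set;
import java.util.TreeSet;

// Hypothetical helper, for illustration only.
class LostBlockSketch {

    /**
     * Returns the ids of blocks whose recorded volume is not in the set of
     * currently configured read-write volumes (removed or read-only volumes).
     */
    static Set<Long> lostBlocks(Map<Long, String> blockToVolume,
                                Set<String> readWriteVolumes) {
        Set<Long> lost = new TreeSet<>();
        for (Map.Entry<Long, String> entry : blockToVolume.entrySet()) {
            if (!readWriteVolumes.contains(entry.getValue())) {
                lost.add(entry.getKey());
            }
        }
        return lost;
    }
}
```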

Please comment on this proposal asap, so that I can go ahead with implementation.


> DataNode should be capable of managing multiple volumes
> -------------------------------------------------------
>
>                 Key: HADOOP-64
>                 URL: http://issues.apache.org/jira/browse/HADOOP-64
>             Project: Hadoop
>          Issue Type: Improvement
>          Components: dfs
>    Affects Versions: 0.2.0
>            Reporter: Sameer Paranjpye
>         Assigned To: Milind Bhandarkar
>            Priority: Minor
>             Fix For: 0.6.0
>
>
> The dfs Datanode can only store data on a single filesystem volume. When a node runs
> its disks JBOD this means running a Datanode per disk on the machine. While the scheme works
> reasonably well on small clusters, on larger installations (several hundred nodes) it implies
> a very large number of Datanodes with associated management overhead in the Namenode.
> The Datanode should be enhanced to be able to handle multiple volumes on a single machine.

-- 
This message is automatically generated by JIRA.
-
If you think it was sent incorrectly contact one of the administrators: http://issues.apache.org/jira/secure/Administrators.jspa
-
For more information on JIRA, see: http://www.atlassian.com/software/jira

        
