hadoop-hdfs-dev mailing list archives

From "Yan (JIRA)" <j...@apache.org>
Subject [jira] [Created] (HDFS-6121) Support of "mount" onto HDFS directories
Date Wed, 19 Mar 2014 01:30:44 GMT
Yan created HDFS-6121:

             Summary: Support of "mount" onto HDFS directories
                 Key: HDFS-6121
                 URL: https://issues.apache.org/jira/browse/HDFS-6121
             Project: Hadoop HDFS
          Issue Type: Improvement
          Components: datanode
            Reporter: Yan

Currently, HDFS can only be configured to store its data on one or more existing local file system
directories. This largely abstracts the physical disk drives away from HDFS users.

While this may be convenient for data movement, manipulation, management, and formatting, it
could deprive users of a way to access physical disks in a more directly controlled manner.

For instance, a multi-threaded server may wish to have each thread access its disk blocks
sequentially in order to avoid random I/O. If the cluster boxes have multiple physical disk
drives and the server load is largely I/O-bound, it is quite reasonable to hope for disk
performance typical of sequential I/O. Disk read-ahead and/or buffering at various layers
may alleviate the problem to some degree, but cannot eliminate it entirely. This could
severely hurt the performance of workloads that need to scan data.

Map/Reduce may experience the same problem as well.

As another example, HBase region servers may wish to scan the disk data for each region
sequentially, again to avoid random I/O. HBase's own limitations in this regard aside, one major
obstacle is that HDFS cannot specify mappings of local directories to HDFS directories.
Specifically, the "dfs.data.dir" configuration setting only allows for a mapping from one
or multiple local directories to the HDFS root directory. On data nodes with multiple
disk drives mounted as multiple local file system directories, the HDFS data will
be spread across all disk drives in a more or less random manner, potentially resulting in random
I/O when a multi-threaded server reads multiple data blocks in each thread.
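
To make the spreading concrete, here is a minimal Python sketch. It assumes a simple round-robin
choice among the configured local directories, which is a simplification for illustration and not
a statement of how the data node actually picks a directory. The mount points and block names
are hypothetical.

```python
from itertools import cycle


def place_blocks_round_robin(block_ids, data_dirs):
    """Assign each block to the next local directory in turn (illustrative only)."""
    next_dir = cycle(data_dirs)
    return {block: next(next_dir) for block in block_ids}


# Hypothetical mount points, one per physical disk, as they might be
# listed in "dfs.data.dir".
data_dirs = ["/disk1/dfs/data", "/disk2/dfs/data", "/disk3/dfs/data"]

# Hypothetical block names for data scanned by a single thread.
blocks = [f"blk_{i}" for i in range(6)]

placement = place_blocks_round_robin(blocks, data_dirs)
for block, directory in placement.items():
    print(block, "->", directory)
```

Under this assumption, consecutive blocks read by one thread land on different disks, so several
threads scanning at once end up interleaving their requests on every disk.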

A seemingly simple enhancement is to introduce mappings from one or multiple local FS
directories to a single HDFS directory, plus the necessary sanity checks, replication policies,
advice on best practices, etc., of course. Note that this should be a one-to-one or
many-to-one mapping from local to HDFS directories. The other way around, though probably
feasible, won't serve our purpose at all. This is similar to mounting different disks
onto different local FS directories, and will give users an option to place and access
their data in a more controlled and efficient way.
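
A rough Python sketch of what such a mapping and its sanity check could look like follows. The
function names, the table layout, and the example paths are all hypothetical; nothing here
corresponds to an existing HDFS setting or API.

```python
def build_mount_table(mounts):
    """Check a mapping of HDFS directory -> list of local directories.

    Returns the reverse lookup (local directory -> HDFS directory) and raises
    ValueError if one local directory is mapped to two HDFS directories,
    since only one-to-one and many-to-one mappings are allowed.
    """
    local_to_hdfs = {}
    for hdfs_dir, local_dirs in mounts.items():
        if not local_dirs:
            raise ValueError(f"{hdfs_dir} has no local directories")
        for local_dir in local_dirs:
            owner = local_to_hdfs.get(local_dir)
            if owner is not None and owner != hdfs_dir:
                raise ValueError(
                    f"{local_dir} is mapped to both {owner} and {hdfs_dir}"
                )
            local_to_hdfs[local_dir] = hdfs_dir
    return local_to_hdfs


def local_dirs_for(hdfs_path, mounts, default_dirs):
    """Return the local directories backing hdfs_path (longest mount prefix wins)."""
    best = None
    for hdfs_dir in mounts:
        base = hdfs_dir.rstrip("/")
        if hdfs_path == base or hdfs_path.startswith(base + "/"):
            if best is None or len(hdfs_dir) > len(best):
                best = hdfs_dir
    return mounts[best] if best is not None else default_dirs


# Hypothetical layout: one HBase region per disk set.
mounts = {
    "/hbase/region_a": ["/disk1/dfs/data"],                     # one-to-one
    "/hbase/region_b": ["/disk2/dfs/data", "/disk3/dfs/data"],  # many-to-one
}
default_dirs = ["/disk4/dfs/data"]  # everything that is not mounted

reverse = build_mount_table(mounts)
print(local_dirs_for("/hbase/region_a/cf/file1", mounts, default_dirs))
print(local_dirs_for("/tmp/scratch", mounts, default_dirs))
```

In this sketch, data under a mounted HDFS directory stays on the disks listed for it, while
unmounted paths fall back to a default set of directories.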

Conceptually, this option will allow local physical data partitioning in a distributed environment
for application data on HDFS.

This message was sent by Atlassian JIRA
