hadoop-common-commits mailing list archives

From Apache Wiki <wikidi...@apache.org>
Subject [Hadoop Wiki] Update of "DFS_requirements" by KonstantinShvachko
Date Sun, 21 Mar 2010 02:20:58 GMT
Dear Wiki user,

You have subscribed to a wiki page or wiki category on "Hadoop Wiki" for change notification.

The "DFS_requirements" page has been changed by KonstantinShvachko.
http://wiki.apache.org/hadoop/DFS_requirements?action=diff&rev1=11&rev2=12

--------------------------------------------------

    a. A document is needed describing general procedures required to transition DFS from
one software version to another. <<BR>> [[Hadoop_Upgrade]], [[http://issues.apache.org/jira/browse/HADOOP-702|HADOOP-702]]
__Done__.
   1. (Reliability) The name node should boost the '''priority of re-replicating blocks'''
that are far from their replication target. If necessary it should delay requests for new
blocks, opening files etc., in favor of re-replicating blocks that are close to being lost
forever. <<BR>>[[http://issues.apache.org/jira/browse/HADOOP-659|HADOOP-659]]
__Done__.
   1. (Functionality) Currently DFS supports only exclusive, on-create '''file appends'''.
We need more general appends that would allow re-opening files for appending. Our plan is
to implement this in two steps:
-   a. Exclusive appends.
-   a. Concurrent appends. <<BR>>[[http://issues.apache.org/jira/browse/HADOOP-1700|HADOOP-1700]],
[[http://issues.apache.org/jira/browse/HDFS-265|HDFS-265]] __Done__.
+   a. Exclusive appends. <<BR>>[[http://issues.apache.org/jira/browse/HADOOP-1700|HADOOP-1700]],
[[http://issues.apache.org/jira/browse/HDFS-265|HDFS-265]] __Done__.
+   a. Concurrent appends.
   1. (Functionality) Support for '''“truncate”''' operation. <<BR>> ''This
is a new functionality that is not currently supported by DFS.''
   1. (Functionality) '''Configuration''':
    a. Accepting/rejecting rules for hosts and users based on regular expressions. The string
that is matched against the regular expression should include the host, user, and cluster
names. <<BR>>[[http://issues.apache.org/jira/browse/HADOOP-442|HADOOP-442]] __Done__.
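The accept/reject rule matching described in this item could look roughly like the following sketch; the class name and the `host/user/cluster` key format are illustrative assumptions, not HADOOP-442's actual implementation:

```java
import java.util.List;
import java.util.regex.Pattern;

// Hypothetical sketch: the string matched against each regular expression
// combines the host, user, and cluster names, as the requirement specifies.
public class AccessRules {
    private final List<Pattern> accept;
    private final List<Pattern> reject;

    public AccessRules(List<Pattern> accept, List<Pattern> reject) {
        this.accept = accept;
        this.reject = reject;
    }

    // Reject rules take precedence; otherwise any matching accept rule admits.
    public boolean isAllowed(String host, String user, String cluster) {
        String key = host + "/" + user + "/" + cluster;
        for (Pattern p : reject) {
            if (p.matcher(key).matches()) return false;
        }
        for (Pattern p : accept) {
            if (p.matcher(key).matches()) return true;
        }
        return false;
    }
}
```

Putting reject rules first makes a blanket accept pattern safe to combine with narrow exclusions.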
@@ -68, +68 @@

   1. (Performance) Client writes should '''flush directly to DFS''' based on the buffer size
set at creation of the stream rather than collecting data in a temporary file on a local disk.<<BR>>
[[http://issues.apache.org/jira/browse/HADOOP-66|HADOOP-66]] __Done__.
   1. (Performance) Currently '''data nodes report''' the entire list of stored blocks to
the name node once an hour. Most of this information is redundant. Processing of large
reports reduces the name node's availability for application tasks. <<BR>> Possible
solutions:
    a. Data nodes report a portion (e.g. 20%, or bounded by the total size of transmitted
data) of their blocks but (5 times) more often.
-   a. Data nodes report just the delta, with the removed blocks explicitly marked as
such. <<BR>> ~-On startup the name node restores its state from a checkpoint.
The checkpoint stores information about files and their blocks, but not the block locations.
The locations are restored from the data node reports. That is why, at startup, data nodes
need to report complete lists of stored blocks. Subsequent reports do not need to contain
all blocks, just the ones that have been modified since the last report. <<BR>>
Each data node reports its blocks at one-hour intervals. In order to avoid traffic jams, the
name node receives reports from different data nodes at different randomized times. Thus,
on e.g. a 600-node cluster the name node receives 10 reports per minute, meaning that
block list validation happens 10 times a minute. We think it is important to minimize the
reporting data size, mostly from the point of view of the receiver. <<BR>> The
name node should have a means to request complete reports from data nodes, which is required
in case the name node restarts.-~
+   a. Data nodes report just the delta, with the removed blocks explicitly marked as
such. <<BR>> ~-On startup the name node restores its state from a checkpoint.
The checkpoint stores information about files and their blocks, but not the block locations.
The locations are restored from the data node reports. That is why, at startup, data nodes
need to report complete lists of stored blocks. Subsequent reports do not need to contain
all blocks, just the ones that have been modified since the last report. <<BR>>
Each data node reports its blocks at one-hour intervals. In order to avoid traffic jams, the
name node receives reports from different data nodes at different randomized times. Thus,
on e.g. a 600-node cluster the name node receives 10 reports per minute, meaning that
block list validation happens 10 times a minute. We think it is important to minimize the
reporting data size, mostly from the point of view of the receiver. <<BR>> The
name node should have a means to request complete reports from data nodes, which is required
in case the name node restarts.-~ <<BR>> [[http://issues.apache.org/jira/browse/HDFS-395|HDFS-395]]
-  1. (Performance) '''Fine-grained name node synchronization.''' Rather than locking the
whole name node for every namespace update, the name node should have only a few synchronized
operations. These should be very efficient, performing no i/o and allocating few if any objects.
This synchronous name node kernel should be well-defined, so that developers are aware of
its boundaries.
+  1. (Performance) '''Fine-grained name node synchronization.''' Rather than locking the
whole name node for every namespace update, the name node should have only a few synchronized
operations. These should be very efficient, performing no i/o and allocating few if any objects.
This synchronous name node kernel should be well-defined, so that developers are aware of
its boundaries.<<BR>> [[http://issues.apache.org/jira/browse/HADOOP-814|HADOOP-814]]
__Done__.
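A minimal illustration of the "synchronized kernel" idea: expensive work stays outside the lock, and the lock guards only a short, i/o-free in-memory update. The class and method names here are hypothetical, not the actual name node code:

```java
import java.util.HashMap;
import java.util.Map;

// Hypothetical sketch: validation and any i/o happen outside the lock;
// the synchronized kernel is a tiny in-memory update with well-defined bounds.
public class Namespace {
    private final Map<String, Long> files = new HashMap<>();
    private final Object kernelLock = new Object();

    public boolean create(String path, long firstBlockId) {
        String normalized = path.trim();      // outside the lock: cheap here, but
        // ...validation, quota checks, edit-log i/o would also go here...
        synchronized (kernelLock) {           // kernel: no i/o, minimal allocation
            if (files.containsKey(normalized)) return false;
            files.put(normalized, firstBlockId);
            return true;
        }
    }

    public Long lookup(String path) {
        synchronized (kernelLock) {
            return files.get(path);
        }
    }
}
```

Keeping i/o out of the critical section bounds lock hold times, so many namespace updates can interleave without serializing behind disk writes.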
   1. (Performance) '''Compact name node''' data structure. In order to support large namespaces
the name node should represent internal data efficiently, which in particular means eliminating
redundant block mappings. <<BR>> ''Currently DFS supports the following block
mappings:''
    a. ''Block to data node map (FSNamesystem.blocksMap)''
    a. ''Data node to block map (FSNamesystem.datanodeMap)''
    a. ''INode to block map (INode.blocks)''
-   a. ''Block to INode map (FSDirectory.activeBlocks)''
+   a. ''Block to INode map (FSDirectory.activeBlocks)''<<BR>>[[http://issues.apache.org/jira/browse/HADOOP-1687|HADOOP-1687]]
__Done__.
   1. (Performance) Improved '''block allocation schema'''. <<BR>> ''Currently
DFS randomly selects nodes from the set of data nodes that can fit the required amount of
data (a block).'' <<BR>> Things we need:
-   a. Rack locality awareness. First replica is placed on the client’s local node, the
second replica is placed on a node in the same rack as the client, and all other replicas
are placed randomly on the nodes outside the rack.
+   a. Rack locality awareness. First replica is placed on the client’s local node, the
second replica is placed on a node in the same rack as the client, and all other replicas
are placed randomly on the nodes outside the rack.<<BR>>[[http://issues.apache.org/jira/browse/HADOOP-692|HADOOP-692]]
__Done__.<<BR>>''Current replication policy is to place the first replica on the
local node, to place the second replica on a remote rack, and to place the third replica on
the same rack as the second one.''
    a. Nodes with high disk usage should be avoided for block placement.
    a. Nodes with high workload should be avoided for block placement.
    a. Distinguish between fast and slow nodes (performance- and communication-wise).
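The replication policy quoted above (first replica on the local node, second on a remote rack, third on the same rack as the second) can be sketched as follows, assuming hypothetical `Node` objects carrying a name and a rack:

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical sketch of the quoted placement policy, not the actual
// HADOOP-692 code: local node, then remote rack, then the remote rack again.
public class ReplicaPlacement {
    public static class Node {
        public final String name, rack;
        public Node(String name, String rack) { this.name = name; this.rack = rack; }
    }

    // Choose up to three targets from the candidates (assumed to exclude none
    // of the racks we need; a real policy would also check load and space).
    public static List<Node> choose(Node writer, List<Node> candidates) {
        List<Node> targets = new ArrayList<>();
        targets.add(writer);                         // 1st: the writer's local node
        Node second = null;
        for (Node n : candidates) {                  // 2nd: any node on a remote rack
            if (!n.rack.equals(writer.rack)) { second = n; break; }
        }
        if (second == null) return targets;
        targets.add(second);
        for (Node n : candidates) {                  // 3rd: same rack as the 2nd,
            if (n.rack.equals(second.rack)           //      but a different node
                    && !n.name.equals(second.name)) {
                targets.add(n);
                break;
            }
        }
        return targets;
    }
}
```

Placing two replicas on one remote rack trades a little rack diversity for cheaper inter-rack write traffic, which is the tension the policy above resolves.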
-  1. (Performance) Equalize '''disk space usage''' between the nodes. The name node should
regularly analyze data node disk states, and re-replicate blocks if some of them are “unusually”
low or high on storage.
+  1. (Performance) Equalize '''disk space usage''' between the nodes. The name node should
regularly analyze data node disk states, and re-replicate blocks if some of them are “unusually”
low or high on storage.<<BR>>[[http://issues.apache.org/jira/browse/HADOOP-1652|HADOOP-1652]]
__Done__.
-  1. (Performance) Commands “open” and “list” should '''not return the entire list
of block locations''', but rather return a fixed number of initial blocks. Reads will fetch
the required block locations when and if necessary.
+  1. (Performance) Commands “open” and “list” should '''not return the entire list
of block locations''', but rather return a fixed number of initial blocks. Reads will fetch
the required block locations when and if necessary.<<BR>>[[http://issues.apache.org/jira/browse/HADOOP-894|HADOOP-894]]
__Done__.
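The windowed location-fetching idea can be sketched as follows; the class and method names are illustrative, not the actual HADOOP-894 API:

```java
import java.util.List;

// Hypothetical sketch: "open" hands the client only a fixed window of block
// locations, and the reader requests the next window when it runs past it.
public class LocatedBlocks {
    private final List<String> allLocations; // stands in for per-block locations
    private final int window;                // fixed number returned per request
    private int fetched;

    public LocatedBlocks(List<String> allLocations, int window) {
        this.allLocations = allLocations;
        this.window = window;
        this.fetched = 0;
    }

    // Fetch the next window of locations on demand, as a sequential read would.
    public List<String> fetchMore() {
        int end = Math.min(fetched + window, allLocations.size());
        List<String> page = allLocations.subList(fetched, end);
        fetched = end;
        return page;
    }
}
```

For a short sequential read of a huge file, only the first window is ever transferred, which is the saving this requirement is after.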
   1. (Performance) When the '''name node is busy''' it should reply to data nodes that they
should retry reporting their blocks later. In this case, the data nodes should retry
reporting sooner than the next regular report would occur.
   1. (Functionality) Implement '''atomic append''' command, also known as “record append”.
   1. (Functionality) Hadoop '''command shell''' should conform to the common shell conventions.
