hadoop-common-commits mailing list archives

From Apache Wiki <wikidi...@apache.org>
Subject [Lucene-hadoop Wiki] Update of "DFS requirements" by KonstantinShvachko
Date Fri, 14 Jul 2006 01:15:17 GMT
Dear Wiki user,

You have subscribed to a wiki page or wiki category on "Lucene-hadoop Wiki" for change notification.

The following page has been changed by KonstantinShvachko:
http://wiki.apache.org/lucene-hadoop/DFS_requirements

------------------------------------------------------------------------------
   14. (Specification) Define '''invariants for read and append''' commands: a formalization
of the DFS consistency model with its underlying assumptions and the resulting guarantees.
(An illustrative sketch of one such invariant follows this item.)
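The following is a minimal sketch of what one executable invariant might look like, assuming
a FileSystem append API of the kind Hadoop later acquired; the prefix-consistency property and
the checkAppendInvariant helper are illustrations, not the formal model this item asks for.

{{{
import org.apache.hadoop.fs.FSDataOutputStream;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

// Illustration only: one candidate invariant, expressed as a runnable check.
// Assumes the FileSystem append API Hadoop later acquired; the property
// asserted here (acknowledged bytes stay visible) is a hypothetical example,
// not the formal consistency model the requirement asks for.
static void checkAppendInvariant(FileSystem fs, Path p, byte[] data) throws Exception {
  long before = fs.getFileStatus(p).getLen();
  try (FSDataOutputStream out = fs.append(p)) {
    out.write(data);                    // bytes are acknowledged once close() succeeds
  }
  long after = fs.getFileStatus(p).getLen();
  if (after < before + data.length) {  // acknowledged bytes must be visible to later reads
    throw new AssertionError("append invariant violated");
  }
}
}}}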
   15. (Performance) Checksum data should not be stored as a separate DFS '''crc-file''',
but rather maintained by each data node per locally stored block copy. This will reduce name
node operations and improve read data locality for maps. (A per-block validation sketch
follows this item.)
      a. '''CRC scanning'''. We should dedicate up to 1% of the disk bandwidth on a data node
to reading back the blocks and validating their CRCs. The results should be logged and reported
in the DFS UI.
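A minimal sketch of such per-block validation on a data node, assuming a hypothetical on-disk
layout in which each block file has a sibling metadata file holding its stored CRC; the
validateBlock helper and the sibling-file layout are invented for illustration.

{{{
import java.io.IOException;
import java.io.InputStream;
import java.nio.file.Files;
import java.nio.file.Path;
import java.util.zip.CRC32;

// Sketch: a data node validates a locally stored block copy against a
// checksum kept beside it, rather than in a cluster-wide DFS crc-file.
// The block-file / meta-file sibling layout is a hypothetical choice.
static boolean validateBlock(Path blockFile, Path metaFile) throws IOException {
  CRC32 crc = new CRC32();
  try (InputStream in = Files.newInputStream(blockFile)) {
    byte[] buf = new byte[64 * 1024];
    for (int n; (n = in.read(buf)) > 0; ) {
      crc.update(buf, 0, n);            // checksum the block contents
    }
  }
  long stored = Long.parseLong(Files.readString(metaFile).trim());
  return stored == crc.getValue();      // mismatch => corrupt replica, report it
}
}}}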
-  16. (Performance) DFS should operate with '''constant size file blocks'''. [[BR]] ''Currently
DFS internally assumes that blocks of the same file can have different sizes. In practice
all of them except the last one have the same size. The code could be optimized if this
assumption is removed.''
+  16. (Performance) DFS should operate with '''constant size file blocks'''. [[BR]] ''Currently
DFS internally assumes that blocks of the same file can have different sizes. In practice
all of them except the last one have the same size. The code could be optimized if this
assumption is removed.'' [[BR]] ~-Each block can be of any size up to the file's fixed
block size. The DFS client provides an API to report gaps, and/or an API option to skip gaps
or read them as NULLs. The reporting is done at the data node level, allowing us to remove all
the block-size data and logic from the name node. (A client-side sketch follows this item.)-~
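To make the gap-reading option concrete, here is a hypothetical client-side sketch; GapMode,
readBlockTail, and every other name are invented for illustration and are not a proposed API.

{{{
import java.io.IOException;
import java.io.InputStream;
import java.util.Arrays;

// Hypothetical sketch: a block may be shorter than the file's fixed block
// size, leaving a "gap" at its tail; the client can skip the gap or read
// it as zero (NULL) bytes. All names here are invented for illustration.
enum GapMode { SKIP_GAPS, GAPS_AS_NULLS }

static int readBlockTail(InputStream blockData, long fixedBlockSize,
                         long bytesAlreadyRead, byte[] buf, GapMode mode)
    throws IOException {
  int n = blockData.read(buf);
  if (n < 0) n = 0;                          // physical block ended: the gap begins here
  if (mode == GapMode.GAPS_AS_NULLS && n < buf.length) {
    long gap = Math.max(0, fixedBlockSize - bytesAlreadyRead - n);  // logical bytes left
    int pad = (int) Math.min(gap, (long) (buf.length - n));
    Arrays.fill(buf, n, n + pad, (byte) 0);  // surface the gap as NUL bytes
    n += pad;
  }
  return n;                                  // with SKIP_GAPS the caller just moves on
}
}}}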
   17. (Performance) Client writes should '''flush directly to DFS''', based on the buffer
size set at creation of the stream, rather than collecting data in a temporary file on a local
disk. (A rough buffering sketch follows this item.)
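A rough sketch of that buffering behaviour, where the invented DirectDfsOutputStream pushes to
the write pipeline whenever its creation-time buffer fills; remoteOut stands in for the real
data node pipeline stream.

{{{
import java.io.IOException;
import java.io.OutputStream;

// Sketch: buffer client writes in memory and flush straight to the DFS
// write pipeline when the buffer (sized at stream creation) fills, instead
// of staging the whole output in a local temporary file. The class name and
// 'remoteOut' (standing in for the data node pipeline) are illustrative.
class DirectDfsOutputStream extends OutputStream {
  private final OutputStream remoteOut;
  private final byte[] buf;
  private int count;

  DirectDfsOutputStream(OutputStream remoteOut, int bufferSize) {
    this.remoteOut = remoteOut;
    this.buf = new byte[bufferSize];
  }

  @Override public void write(int b) throws IOException {
    if (count == buf.length) flush();   // buffer full: push to the pipeline now
    buf[count++] = (byte) b;
  }

  @Override public void flush() throws IOException {
    remoteOut.write(buf, 0, count);     // hand buffered bytes to DFS directly
    count = 0;
    remoteOut.flush();
  }

  @Override public void close() throws IOException {
    flush();                            // push any tail bytes before closing
    remoteOut.close();
  }
}
}}}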
   18. (Performance) Currently '''data nodes report''' the entire list of stored blocks to
the name node once an hour. Most of this information is redundant, and processing large
reports reduces the name node's availability for application tasks. [[BR]] Possible solutions:

      a. Data nodes report only a portion of their blocks per report (e.g. 20%, or a portion
bounded by the total size of the transmitted data), but proportionally more often (e.g. 5
times as frequently). (A selection sketch follows this item.)
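One way to pick a stable slice per round, assuming block IDs are available and roughly evenly
distributed; blocksForRound and the modulo scheme are invented for illustration.

{{{
import java.util.ArrayList;
import java.util.List;

// Sketch for 18a: report 1/parts of the stored blocks per round, parts
// times as often, so a full cycle still covers every block exactly once.
// Selecting by blockId modulo keeps each round's slice stable; the method
// name and the modulo scheme are invented for illustration.
static List<Long> blocksForRound(List<Long> allBlockIds, int round, int parts) {
  List<Long> slice = new ArrayList<>();
  for (long id : allBlockIds) {
    if (Math.floorMod(id, (long) parts) == round % parts) {
      slice.add(id);                    // e.g. parts = 5 => report ~20% per round
    }
  }
  return slice;
}
}}}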
