hadoop-common-commits mailing list archives

Site index · List index
Message view « Date » · « Thread »
Top « Date » · « Thread »
From w...@apache.org
Subject [25/50] [abbrv] hadoop git commit: HDFS-9088. Cleanup erasure coding documentation. Contributed by Andrew Wang.
Date Wed, 30 Sep 2015 15:42:25 GMT
HDFS-9088. Cleanup erasure coding documentation. Contributed by Andrew Wang.

Change-Id: Ic3ec1f29fef0e27c46fff66fd28a51f8c4c61e71


Project: http://git-wip-us.apache.org/repos/asf/hadoop/repo
Commit: http://git-wip-us.apache.org/repos/asf/hadoop/commit/e36129b6
Tree: http://git-wip-us.apache.org/repos/asf/hadoop/tree/e36129b6
Diff: http://git-wip-us.apache.org/repos/asf/hadoop/diff/e36129b6

Branch: refs/heads/trunk
Commit: e36129b61abd9edbdd77e053a5e2bfdad434d164
Parents: ced438a
Author: Zhe Zhang <zhezhang@cloudera.com>
Authored: Thu Sep 17 09:56:32 2015 -0700
Committer: Zhe Zhang <zhezhang@cloudera.com>
Committed: Thu Sep 17 09:56:32 2015 -0700

----------------------------------------------------------------------
 .../hadoop-hdfs/CHANGES-HDFS-EC-7285.txt        |   2 +
 .../src/site/markdown/HDFSErasureCoding.md      | 123 +++++++++----------
 2 files changed, 57 insertions(+), 68 deletions(-)
----------------------------------------------------------------------


http://git-wip-us.apache.org/repos/asf/hadoop/blob/e36129b6/hadoop-hdfs-project/hadoop-hdfs/CHANGES-HDFS-EC-7285.txt
----------------------------------------------------------------------
diff --git a/hadoop-hdfs-project/hadoop-hdfs/CHANGES-HDFS-EC-7285.txt b/hadoop-hdfs-project/hadoop-hdfs/CHANGES-HDFS-EC-7285.txt
index acf62cb..0345a54 100755
--- a/hadoop-hdfs-project/hadoop-hdfs/CHANGES-HDFS-EC-7285.txt
+++ b/hadoop-hdfs-project/hadoop-hdfs/CHANGES-HDFS-EC-7285.txt
@@ -427,3 +427,5 @@
 
     HDFS-8899. Erasure Coding: use threadpool for EC recovery tasks on DataNode.
     (Rakesh R via zhz)
+
+    HDFS-9088. Cleanup erasure coding documentation. (wang via zhz)

http://git-wip-us.apache.org/repos/asf/hadoop/blob/e36129b6/hadoop-hdfs-project/hadoop-hdfs/src/site/markdown/HDFSErasureCoding.md
----------------------------------------------------------------------
diff --git a/hadoop-hdfs-project/hadoop-hdfs/src/site/markdown/HDFSErasureCoding.md b/hadoop-hdfs-project/hadoop-hdfs/src/site/markdown/HDFSErasureCoding.md
index 44c209e..3040bf5 100644
--- a/hadoop-hdfs-project/hadoop-hdfs/src/site/markdown/HDFSErasureCoding.md
+++ b/hadoop-hdfs-project/hadoop-hdfs/src/site/markdown/HDFSErasureCoding.md
@@ -19,108 +19,95 @@ HDFS Erasure Coding
     * [Purpose](#Purpose)
     * [Background](#Background)
     * [Architecture](#Architecture)
-    * [Hardware resources](#Hardware_resources)
     * [Deployment](#Deployment)
-        * [Configuration details](#Configuration_details)
-        * [Deployment details](#Deployment_details)
+        * [Cluster and hardware configuration](#Cluster_and_hardware_configuration)
+        * [Configuration keys](#Configuration_keys)
         * [Administrative commands](#Administrative_commands)
 
 Purpose
 -------
-  Replication is expensive -- the default 3x replication scheme has 200% overhead in storage
space and other resources (e.g., network bandwidth).
-  However, for “warm” and “cold” datasets with relatively low I/O activities, secondary
block replicas are rarely accessed during normal operations, but still consume the same amount
of resources as the primary ones.
+  Replication is expensive -- the default 3x replication scheme in HDFS has 200% overhead
in storage space and other resources (e.g., network bandwidth).
+  However, for warm and cold datasets with relatively low I/O activities, additional block
replicas are rarely accessed during normal operations, but still consume the same amount of
resources as the first replica.
 
-  Therefore, a natural improvement is to use Erasure Coding (EC) in place of replication,
which provides the same level of fault tolerance with much less storage space. In typical
Erasure Coding(EC) setups, the storage overhead is ≤ 50%.
+  Therefore, a natural improvement is to use Erasure Coding (EC) in place of replication,
which provides the same level of fault-tolerance with much less storage space. In typical
Erasure Coding (EC) setups, the storage overhead is no more than 50%.
 
 Background
 ----------
 
   In storage systems, the most notable usage of EC is Redundant Array of Inexpensive Disks
(RAID). RAID implements EC through striping, which divides logically sequential data (such
as a file) into smaller units (such as bit, byte, or block) and stores consecutive units on
different disks. In the rest of this guide this unit of striping distribution is termed a
striping cell (or cell). For each stripe of original data cells, a certain number of parity
cells are calculated and stored -- the process of which is called encoding. The error on any
striping cell can be recovered through decoding calculation based on surviving data and parity
cells.
 
-  Integrating the EC function with HDFS could get storage efficient deployments. It can provide
similar data tolerance as traditional HDFS replication based deployments but it stores only
one original replica data and parity cells.
-  In a typical case, A file with 6 blocks will actually be consume space of 6*3 = 18 blocks
with replication factor 3. But with EC (6 data,3 parity) deployment, it will only consume
space of 9 blocks.
+  Integrating EC with HDFS can improve storage efficiency while still providing similar data
durability as traditional replication-based HDFS deployments.
+  As an example, a 3x replicated file with 6 blocks will consume 6*3 = 18 blocks of disk
space. But with EC (6 data, 3 parity) deployment, it will only consume 9 blocks of disk space.
 
 Architecture
 ------------
-  In the context of EC, striping has several critical advantages. First, it enables online
EC which bypasses the conversion phase and immediately saves storage space. Online EC also
enhances sequential I/O performance by leveraging multiple disk spindles in parallel; this
is especially desirable in clusters with high end networking  . Second, it naturally distributes
a small file to multiple DataNodes and eliminates the need to bundle multiple files into a
single coding group. This greatly simplifies file operations such as deletion, quota reporting,
and migration between federated namespaces.
+  In the context of EC, striping has several critical advantages. First, it enables online
EC (writing data immediately in EC format), avoiding a conversion phase and immediately saving
storage space. Online EC also enhances sequential I/O performance by leveraging multiple disk
spindles in parallel; this is especially desirable in clusters with high end networking. Second,
it naturally distributes a small file to multiple DataNodes and eliminates the need to bundle
multiple files into a single coding group. This greatly simplifies file operations such as
deletion, quota reporting, and migration between federated namespaces.
 
-  As in general HDFS clusters, small files could account for over 3/4 of total storage consumption.
So, In this first phase of erasure coding work, HDFS supports striping model. In the near
future, HDFS will supports contiguous layout as second second phase work. So this guide focuses
more on striping model EC.
+  In typical HDFS clusters, small files can account for over 3/4 of total storage consumption.
To better support small files, in this first phase of work HDFS supports EC with striping.
In the future, HDFS will also support a contiguous EC layout. See the design doc and discussion
on [HDFS-7285](https://issues.apache.org/jira/browse/HDFS-7285) for more information.
 
- *  **NameNode Extensions** - Under the striping layout, a HDFS file is logically composed
of block groups, each of which contains a certain number of   internal blocks.
-   To eliminate the need for NameNode to monitor all internal blocks, a new hierarchical
block naming protocol is introduced, where the ID of a block group can be inferred from any
of its internal blocks. This allows each block group to be managed as a new type of BlockInfo
named BlockInfoStriped, which tracks its own internal blocks by attaching an index to each
replica location.
+ *  **NameNode Extensions** - Striped HDFS files are logically composed of block groups,
each of which contains a certain number of internal blocks.
+    To reduce NameNode memory consumption from these additional blocks, a new hierarchical
block naming protocol was introduced. The ID of a block group can be inferred from the ID
of any of its internal blocks. This allows management at the level of the block group rather
than the block.
 
- *  **Client Extensions** - The basic principle behind the extensions is to allow the client
node to work on multiple internal blocks in a block group in
-    parallel.
+ *  **Client Extensions** - The client read and write paths were enhanced to work on multiple
internal blocks in a block group in parallel.
     On the output / write path, DFSStripedOutputStream manages a set of data streamers, one
for each DataNode storing an internal block in the current block group. The streamers mostly
     work asynchronously. A coordinator takes charge of operations on the entire block group,
including ending the current block group, allocating a new block group, and so forth.
     On the input / read path, DFSStripedInputStream translates a requested logical byte range
of data as ranges into internal blocks stored on DataNodes. It then issues read requests in
     parallel. Upon failures, it issues additional read requests for decoding.
 
- *  **DataNode Extensions** - ErasureCodingWorker(ECWorker) is for reconstructing erased
erasure coding blocks and runs along with the Datanode process. Erased block details would
have been found out by Namenode ReplicationMonitor thread and sent to Datanode via its heartbeat
responses as discussed in the previous sections. For each reconstruction task,
-   i.e. ReconstructAndTransferBlock, it will start an internal daemon thread that performs
3 key tasks:
+ *  **DataNode Extensions** - The DataNode runs an additional ErasureCodingWorker (ECWorker)
task for background recovery of failed erasure coded blocks. Failed EC blocks are detected
by the NameNode, which then chooses a DataNode to do the recovery work. The recovery task
is passed as a heartbeat response. This process is similar to how replicated blocks are re-replicated
on failure. Reconstruction performs three key tasks:
 
-      _1.Read the data from source nodes:_ For reading the data blocks from different source
nodes, it uses a dedicated thread pool.
-         The thread pool is initialized when ErasureCodingWorker initializes. Based on the
EC policy, it schedules the read requests to all source targets and ensures only to read
-         minimum required input blocks for reconstruction.
+      1. _Read the data from source nodes:_ Input data is read in parallel from source nodes
using a dedicated thread pool.
+        Based on the EC policy, it schedules the read requests to all source targets and
reads only the minimum number of input blocks for reconstruction.
 
-      _2.Decode the data and generate the output data:_ Actual decoding/encoding is done
by using RawErasureEncoder API currently.
-        All the erased data and/or parity blocks will be recovered together.
+      1. _Decode the data and generate the output data:_ New data and parity blocks are decoded
from the input data. All missing data and parity blocks are decoded together.
 
-     _3.Transfer the generated data blocks to target nodes:_ Once decoding is finished, it
will encapsulate the output data to packets and send them to
-        target Datanodes.
-   To accommodate heterogeneous workloads, we allow files and directories in an HDFS cluster
to have different replication and EC policies.
-*   **ErasureCodingPolicy**
-    Information on how to encode/decode a file is encapsulated in an ErasureCodingPolicy
class. Each policy is defined by the following 2 pieces of information:
-    _1.The ECScema: This includes the numbers of data and parity blocks in an EC group (e.g.,
6+3), as well as the codec algorithm (e.g., Reed-Solomon).
+      1. _Transfer the generated data blocks to target nodes:_ Once decoding is finished,
the recovered blocks are transferred to target DataNodes.
 
-    _2.The size of a striping cell.
+ *  **ErasureCoding policy**
+    To accommodate heterogeneous workloads, we allow files and directories in an HDFS cluster
to have different replication and EC policies.
+    Information on how to encode/decode a file is encapsulated in an ErasureCodingPolicy
class. Each policy is defined by the following 2 pieces of information:
 
-   Client and Datanode uses EC codec framework directly for doing the endoing/decoding work.
+      1. _The ECSchema:_ This includes the numbers of data and parity blocks in an EC group
(e.g., 6+3), as well as the codec algorithm (e.g., Reed-Solomon).
 
- *  **Erasure Codec Framework**
-     We support a generic EC framework which allows system users to define, configure, and
deploy multiple coding schemas such as conventional Reed-Solomon, HitchHicker and
-     so forth.
-     ErasureCoder is provided to encode or decode for a block group in the middle level,
and RawErasureCoder is provided to perform the concrete algorithm calculation in the low level.
ErasureCoder can
-     combine and make use of different RawErasureCoders for tradeoff. We abstracted coder
type, data blocks size, parity blocks size into ECSchema. A default system schema using RS
(6, 3) is built-in.
-     For the system default codec Reed-Solomon we implemented both RSRawErasureCoder in pure
Java and NativeRawErasureCoder based on Intel ISA-L. Below is the performance
-     comparing for different coding chunk size. We can see that the native coder can outperform
the Java coder by up to 35X.
+      1. _The size of a striping cell._ This determines the granularity of striped reads
and writes, including buffer sizes and encoding work.
 
-     _Intel® Storage Acceleration-Library(Intel® ISA-L)_ ISA-L is an Open Source Version
and is a collection of low-level functions used in storage applications.
-     The open source version contains fast erasure codes that implement a general Reed-Solomon
type encoding for blocks of data that helps protect against
-     erasure of whole blocks. The general ISA-L library contains an expanded set of functions
used for data protection, hashing, encryption, etc. By
-     leveraging instruction sets like SSE, AVX and AVX2, the erasure coding functions are
much optimized and outperform greatly on IA platforms. ISA-L
-     supports Linux, Windows and other platforms as well. Additionally, it also supports
incremental coding so applications don’t have to wait all source
-     blocks to be available before to perform the coding, which can be used in HDFS.
+    Currently, HDFS supports the Reed-Solomon and XOR erasure coding algorithms. Additional
algorithms are planned as future work.
+    The system default scheme is Reed-Solomon (6, 3) with a cell size of 64KB.
 
-Hardware resources
-------------------
-  For using EC feature, you need to prepare for the following.
-    Depending on the ECSchemas used, we need to have minimum number of Datanodes available
in the cluster. Example if we use ReedSolomon(6, 3) ECSchema,
-    then minimum nodes required is 9 to succeed the write. It can tolerate up to 3 failures.
 
 Deployment
 ----------
 
-### Configuration details
+### Cluster and hardware configuration
+
+  Erasure coding places additional demands on the cluster in terms of CPU and network.
 
-  In the EC feature, raw coders are configurable. So, users need to decide the RawCoder algorithms.
-  Configure the customized algorithms with configuration key "*io.erasurecode.codecs*".
+  Encoding and decoding work consumes additional CPU on both HDFS clients and DataNodes.
 
-  Default Reed-Solomon based raw coders available in built, which can be configured by using
the configuration key "*io.erasurecode.codec.rs.rawcoder*".
-  And also another default raw coder available if XOR based raw coder. Which could be configured
by using "*io.erasurecode.codec.xor.rawcoder*"
+  Erasure coded files are also spread across racks for rack fault-tolerance.
+  This means that when reading and writing striped files, most operations are off-rack.
+  Network bisection bandwidth is thus very important.
 
-  _EarasureCodingWorker Confugurations:_
-    dfs.datanode.stripedread.threshold.millis - Threshold time for polling timeout for read
service. Default value is 5000
-    dfs.datanode.stripedread.threads – Number striped read thread pool threads. Default
value is 20
-    dfs.datanode.stripedread.buffer.size - Buffer size for reader service. Default value
is 256 * 1024
+  For rack fault-tolerance, it is also important to have at least as many racks as the configured
EC stripe width.
+  For the default EC policy of RS (6,3), this means minimally 9 racks, and ideally 10 or
11 to handle planned and unplanned outages.
+  For clusters with fewer racks than the stripe width, HDFS cannot maintain rack fault-tolerance,
but will still attempt
+  to spread a striped file across multiple nodes to preserve node-level fault-tolerance.
 
-### Deployment details
+### Configuration keys
 
-  With the striping model, client machine is responsible for do the EC endoing and tranferring
data to the datanodes.
-  So, EC with striping model expects client machines with hghg end configurations especially
of CPU and network.
+  The codec implementation for Reed-Solomon and XOR can be configured with the following
client and DataNode configuration keys:
+  `io.erasurecode.codec.rs.rawcoder` and `io.erasurecode.codec.xor.rawcoder`.
+  The default implementations for both of these codecs are pure Java.
+
+  Erasure coding background recovery work on the DataNodes can also be tuned via the following
configuration parameters:
+
+  1. `dfs.datanode.stripedread.threshold.millis` - Timeout for striped reads. Default value
is 5000 ms.
+  1. `dfs.datanode.stripedread.threads` - Number of concurrent reader threads. Default value
is 20 threads.
+  1. `dfs.datanode.stripedread.buffer.size` - Buffer size for reader service. Default value
is 256KB.
 
 ### Administrative commands
- ErasureCoding command-line is provided to perform administrative commands related to ErasureCoding.
This can be accessed by executing the following command.
+
+  HDFS provides an `erasurecode` subcommand to perform administrative commands related to
erasure coding.
 
        hdfs erasurecode [generic options]
          [-setPolicy [-s <policyName>] <path>]
@@ -131,18 +118,18 @@ Deployment
 
 Below are the details about each command.
 
-*  **SetPolicy command**: `[-setPolicy [-s <policyName>] <path>]`
+ *  `[-setPolicy [-s <policyName>] <path>]`
 
-    SetPolicy command is used to set an ErasureCoding policy on a directory at the specified
path.
+    Sets an ErasureCoding policy on a directory at the specified path.
 
-      `path`: Refer to a pre-created directory in HDFS. This is a mandatory parameter.
+      `path`: An directory in HDFS. This is a mandatory parameter. Setting a policy only
affects newly created files, and does not affect existing files.
 
-      `policyName`: This is an optional parameter, specified using ‘-s’ flag. Refer to
the name of ErasureCodingPolicy to be used for encoding files under this directory. If not
specified the system default ErasureCodingPolicy will be used.
+      `policyName`: The ErasureCoding policy to be used for files under this directory. This
is an optional parameter, specified using ‘-s’ flag. If no policy is specified, the system
default ErasureCodingPolicy will be used.
 
-*  **GetPolicy command**: `[-getPolicy <path>]`
+ *  `[-getPolicy <path>]`
 
-     GetPolicy command is used to get details of the ErasureCoding policy of a file or directory
at the specified path.
+     Get details of the ErasureCoding policy of a file or directory at the specified path.
 
-*  **ListPolicies command**:  `[-listPolicies]`
+ *  `[-listPolicies]`
 
-     Lists all supported ErasureCoding policies. For setPolicy command, one of these policies'
name should be provided.
\ No newline at end of file
+     Lists all supported ErasureCoding policies. These names are suitable for use with the
`setPolicy` command.


Mime
View raw message