hadoop-hdfs-issues mailing list archives

From "Zhe Zhang (JIRA)" <j...@apache.org>
Subject [jira] [Commented] (HDFS-7285) Erasure Coding Support inside HDFS
Date Sat, 17 Jan 2015 01:29:40 GMT

    [ https://issues.apache.org/jira/browse/HDFS-7285?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14281111#comment-14281111 ]

Zhe Zhang commented on HDFS-7285:

bq. The unreserved ID space currently is used only by blocks. After the division, the IDs
for BlockGroup could possibly be already used by some existing blocks. How to upgrade the
cluster to fix this problem?
OK, I understand the issue now. Do you know starting from which version HDFS began using
sequentially generated (rather than random) block IDs? I guess we can still attempt
to allocate block group IDs from the second half of the unreserved ID space, checking
for conflicts on each allocation; if there is a conflict we move the pointer forward past the
conflicting ID. In that scenario, to tell whether a block is regular or striped, we need to parse
out the middle part of the ID and check whether it exists in the map of block groups.
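The allocation-with-conflict-check scheme above can be sketched as follows (the split point and class name are hypothetical, for illustration only, not a proposal for the actual implementation):

```python
# Hypothetical sketch: allocate block group IDs from the second half of
# the ID space and, on conflict with a legacy block ID, move the
# allocation pointer forward past the conflicting ID.

BLOCK_GROUP_ID_START = 1 << 62  # assumed split point, for illustration only

class BlockGroupIdAllocator:
    def __init__(self, existing_block_ids):
        self.existing = set(existing_block_ids)  # IDs of pre-existing blocks
        self.next_id = BLOCK_GROUP_ID_START

    def allocate(self):
        # Skip every ID already taken by a legacy block.
        while self.next_id in self.existing:
            self.next_id += 1
        allocated = self.next_id
        self.next_id += 1
        return allocated

alloc = BlockGroupIdAllocator({BLOCK_GROUP_ID_START, BLOCK_GROUP_ID_START + 1})
print(alloc.allocate())  # skips past the two conflicting legacy IDs
```

The cost of the conflict check is a set lookup per allocation, which only matters on clusters upgraded from random-ID versions.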

bq. I thought the stripe size is 1MB according to the figure in Data Striping Support in HDFS
Good catch. I need to update that design doc to match the 64KB default stripe cell size. BTW,
1MB is the I/O buffer size (see parameters _C_ and _B_ on page 8 of the [master design doc
| https://issues.apache.org/jira/secure/attachment/12687803/HDFSErasureCodingDesign-20141217.pdf]).
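For concreteness, the 64KB cell size determines how a byte offset in a striped file maps to a cell and a data node within a block group; a minimal sketch (the 6-data-block layout is illustrative, not a claim about the final schema):

```python
CELL_SIZE = 64 * 1024   # default stripe cell size per the design doc
DATA_BLOCKS = 6         # illustrative number of data blocks per group

def locate(offset):
    """Map a file byte offset to (stripe index, data node index)."""
    cell = offset // CELL_SIZE
    return cell // DATA_BLOCKS, cell % DATA_BLOCKS

# The first 64KB lands on node 0, the next 64KB on node 1, and so on;
# after DATA_BLOCKS cells the layout wraps around to the next stripe.
print(locate(0))                        # (0, 0)
print(locate(CELL_SIZE * DATA_BLOCKS))  # (1, 0)
```

The 1MB I/O buffer is orthogonal to this mapping: it only controls how many cells a client batches per round trip.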

bq. Anyway, small files (say < 10MB) may not be EC at all since the disk space saving is
not much and the namespace usage is increased significantly.
I agree it doesn't make much sense to encode files as small as a few MB. The current fsimage
analysis didn't further categorize files under 1 block. But my guess is they only contribute
a minor portion of space usage in most clusters. I'll try to run the analysis again to verify.
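To illustrate why very small files are poor EC candidates, here is a rough model of the 10+4 schema with 64KB cells, assuming (as a simplification) that even a partial stripe generates full parity cells:

```python
import math

CELL = 64 * 1024
K, M = 10, 4  # 10+4 Reed-Solomon, per the design discussion

def ec_bytes(file_size):
    # Data plus M full parity cells per (possibly partial) stripe.
    stripes = math.ceil(math.ceil(file_size / CELL) / K)
    return file_size + stripes * M * CELL

# A 64KB file stores 5x its size under EC (worse than 3-replica!),
# while a 1GB file approaches the ideal 1.4x overhead.
for size in (64 * 1024, 1 << 30):
    print(size, ec_bytes(size) / size)
```

Under this model the break-even point against 3-replica sits well below 10MB, but the namespace cost (one block group referencing up to K+M storage locations instead of one block times three replicas) remains the stronger argument against encoding small files.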

bq. BTW, the fsimage analysis is very nice. There are two more types of cost not covered,
CPU overhead and replication cost (re-constructing the EC blocks). How could we quantify them?
Thanks, and I really like the suggestion. CPU and I/O bandwidth usage are hard to simulate in
a simple analyzer. We'll make sure they're included in the real system test plan.

bq. The failure cases for write seem not yet covered – what happens if some datanodes fail
during a write?
I believe [~libo-intel] will share more details under HDFS-7545 soon. There is a range of
policies we can adopt, the strictest being to return an I/O error when any single target DN fails.
In a "smarter" policy, the application can keep writing until _m_ DNs fail, where _m_ equals
the number of parity blocks in the schema.
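The "smarter" policy could look roughly like the following (a sketch under the assumptions above; the class and method names are hypothetical, not an actual HDFS API):

```python
# Sketch: a striped write tolerates target DataNode failures until more
# than m of them have failed, where m is the number of parity blocks.

class StripedWriteFailureTracker:
    def __init__(self, num_parity_blocks):
        self.m = num_parity_blocks
        self.failed = set()

    def on_datanode_failure(self, datanode):
        self.failed.add(datanode)
        if len(self.failed) > self.m:
            # Beyond tolerance: surface an I/O error to the application.
            raise IOError("%d target DataNodes failed, schema tolerates %d"
                          % (len(self.failed), self.m))

tracker = StripedWriteFailureTracker(num_parity_blocks=4)  # e.g. a 10+4 schema
for dn in ("dn1", "dn2", "dn3", "dn4"):
    tracker.on_datanode_failure(dn)  # still within tolerance
```

Note that writing through failures leaves the finished block group with no remaining fault tolerance until recovery catches up, which is why the stricter policies remain useful.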

bq. Datanode failure may generate an EC storm – say a datanode has 40TB data, it requires
accessing 240TB of data to recover it. It is on the order of PB for a rack failure. How could
we solve this problem?
I think this is an inevitable challenge with EC. The best we can do is to schedule EC recovery
tasks together with {{UnderReplicatedBlocks}}, with appropriate priority settings. This way
blocks are recovered when the system is relatively idle. When lost blocks are accessed they
can be recovered on the fly, but traffic from online recovery should be much lower.
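The amplification behind the quoted numbers is simple to state: reconstructing one lost block reads k surviving blocks of the group, so losing a node multiplies its data volume by k. The quoted 240TB figure implies k = 6; under the 10+4 schema discussed here it would be 400TB (a back-of-the-envelope model, ignoring locality and partial reads):

```python
def recovery_read_tb(lost_tb, data_blocks_per_group):
    # Each lost block is rebuilt by reading data_blocks_per_group survivors.
    return lost_tb * data_blocks_per_group

print(recovery_read_tb(40, 6))   # 240 TB, matching the quoted figure
print(recovery_read_tb(40, 10))  # 400 TB under a 10+4 schema
```

This is exactly why recovery prioritization matters: the read amplification is fixed by the schema, so the only free variable is when the traffic is generated.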

> Erasure Coding Support inside HDFS
> ----------------------------------
>                 Key: HDFS-7285
>                 URL: https://issues.apache.org/jira/browse/HDFS-7285
>             Project: Hadoop HDFS
>          Issue Type: New Feature
>            Reporter: Weihua Jiang
>            Assignee: Zhe Zhang
>         Attachments: ECAnalyzer.py, ECParser.py, HDFSErasureCodingDesign-20141028.pdf,
HDFSErasureCodingDesign-20141217.pdf, fsimage-analysis-20150105.pdf
> Erasure Coding (EC) can greatly reduce storage overhead without sacrificing data
reliability, compared to the existing HDFS 3-replica approach. For example, if we use 10+4
Reed-Solomon coding, we can tolerate the loss of any 4 blocks with a storage overhead of only 40%.
This makes EC a quite attractive alternative for big data storage, particularly for cold data.

> Facebook had a related open-source project called HDFS-RAID. It used to be one of the
contrib packages in HDFS but was removed in Hadoop 2.0 for maintenance reasons. Its
drawbacks are: 1) it sits on top of HDFS and depends on MapReduce to do the encoding and decoding
tasks; 2) it can only be used for cold files that will not be appended to anymore;
3) the pure-Java EC coding implementation is extremely slow in practical use. For these reasons,
it might not be a good idea to simply bring HDFS-RAID back.
> We (Intel and Cloudera) are working on a design to build EC into HDFS that gets rid of
any external dependencies, making it self-contained and independently maintainable. This design
layers the EC feature on top of the storage-type support and considers compatibility with existing
HDFS features such as caching, snapshots, encryption, and high availability. It will also
support different EC coding schemes, implementations, and policies for different deployment
scenarios. By utilizing advanced libraries (e.g. the Intel ISA-L library), an implementation can
greatly improve the performance of EC encoding/decoding and make the EC solution even more
attractive. We will post the design document soon.

This message was sent by Atlassian JIRA
