hadoop-hdfs-issues mailing list archives

From "Tsz Wo Nicholas Sze (JIRA)" <j...@apache.org>
Subject [jira] [Commented] (HDFS-7285) Erasure Coding Support inside HDFS
Date Mon, 10 Aug 2015 22:15:53 GMT

    [ https://issues.apache.org/jira/browse/HDFS-7285?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14680864#comment-14680864 ]

Tsz Wo Nicholas Sze commented on HDFS-7285:

> A rebase workflow gets very difficult without the ability to squash patches, since you
> would otherwise spend a lot of time fixing conflicts in intermediate commits that don't even
> show up in HEAD. We have 180 some commits in the EC branch now, and is definitely at the "very
> difficult" stage of rebasing.

If that is true, choosing the git rebase workflow seems to have been a mistake.  We should
probably switch to a git merge workflow then.

> Moving refactors from branches to trunk is standard practice we've done many times before
> and is something I recommended we do here too.

The standard practice is to commit a refactoring patch to trunk first and then use git to
merge it into the branch, not the other way around, and not by using separate patches.
Committing similar code to different branches through different patches makes the merge hard,
since the branch and trunk will end up with many conflicts as a result.

Moreover, if separate patches are needed, we usually ask the original contributors to provide
a merge patch for committing to the other branch, so that the original contributors, rather
than a different contributor, get the credit.  Suppose an original contributor A did the hard
work of coming up with patches that were committed only to the development branch, and another
contributor B did the easy work of posting similar patches for committing to trunk.  After the
development branch is merged to trunk, no one will care about the development branch, and all
the contributions in trunk will be attributed to B rather than to the original contributor A.
Although B may not intend it, it does look as if B is stealing the credit from A.  Do you
think that is a problem, [~andrew.wang]?

> Erasure Coding Support inside HDFS
> ----------------------------------
>                 Key: HDFS-7285
>                 URL: https://issues.apache.org/jira/browse/HDFS-7285
>             Project: Hadoop HDFS
>          Issue Type: New Feature
>            Reporter: Weihua Jiang
>            Assignee: Zhe Zhang
>         Attachments: Consolidated-20150707.patch, Consolidated-20150806.patch, Consolidated-20150810.patch,
> ECAnalyzer.py, ECParser.py, HDFS-7285-initial-PoC.patch, HDFS-7285-merge-consolidated-01.patch,
> HDFS-7285-merge-consolidated-trunk-01.patch, HDFS-7285-merge-consolidated.trunk.03.patch,
> HDFS-7285-merge-consolidated.trunk.04.patch, HDFS-EC-Merge-PoC-20150624.patch, HDFS-EC-merge-consolidated-01.patch,
> HDFS-bistriped.patch, HDFSErasureCodingDesign-20141028.pdf, HDFSErasureCodingDesign-20141217.pdf,
> HDFSErasureCodingDesign-20150204.pdf, HDFSErasureCodingDesign-20150206.pdf, HDFSErasureCodingPhaseITestPlan.pdf,
> Erasure Coding (EC) can greatly reduce storage overhead without sacrificing data
> reliability, compared to the existing HDFS 3-replica approach. For example, with 10+4
> Reed-Solomon coding, we can tolerate the loss of 4 blocks with a storage overhead of only 40%.
> This makes EC quite an attractive alternative for big data storage, particularly for cold data.
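
The arithmetic behind the 40% figure can be checked with a short sketch. This is only an
illustration: the function names `storage_overhead` and `replication_overhead` are
hypothetical and not part of HDFS, and overhead is assumed to mean extra storage divided by
the size of the original data.

```python
from fractions import Fraction


def storage_overhead(data_units: int, parity_units: int) -> Fraction:
    """Extra storage as a fraction of the original data size (assumed definition)."""
    if data_units <= 0 or parity_units < 0:
        raise ValueError("data_units must be positive and parity_units non-negative")
    return Fraction(parity_units, data_units)


def replication_overhead(replicas: int) -> Fraction:
    """Extra storage for plain replication: every copy beyond the first is overhead."""
    if replicas < 1:
        raise ValueError("replicas must be at least 1")
    return Fraction(replicas - 1, 1)


# 10 data blocks + 4 parity blocks: any 4 of the 14 blocks may be lost.
rs = storage_overhead(10, 4)
# 3 replicas: 2 of the 3 copies may be lost.
rep = replication_overhead(3)

print(f"RS(10+4) overhead:  {float(rs):.0%}")
print(f"3-replica overhead: {float(rep):.0%}")
```

Under this definition, the 3-replica approach carries 200% overhead while tolerating 2 lost
copies, whereas the 10+4 scheme carries 40% while tolerating 4 lost blocks.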

> Facebook had a related open source project called HDFS-RAID. It used to be one of the
> contrib packages in HDFS but has been removed since Hadoop 2.0 for maintenance reasons. Its
> drawbacks are: 1) it sits on top of HDFS and depends on MapReduce to do encoding and decoding
> tasks; 2) it can only be used for cold files that will no longer be appended to;
> 3) the pure Java EC coding implementation is extremely slow in practical use. For these
> reasons, it might not be a good idea to simply bring HDFS-RAID back.
> We (Intel and Cloudera) are working on a design to build EC into HDFS that removes
> all external dependencies, making it self-contained and independently maintained. This design
> layers the EC feature on top of the storage type support and takes compatibility with existing
> HDFS features such as caching, snapshots, encryption, and high availability into account. The
> design will also support different EC coding schemes, implementations and policies for
> different deployment scenarios. By utilizing advanced libraries (e.g. the Intel ISA-L library),
> an implementation can greatly improve the performance of EC encoding/decoding and make the EC
> solution even more attractive. We will post the design document soon.

This message was sent by Atlassian JIRA
