hadoop-hdfs-issues mailing list archives

From "ASF GitHub Bot (JIRA)" <j...@apache.org>
Subject [jira] [Commented] (HDFS-13056) Expose file-level composite CRCs in HDFS which are comparable across different instances/layouts
Date Wed, 21 Feb 2018 01:19:00 GMT

    [ https://issues.apache.org/jira/browse/HDFS-13056?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16370820#comment-16370820 ]

ASF GitHub Bot commented on HDFS-13056:
---------------------------------------

GitHub user dennishuo opened a pull request:

    https://github.com/apache/hadoop/pull/344

    HDFS-13056. Add support for a new COMPOSITE_CRC FileChecksum which is comparable between
    different block layouts and between striped/replicated files

    

You can merge this pull request into a Git repository by running:

    $ git pull https://github.com/dennishuo/hadoop add-composite-crc32

Alternatively you can review and apply these changes as the patch at:

    https://github.com/apache/hadoop/pull/344.patch

To close this pull request, make a commit to your master/trunk branch
with (at least) the following in the commit message:

    This closes #344
    
----
commit de06097fa2f4c511d5a107d997c7dfa5862ada82
Author: Dennis Huo <dhuo@...>
Date:   2018-01-24T23:04:29Z

    Add support for a new COMPOSITE_CRC FileChecksum.
    
    Adds new file-level ChecksumCombineMode options settable through config and
    lower-level BlockChecksumOptions to indicate block-checksum types supported by
    both blockChecksum and blockGroupChecksum in DataTransferProtocol.
    
    CRCs are composed such that they are agnostic to block/chunk/cell layout and
    thus can be compared between replicated-files and striped-files of
    different underlying blocksize, bytes-per-crc, and cellSize settings.
    
    Does not alter default behavior, and doesn't touch the data-read or
    data-write paths at all.

commit 3f8fd5ef9da8c312f60430622d3c95f80cb1fde2
Author: Dennis Huo <dhuo@...>
Date:   2018-02-08T00:21:14Z

    Fix byte-length property for CRC FileChecksum

commit 1a326e38505bacd6b40a682668f36c2aa1047f86
Author: Dennis Huo <dhuo@...>
Date:   2018-02-19T02:53:03Z

    Add unittest for CrcUtil.
    
    Minor optimization by starting multiplier at x^8 and fix the behavior of
    composing a zero-length crcB.

commit d7c2bc739f3cff0d8ae72bb4f2a940eb5b733279
Author: Dennis Huo <dhuo@...>
Date:   2018-02-20T00:47:50Z

    Refactor StripedBlockChecksumReconstructor for easier reuse with COMPOSITE_CRC.
    
    Update BlockChecksumHelper's CRC composition to use the same data buffer
    used in the MD5 case, and factor out shared logic from the
    StripedBlockChecksumReconstructor into an abstract base class so that
    reconstruction logic can be shared between MD5CRC and COMPOSITE_CRC.

commit ac38f404f1d15c9846f58acf297c7e242c3f8bce
Author: Dennis Huo <dhuo@...>
Date:   2018-02-20T03:05:41Z

    Extract a helper class CrcComposer.
    
    Encapsulate all the CRC internals such as tracking the CRC polynomial,
    precomputing the monomial, etc., into this class so that BlockChecksumHelper
    and FileChecksumHelper only need to interact with the clean interfaces
    of CrcComposer.

commit 8f7b9fd6f93c8358dd0c4899e41d2a993bcc6294
Author: Dennis Huo <dhuo@...>
Date:   2018-02-20T03:40:33Z

    Add StripedBlockChecksumCompositeCrcReconstructor.
    
    Wire it in to BlockChecksumHelper and use CrcComposer to regenerate
    striped composite CRCs for missing EC data blocks.

commit fd2fc3408346aeb177eaeda50919995ee3c02cab
Author: Dennis Huo <dhuo@...>
Date:   2018-02-20T21:56:07Z

    Add end-to-end test coverage for COMPOSITE_CRC.
    
    Extract hooks in TestFileChecksum to allow a subclass to share core
    tests while modifying expectations of a subset of tests; add
    TestFileChecksumCompositeCrc which extends TestFileChecksum to
    apply the same test suite to COMPOSITE_CRC, and add a test case
    for comparing two replicated files with different block sizes.
    Test confirms that MD5CRC will yield different checksums
    between replicated vs striped, and two replicated files with
    different block sizes, while COMPOSITE_CRC yields the same
    checksum for all cases.

commit 5cd2d08f2be672e79d931ebb6f89541f38334f0b
Author: Dennis Huo <dhuo@...>
Date:   2018-02-20T23:44:11Z

    Add unittest for CrcComposer.
    
    Fix a bug in handling byte-array updates with nonzero offset.

commit e65248b077d4e1ad00888112de877afed86dad03
Author: Dennis Huo <dhuo@...>
Date:   2018-02-21T00:08:05Z

    Remove STRIPED_CRC as a BlockChecksumType.
    
    Refactor to just use stripeLength with COMPOSITE_CRC, where non-striped
    COMPOSITE_CRC is just an edge case where stripeLength is longer than the
    data range.

commit c2a7701246c07a4906d7540d6bc496364239dafc
Author: Dennis Huo <dhuo@...>
Date:   2018-02-21T01:02:08Z

    Support file-attribute propagation of bytePerCrc in CompositeCrcFileChecksum.
    
    Additionally, fix up remaining TODOs; add wrappers for late-evaluating
    hex format of CRCs to pass into debug statements and clean up logging
    logic.

----


> Expose file-level composite CRCs in HDFS which are comparable across different instances/layouts
> ------------------------------------------------------------------------------------------------
>
>                 Key: HDFS-13056
>                 URL: https://issues.apache.org/jira/browse/HDFS-13056
>             Project: Hadoop HDFS
>          Issue Type: New Feature
>          Components: datanode, distcp, erasure-coding, federation, hdfs
>    Affects Versions: 3.0.0
>            Reporter: Dennis Huo
>            Priority: Major
>         Attachments: HDFS-13056-branch-2.8.001.patch, HDFS-13056-branch-2.8.poc1.patch,
HDFS-13056.001.patch, Reference_only_zhen_PPOC_hadoop2.6.X.diff, hdfs-file-composite-crc32-v1.pdf,
hdfs-file-composite-crc32-v2.pdf, hdfs-file-composite-crc32-v3.pdf
>
>
> FileChecksum was first introduced in [https://issues-test.apache.org/jira/browse/HADOOP-3981] and
ever since then has remained defined as MD5-of-MD5-of-CRC, where per-512-byte chunk CRCs are
already stored as part of datanode metadata, and the MD5 approach is used to compute an aggregate
value in a distributed manner, with individual datanodes computing the MD5-of-CRCs per-block
in parallel, and the HDFS client computing the second-level MD5.
>  
> A shortcoming of this approach which is often brought up is the fact that this FileChecksum
is sensitive to the internal block-size and chunk-size configuration, and thus different
HDFS files with different block/chunk settings cannot be compared. More commonly, one might
have different HDFS clusters which use different block sizes, in which case any data migration
won't be able to use the FileChecksum for distcp's rsync functionality or for verifying end-to-end
data integrity (on top of low-level data integrity checks applied at data transfer time).
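
To illustrate the layout sensitivity described above, here is a minimal Java sketch of an MD5-of-MD5-of-CRC computation. It is a simplified model, not the HDFS implementation: the class name Md5Md5CrcSketch and method fileChecksum are hypothetical, it uses java.util.zip.CRC32 for the chunk CRCs, and it does not reproduce the serialization HDFS uses on the wire. It only shows that changing the block size or the bytes-per-CRC changes which bytes are fed into each MD5, and therefore changes the final digest even when the file contents are identical.

```java
import java.nio.ByteBuffer;
import java.security.MessageDigest;
import java.security.NoSuchAlgorithmException;
import java.util.Arrays;
import java.util.zip.CRC32;

// Hypothetical, simplified model of a hierarchical MD5-of-MD5-of-CRC checksum.
final class Md5Md5CrcSketch {

  // Per chunk: CRC32. Per block: MD5 over that block's chunk CRCs.
  // Per file: MD5 over the block-level MD5 digests.
  static byte[] fileChecksum(byte[] data, int bytesPerCrc, int blockSize) {
    try {
      MessageDigest fileMd5 = MessageDigest.getInstance("MD5");
      for (int blockStart = 0; blockStart < data.length; blockStart += blockSize) {
        int blockEnd = Math.min(data.length, blockStart + blockSize);
        MessageDigest blockMd5 = MessageDigest.getInstance("MD5");
        for (int off = blockStart; off < blockEnd; off += bytesPerCrc) {
          int len = Math.min(bytesPerCrc, blockEnd - off);
          CRC32 crc = new CRC32();
          crc.update(data, off, len);
          // Assumption: each chunk CRC is fed in as 4 big-endian bytes.
          blockMd5.update(ByteBuffer.allocate(4).putInt((int) crc.getValue()).array());
        }
        fileMd5.update(blockMd5.digest());
      }
      return fileMd5.digest();
    } catch (NoSuchAlgorithmException e) {
      throw new IllegalStateException("MD5 should be available on every Java platform", e);
    }
  }

  public static void main(String[] args) {
    byte[] data = new byte[4096];
    for (int i = 0; i < data.length; i++) {
      data[i] = (byte) (i * 31 + 7);
    }
    byte[] a = fileChecksum(data, 512, 2048);
    byte[] b = fileChecksum(data, 512, 1024); // same bytes, smaller blocks
    System.out.println("same checksum across block sizes: " + Arrays.equals(a, b));
  }
}
```

In this model the two calls in main hash the same 4096 bytes, but the first groups the chunk CRCs into two blocks and the second into four, so the file-level digests are computed over different inputs.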
>  
> This was also revisited in https://issues.apache.org/jira/browse/HDFS-8430 during the
addition of checksum support for striped erasure-coded files; while there was some discussion
of using CRC composability, it ultimately settled on a hierarchical MD5 approach, which also adds
the problem that checksums of basic replicated files are not comparable to striped files.
>  
> This feature proposes to add a "COMPOSITE-CRC" FileChecksum type which uses CRC composition
to remain completely chunk/block agnostic, and allows comparison between striped vs replicated
files, between different HDFS instances, and possibly even between HDFS and other external
storage systems. This feature can also be added in-place to be compatible with existing block
metadata, and doesn't need to change the normal path of chunk verification, so is minimally
invasive. This also means even large preexisting HDFS deployments could adopt this feature
to retroactively sync data. A detailed design document can be found here: https://storage.googleapis.com/dennishuo/hdfs-file-composite-crc32-v1.pdf
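
As a rough illustration of why CRC composition is chunk/block agnostic, the Java sketch below combines per-chunk CRCs into the CRC of the whole byte range. It is not the code from the pull request: the class CrcComposeSketch and its methods are hypothetical names, it uses java.util.zip.CRC32 rather than whichever CRC type a given HDFS file is configured with, and it uses the GF(2) matrix technique known from zlib's crc32_combine instead of the monomial-based approach the commit messages mention. Under those assumptions, composing chunk CRCs of any chunk size gives the same value as a single CRC over all the data.

```java
import java.util.zip.CRC32;

// Hypothetical sketch of CRC32 composition (zlib crc32_combine style).
final class CrcComposeSketch {
  private static final int GF2_DIM = 32;

  // Multiply a 32x32 GF(2) matrix (one int per column) by a 32-bit vector.
  static int matrixTimes(int[] mat, int vec) {
    int sum = 0;
    for (int i = 0; vec != 0; i++, vec >>>= 1) {
      if ((vec & 1) != 0) {
        sum ^= mat[i];
      }
    }
    return sum;
  }

  static void matrixSquare(int[] square, int[] mat) {
    for (int n = 0; n < GF2_DIM; n++) {
      square[n] = matrixTimes(mat, mat[n]);
    }
  }

  // Returns CRC(A || B) given crcA = CRC(A), crcB = CRC(B) and the length of B in bytes.
  static int compose(int crcA, int crcB, long lenB) {
    if (lenB <= 0) {
      return crcA; // appending nothing leaves the CRC unchanged
    }
    int[] even = new int[GF2_DIM];
    int[] odd = new int[GF2_DIM];
    odd[0] = 0xEDB88320; // reflected CRC-32 polynomial: operator for one zero bit
    int row = 1;
    for (int n = 1; n < GF2_DIM; n++) {
      odd[n] = row;
      row <<= 1;
    }
    matrixSquare(even, odd); // operator for two zero bits
    matrixSquare(odd, even); // operator for four zero bits
    // Apply "append lenB zero bytes" to crcA by repeated squaring.
    do {
      matrixSquare(even, odd); // first pass: operator for one zero byte
      if ((lenB & 1) != 0) {
        crcA = matrixTimes(even, crcA);
      }
      lenB >>= 1;
      if (lenB == 0) {
        break;
      }
      matrixSquare(odd, even);
      if ((lenB & 1) != 0) {
        crcA = matrixTimes(odd, crcA);
      }
      lenB >>= 1;
    } while (lenB != 0);
    return crcA ^ crcB;
  }

  static int crc32(byte[] data, int off, int len) {
    CRC32 crc = new CRC32();
    crc.update(data, off, len);
    return (int) crc.getValue();
  }

  // Fold per-chunk CRCs left to right; the CRC of empty input is 0.
  static int compositeCrc(byte[] data, int chunkSize) {
    int result = 0;
    for (int off = 0; off < data.length; off += chunkSize) {
      int len = Math.min(chunkSize, data.length - off);
      result = compose(result, crc32(data, off, len), len);
    }
    return result;
  }

  public static void main(String[] args) {
    byte[] data = new byte[5000];
    for (int i = 0; i < data.length; i++) {
      data[i] = (byte) (i * 31 + 7);
    }
    System.out.println(Integer.toHexString(crc32(data, 0, data.length)));
    System.out.println(Integer.toHexString(compositeCrc(data, 512)));
    System.out.println(Integer.toHexString(compositeCrc(data, 64)));
  }
}
```

Because compose only needs the two CRC values and the length of the second range, the per-chunk CRCs can be computed independently (for example on different datanodes) and folded together afterwards, which is the property the proposal relies on.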



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

---------------------------------------------------------------------
To unsubscribe, e-mail: hdfs-issues-unsubscribe@hadoop.apache.org
For additional commands, e-mail: hdfs-issues-help@hadoop.apache.org

