hadoop-hdfs-issues mailing list archives

From "Kai Zheng (JIRA)" <j...@apache.org>
Subject [jira] [Commented] (HDFS-8430) Erasure coding: compute file checksum for stripe files
Date Mon, 18 Jan 2016 09:32:39 GMT

    [ https://issues.apache.org/jira/browse/HDFS-8430?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15105005#comment-15105005 ]

Kai Zheng commented on HDFS-8430:
---------------------------------

Status update

FileSystem:
* Added a new API {{getFileChecksum(String algorithm)}}, similar to the existing
{{getFileChecksum}} API, computing a file's checksum over all of its data or over a range
(a sketch follows this list).
* Added a new API {{supportChecksumAlgorithm(String algorithm)}} for querying whether a
given algorithm is supported.
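For concreteness, a rough sketch of how the two additions might look on {{FileSystem}}.
The method names are as listed above; the exact parameter lists, the range variant, and
reusing {{FileChecksum}} as the return type are still open, so treat this as illustration
only:

{code:java}
import java.io.IOException;
import org.apache.hadoop.fs.FileChecksum;
import org.apache.hadoop.fs.Path;

// Illustrative sketch only; parameter lists and return types are placeholders,
// not a committed API.
public abstract class FileSystemChecksumSketch {

  /** Proposed: like the existing getFileChecksum, but with a
      caller-chosen algorithm, over all of the file's data. */
  public abstract FileChecksum getFileChecksum(Path f, String algorithm)
      throws IOException;

  /** Proposed range variant: checksum only the first length bytes. */
  public abstract FileChecksum getFileChecksum(Path f, String algorithm,
      long length) throws IOException;

  /** Proposed: whether this file system supports the given algorithm. */
  public abstract boolean supportChecksumAlgorithm(String algorithm);
}
{code}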

Data transfer protocol:
* Added a new protocol method {{blockGroupChecksum(StripedBlockInfo blockGroupInfo, int mode,
BlockToken token)}} to calculate the MD5 aggregation result for a striped block group on
the DataNode side, serving both the old and new APIs (a sketch of the protocol surface
follows this list).
* Mode 1, for the old API: simply sum all the block checksum data in the group, block by
block, as if they were replicated blocks.
* Mode 2, for the new API: divide and sum all the block checksum data in striping/cell
terms.
* In both modes, if data blocks are missing, they are recovered on demand and their block
checksum data is recomputed. Nothing is stored; the recovered data is discarded after use.
The recovery logic shares the existing code in {{ErasureCodingWorker}} as much as possible,
via refactoring.
* Added a new protocol method {{rawBlockChecksum()}} to retrieve a block's whole raw
checksum (CRC32) data. For simplicity it fetches all the data in one pass; multiple passes
can be considered later. This method serves the new API, because a block group checksum
computer needs to gather all the block checksum data in the group in one place, so that it
can reorganize it into data stripes and compute the block group checksum the same way it
is computed for contiguous blocks.
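Roughly, the new protocol surface might look like the following sketch. The method and
parameter names are as listed above; the mode constants, the token type, and the byte[]
return shapes are placeholders for illustration:

{code:java}
import java.io.IOException;
import org.apache.hadoop.hdfs.protocol.ExtendedBlock;
import org.apache.hadoop.hdfs.security.token.block.BlockTokenIdentifier;
import org.apache.hadoop.security.token.Token;

// Illustrative sketch of the two proposed data transfer protocol methods.
public interface BlockGroupChecksumProtocolSketch {

  int MODE_OLD_API = 1; // sum per-block checksums as if replicated blocks
  int MODE_NEW_API = 2; // divide and sum checksums in striping/cell terms

  /** Placeholder for the striped block group descriptor named above. */
  class StripedBlockInfo { /* block group layout details elided */ }

  /** Compute the MD5 aggregation result for a striped block group. */
  byte[] blockGroupChecksum(StripedBlockInfo blockGroupInfo, int mode,
      Token<BlockTokenIdentifier> token) throws IOException;

  /** Fetch a block's whole raw CRC32 checksum data in a single pass, so
      the caller can reorganize it into stripes/cells. */
  byte[] rawBlockChecksum(ExtendedBlock block,
      Token<BlockTokenIdentifier> token) throws IOException;
}
{code}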

On the client side:
* Introduced {{ReplicatedFileChecksumComputer1}}, {{ReplicatedFileChecksumComputer2}},
{{StripedFileChecksumComputer1}} and {{StripedFileChecksumComputer2}}, sharing code among
them where possible and refactoring the related client-side code (see the sketch after
this list).
* {{ReplicatedFileChecksumComputer1}}: for the old API and replicated files, refactoring
and reusing the existing logic.
* {{ReplicatedFileChecksumComputer2}}: for the new API and replicated files, similar to
{{ReplicatedFileChecksumComputer1}} but cell-aware. The block in question must divide
exactly by the cell size; otherwise, a cell64k-like "algorithm not supported" exception
is raised.
* {{StripedFileChecksumComputer1}}: for the old API, summing all the block group checksum
data together, calling {{blockGroupChecksum}} with mode 1 for each block group.
* {{StripedFileChecksumComputer2}}: for the new API, summing all the block group checksum
data together, calling {{blockGroupChecksum}} with mode 2 for each block group.
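The intended split between the four computers can be pictured with a small sketch; the
class names are as above, while the shared interface and the selection logic are only
illustrative:

{code:java}
// Illustrative sketch of how the four client-side computers relate.
interface FileChecksumComputer { /* compute() and state elided */ }

class ReplicatedFileChecksumComputer1 implements FileChecksumComputer {} // old API
class ReplicatedFileChecksumComputer2 implements FileChecksumComputer {} // new API, cell-aware
class StripedFileChecksumComputer1 implements FileChecksumComputer {}    // old API, mode 1
class StripedFileChecksumComputer2 implements FileChecksumComputer {}    // new API, mode 2

class ChecksumComputerSelection {
  /** Pick a computer from the file layout and the API being served. */
  static FileChecksumComputer pick(boolean striped, boolean newApi) {
    if (striped) {
      return newApi ? new StripedFileChecksumComputer2()
                    : new StripedFileChecksumComputer1();
    }
    return newApi ? new ReplicatedFileChecksumComputer2()
                  : new ReplicatedFileChecksumComputer1();
  }
}
{code}

For {{ReplicatedFileChecksumComputer2}}, the cell-alignment requirement above would
additionally translate into checking that the block length is a multiple of the cell size
before any computing starts.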

On the DataNode side:
* Introduced {{BlockChecksumComputer}}, {{BlockGroupChecksumComputer1}} and
{{BlockGroupChecksumComputer2}}, sharing code among them where possible and refactoring
the related DataNode-side code (see the sketch after this list).
* {{BlockChecksumComputer}}: for the old API and replicated blocks, refactoring and
reusing the existing logic.
* {{BlockGroupChecksumComputer1}}: for the old API, summing all the block checksum data
in the group together, calling the existing {{blockChecksum()}} method of the data
transfer protocol for each block.
* {{BlockGroupChecksumComputer2}}: for the new API, summing all the stripe checksum data
in the group together, calling the new {{rawBlockChecksum()}} method of the data transfer
protocol for each block.
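The "summing together" in both group computers is an MD5 aggregation. A minimal
illustration follows; the method shape is only a placeholder:

{code:java}
import java.security.MessageDigest;
import java.security.NoSuchAlgorithmException;
import java.util.List;

// Illustrative aggregation step: fold each block's (or stripe's) checksum
// bytes into a single MD5 digest.
class BlockGroupMd5Aggregation {
  static byte[] aggregate(List<byte[]> memberChecksums)
      throws NoSuchAlgorithmException {
    MessageDigest md5 = MessageDigest.getInstance("MD5");
    for (byte[] checksumData : memberChecksums) {
      // Mode 1 feeds per-block data from blockChecksum(); mode 2 feeds
      // per-stripe data reorganized from rawBlockChecksum().
      md5.update(checksumData);
    }
    return md5.digest();
  }
}
{code}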

DistCp:
* TODO: will use the two newly added APIs to compute and compare checksums for the source
and target files (a usage sketch follows).
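A rough usage sketch of that plan. Since the two new APIs are not on {{FileSystem}} yet,
this codes against a stand-in interface carrying just the two methods:

{code:java}
import java.io.IOException;
import org.apache.hadoop.fs.FileChecksum;
import org.apache.hadoop.fs.Path;

// Stand-in for the proposed FileSystem additions; illustration only.
interface ChecksumCapableFs {
  boolean supportChecksumAlgorithm(String algorithm);
  FileChecksum getFileChecksum(Path f, String algorithm) throws IOException;
}

class DistCpChecksumCompare {
  /** Compare source and target checksums under a mutually supported algorithm. */
  static boolean sameChecksum(ChecksumCapableFs srcFs, Path src,
      ChecksumCapableFs dstFs, Path dst, String algorithm)
      throws IOException {
    if (!srcFs.supportChecksumAlgorithm(algorithm)
        || !dstFs.supportChecksumAlgorithm(algorithm)) {
      return false; // cannot compare with this algorithm
    }
    FileChecksum source = srcFs.getFileChecksum(src, algorithm);
    FileChecksum target = dstFs.getFileChecksum(dst, algorithm);
    return source != null && source.equals(target);
  }
}
{code}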

The code is still messy and leaves many blanks. I will attach a large patch for a first
look once the two APIs work as expected. As the breakdown above suggests, the function is
small in concept but gets big in implementation. I have very possibly missed some points,
so thanks for comments and suggestions, as always.

> Erasure coding: compute file checksum for stripe files
> ------------------------------------------------------
>
>                 Key: HDFS-8430
>                 URL: https://issues.apache.org/jira/browse/HDFS-8430
>             Project: Hadoop HDFS
>          Issue Type: Sub-task
>    Affects Versions: HDFS-7285
>            Reporter: Walter Su
>            Assignee: Kai Zheng
>         Attachments: HDFS-8430-poc1.patch
>
>
> HADOOP-3981 introduces a distributed file checksum algorithm. It's designed for replicated blocks.
> {{DFSClient.getFileChecksum()}} needs some updates so it can work for striped block groups.



