hadoop-hdfs-issues mailing list archives

From "Rakesh R (JIRA)" <j...@apache.org>
Subject [jira] [Commented] (HDFS-9833) Erasure coding: recomputing block checksum on the fly by reconstructing the missed/corrupt block data
Date Tue, 03 May 2016 07:13:13 GMT

    [ https://issues.apache.org/jira/browse/HDFS-9833?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15268251#comment-15268251 ]

Rakesh R commented on HDFS-9833:

Below is a brief outline of the proposed approach. Kindly go through it; it would be
great to see feedback on this. Thanks!

In our existing striped checksum logic, the client connects to the first datanode in the
block locations and sends the {{Op.BLOCK_GROUP_CHECKSUM}} command. That datanode then iterates
over {{ecPolicy.getNumDataUnits()}} datanodes, invoking the {{Op.BLOCK_CHECKSUM}} command on
each one. Any {{IOException}} hit during these operations fails the whole checksum call.
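As a rough sketch of this fail-fast behavior (the interface and method names below are illustrative stand-ins, not the actual Hadoop classes):

```java
import java.io.IOException;
import java.util.zip.CRC32;

// Hypothetical sketch of the current flow: per-block checksums are collected
// from ecPolicy.getNumDataUnits() sources one by one, and a single
// IOException aborts the whole block-group checksum call.
public class StripedChecksumFlow {

    /** Stand-in for one datanode answering Op.BLOCK_CHECKSUM. */
    interface BlockChecksumSource {
        long blockChecksum() throws IOException;
    }

    /** Combine per-block checksums; fails fast on the first datanode error. */
    static long blockGroupChecksum(BlockChecksumSource[] dataUnits) throws IOException {
        CRC32 combined = new CRC32();
        for (BlockChecksumSource dn : dataUnits) {
            long c = dn.blockChecksum();   // may throw IOException
            combined.update((int) c);      // toy combination, not Hadoop's MD5-of-CRCs
        }
        return combined.getValue();
    }

    public static void main(String[] args) {
        BlockChecksumSource healthy = () -> 42L;
        BlockChecksumSource broken = () -> { throw new IOException("datanode down"); };
        try {
            blockGroupChecksum(new BlockChecksumSource[]{healthy, broken, healthy});
            System.out.println("unexpected success");
        } catch (IOException e) {
            // One failed datanode fails the entire call.
            System.out.println("checksum call failed: " + e.getMessage());
        }
    }
}
```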

To begin with, I think we will catch the generic {{IOException}} while performing the operation
on a datanode. The block corresponding to the failed datanode will be chosen for reconstruction,
and the checksum will then be recomputed from the reconstructed block data.
# Datanode side changes:
If an {{IOException}} occurs while performing the {{Op.BLOCK_CHECKSUM}} command, the datanode
will consider that block for reconstruction and calculate its checksum from the reconstructed
data. Reconstruction errors, however, will still fail the checksum call.
# Client side changes:
Presently the {{FileChecksumHelper#checksumBlockGroup()}} function throws the {{IOException}}
back to the client if the first datanode has errors; instead, the client will try connecting
to up to {{#getNumParityUnits()}} datanodes before failing the checksum operation.
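The client-side failover above could be sketched as follows; {{checksumWithFailover}}, {{Datanode}}, and the retry bound are illustrative stand-ins, not actual HDFS APIs:

```java
import java.io.IOException;

// Hedged sketch of the proposed client-side change: rather than surfacing the
// first datanode's IOException, try up to numParityUnits alternative datanodes
// before giving up on the checksum operation.
public class ChecksumRetryDemo {

    /** Stand-in for a datanode that can serve Op.BLOCK_GROUP_CHECKSUM. */
    interface Datanode {
        long blockGroupChecksum() throws IOException;
    }

    static long checksumWithFailover(Datanode[] candidates, int numParityUnits)
            throws IOException {
        IOException last = null;
        // One initial attempt plus up to numParityUnits failovers,
        // bounded by how many candidate datanodes we actually have.
        int attempts = Math.min(candidates.length, numParityUnits + 1);
        for (int i = 0; i < attempts; i++) {
            try {
                return candidates[i].blockGroupChecksum();
            } catch (IOException e) {
                last = e;   // remember the error and try the next datanode
            }
        }
        if (last == null) {
            last = new IOException("no datanodes to try");
        }
        throw last;
    }

    public static void main(String[] args) throws IOException {
        Datanode down = () -> { throw new IOException("connection refused"); };
        Datanode alive = () -> 12345L;
        // First datanode fails; with one parity unit we fail over to the second.
        System.out.println(checksumWithFailover(new Datanode[]{down, alive}, 1)); // prints 12345
    }
}
```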

Thanks [~umamaheswararao] for the offline discussions.

> Erasure coding: recomputing block checksum on the fly by reconstructing the missed/corrupt
block data
> -----------------------------------------------------------------------------------------------------
>                 Key: HDFS-9833
>                 URL: https://issues.apache.org/jira/browse/HDFS-9833
>             Project: Hadoop HDFS
>          Issue Type: Sub-task
>            Reporter: Kai Zheng
>            Assignee: Rakesh R
>              Labels: hdfs-ec-3.0-must-do
> As discussed in HDFS-8430 and HDFS-9694, to compute a striped file checksum even when some
> of the striped blocks are missing, we need to consider recomputing the block checksum on the
> fly for the missed/corrupt blocks. To recompute the block checksum, the block data needs to
> be reconstructed by erasure decoding, and the main code needed for the block reconstruction
> could be borrowed from HDFS-9719, the refactoring of the existing {{ErasureCodingWorker}}.
> In the EC worker, reconstructed blocks need to be written out to target datanodes, but in
> this case the remote write isn't necessary, as the reconstructed block data is only used to
> recompute the checksum.
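For illustration only, here is a toy version of that decode-then-checksum-locally idea using a single XOR parity unit (HDFS uses a Reed-Solomon codec, but the shape is the same: reconstruct the lost unit from the survivors, checksum it in memory, write nothing out):

```java
import java.util.zip.CRC32;

// Toy illustration (single XOR parity, not Hadoop's Reed-Solomon codec):
// decode the missing block from the surviving units, then compute its
// checksum locally without writing the reconstructed data anywhere.
public class ReconstructChecksumDemo {

    /** XOR-decode one unit from the surviving data and parity units. */
    static byte[] reconstruct(byte[][] survivors) {
        byte[] out = new byte[survivors[0].length];
        for (byte[] unit : survivors) {
            for (int i = 0; i < out.length; i++) {
                out[i] ^= unit[i];
            }
        }
        return out;
    }

    static long crc(byte[] block) {
        CRC32 c = new CRC32();
        c.update(block);
        return c.getValue();
    }

    public static void main(String[] args) {
        byte[] d0 = {1, 2, 3}, d1 = {4, 5, 6};
        byte[] parity = reconstruct(new byte[][]{d0, d1}); // XOR parity of d0, d1
        // Suppose d1 is lost: decode it from d0 and the parity, then checksum it.
        byte[] rebuilt = reconstruct(new byte[][]{d0, parity});
        System.out.println(crc(rebuilt) == crc(d1)); // prints true
    }
}
```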

This message was sent by Atlassian JIRA

To unsubscribe, e-mail: hdfs-issues-unsubscribe@hadoop.apache.org
For additional commands, e-mail: hdfs-issues-help@hadoop.apache.org
