hadoop-hdfs-issues mailing list archives

From "Kai Zheng (JIRA)" <j...@apache.org>
Subject [jira] [Commented] (HDFS-8430) Erasure coding: compute file checksum for striped files (stripe by stripe)
Date Fri, 17 Feb 2017 01:17:41 GMT

    [ https://issues.apache.org/jira/browse/HDFS-8430?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15870992#comment-15870992 ]

Kai Zheng commented on HDFS-8430:
---------------------------------

Hi Andrew,

Sorry for the late response.

Quite some time ago [~szetszwo] and I sorted out two approaches for this through a long discussion:
{quote}
First, add a new API like getFileChecksum(int cell) using the New Algorithm 2. With this
API, users can compare a replicated file with a striped file: if the file contents are the
same, the file checksums will be the same. This version may incur larger network traffic,
as it needs to collect cells on the client side for the computation.

Second, still change the existing API getFileChecksum() (no args) for striped files, using
an algorithm specific to striped files but similar to the existing one for replicated
files. No CRC data will be collected centrally, so it does not involve the larger network
traffic that the new API does. Since the block layouts are different, the results will
differ if it is used to compare a striped file against a replicated file. It can be used to
compare two files that have the same layout, either replicated or striped.
{quote}
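To make the layout dependence in the second approach concrete, here is a toy sketch (plain Java, not HDFS code; the cell size, block count, and method names are all illustrative). Twelve bytes of "file" data are arranged two ways: contiguously, as in a replicated block, and round-robin across three data blocks in two-byte cells, as in a striped block group. Checksumming bytes in physical block order gives a layout-dependent value, while reading the cells back in logical order recovers the original checksum:

```java
import java.util.zip.CRC32;

public class LayoutDemo {
    // Illustrative parameters, much smaller than real HDFS cells/blocks.
    static final int CELL = 2, DATA_BLOCKS = 3;

    static long crcOf(byte[] data) {
        CRC32 c = new CRC32();
        c.update(data);
        return c.getValue();
    }

    // Rearrange logical bytes into striped physical order:
    // cell k goes to data block (k % DATA_BLOCKS).
    static byte[] stripe(byte[] logical) {
        byte[] out = new byte[logical.length];
        int cells = logical.length / CELL, pos = 0;
        for (int b = 0; b < DATA_BLOCKS; b++) {
            for (int cell = b; cell < cells; cell += DATA_BLOCKS) {
                System.arraycopy(logical, cell * CELL, out, pos, CELL);
                pos += CELL;
            }
        }
        return out;
    }

    // Inverse: read cells back from the striped layout in logical order.
    static byte[] unstripe(byte[] striped) {
        byte[] out = new byte[striped.length];
        int cells = striped.length / CELL, pos = 0;
        for (int b = 0; b < DATA_BLOCKS; b++) {
            for (int cell = b; cell < cells; cell += DATA_BLOCKS) {
                System.arraycopy(striped, pos, out, cell * CELL, CELL);
                pos += CELL;
            }
        }
        return out;
    }

    public static void main(String[] args) {
        byte[] logical = new byte[12];
        for (int i = 0; i < logical.length; i++) {
            logical[i] = (byte) (i * 31 + 7);
        }
        byte[] striped = stripe(logical);
        // Physical byte order differs between the two layouts, so a
        // checksum over physical order is layout-dependent.
        System.out.println("physical-order CRCs equal:   "
            + (crcOf(logical) == crcOf(striped)));
        // Reassembling cells into logical order recovers the same bytes,
        // hence the same checksum.
        System.out.println("logical-order CRC recovered: "
            + (crcOf(logical) == crcOf(unstripe(striped))));
    }
}
```

This is why the first approach pulls cells to the client: only by composing CRCs in logical cell order can the result be independent of whether the file is replicated or striped.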

The sub-tasks HDFS-9694 and HDFS-9833 implemented the {{2nd}} approach, enhancing the
existing API getFileChecksum() (no args) to support striped files. It can be used to compare
two files that have the same layout, either replicated or striped. I think this is good
enough so far, for example for the distcp usage.
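The distcp-style check this enables can be sketched as follows (plain Java stand-ins, not the real Hadoop classes; in HDFS the actual types are FileSystem.getFileChecksum() and FileChecksum, and the algorithm strings below are only illustrative). The key point is that two checksums are comparable only when their algorithm names match, since the name encodes the checksum composition and hence the layout:

```java
import java.util.Arrays;

public class ChecksumCompare {
    // Minimal stand-in for Hadoop's FileChecksum: an algorithm name plus
    // the checksum bytes.
    static final class FileChecksumStub {
        final String algorithm;  // encodes how the checksum was composed
        final byte[] bytes;
        FileChecksumStub(String algorithm, byte[] bytes) {
            this.algorithm = algorithm;
            this.bytes = bytes;
        }
    }

    // True only when the two checksums are comparable (same composition)
    // and their bytes match.
    static boolean sameContents(FileChecksumStub a, FileChecksumStub b) {
        return a.algorithm.equals(b.algorithm)
            && Arrays.equals(a.bytes, b.bytes);
    }

    public static void main(String[] args) {
        FileChecksumStub a =
            new FileChecksumStub("MD5-of-xMD5-of-yCRC32C", new byte[] {1, 2, 3});
        FileChecksumStub b =
            new FileChecksumStub("STRIPED-COMPOSITE", new byte[] {1, 2, 3});
        // Same bytes but different composition: not comparable, not equal.
        System.out.println(sameContents(a, b));
        // Same composition and same bytes: equal.
        System.out.println(sameContents(a,
            new FileChecksumStub("MD5-of-xMD5-of-yCRC32C", new byte[] {1, 2, 3})));
    }
}
```

Under the {{2nd}} approach both files must be striped, or both replicated, for the algorithm names to line up; a replicated-vs-striped comparison simply reports "not comparable" rather than a false mismatch.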

The {{1st}} approach can be used to compare a replicated file against a striped file. It needs
non-trivial development work and also involves heavy network traffic to centrally compute an
aggregate checksum result for a block group. IMO, we could keep it pending until there is an
explicit user requirement for the target behavior ({{compare a replicated file against a striped file}}).

So, what are your thoughts?

> Erasure coding: compute file checksum for striped files (stripe by stripe)
> --------------------------------------------------------------------------
>
>                 Key: HDFS-8430
>                 URL: https://issues.apache.org/jira/browse/HDFS-8430
>             Project: Hadoop HDFS
>          Issue Type: Sub-task
>          Components: erasure-coding
>    Affects Versions: HDFS-7285
>            Reporter: Walter Su
>            Assignee: Kai Zheng
>            Priority: Blocker
>              Labels: hdfs-ec-3.0-must-do
>         Attachments: HDFS-8430-poc1.patch
>
>
> HADOOP-3981 introduces a distributed file checksum algorithm. It's designed for replicated blocks.
> {{DFSClient.getFileChecksum()}} needs some updates so that it can work for striped block groups.



--
This message was sent by Atlassian JIRA
(v6.3.15#6346)

---------------------------------------------------------------------
To unsubscribe, e-mail: hdfs-issues-unsubscribe@hadoop.apache.org
For additional commands, e-mail: hdfs-issues-help@hadoop.apache.org

