hadoop-hdfs-user mailing list archives

From Elliot West <tea...@gmail.com>
Subject Re: DistCp CRC failure modes
Date Wed, 27 Apr 2016 18:40:54 GMT
I've raised this as an issue:

https://issues.apache.org/jira/browse/HDFS-10338

On Wednesday, 27 April 2016, Elliot West <teabot@gmail.com> wrote:

> Hello,
>
> We are using DistCp V2 to replicate data between two HDFS file systems. We
> were working on the assumption that we could rely on CRC checks to ensure
> that the data was replicated correctly. However, after examining the DistCp
> source code it seems that there are edge cases where the CRCs could differ
> and yet the copy succeeds even when we are not skipping CRC checks.
>
> I'm wondering whether this is by design and if so, the reasoning behind
> it? If this is a bug, I'd like to raise an issue to fix it. If it is by
> design, I'd like to propose the introduction of an option for stricter CRC
> checks.
>
> The code in question is contained in the method:
>
> org.apache.hadoop.tools.util.DistCpUtils#checksumsAreEqual(...)
>
> which can be seen here:
>
>
> https://github.com/apache/hadoop/blob/release-2.7.1/hadoop-tools/hadoop-distcp/src/main/java/org/apache/hadoop/tools/util/DistCpUtils.java#L457
>
>
> Specifically, this code block suggests that if there is a failure when
> trying to read the source or target checksum then the method will return
> 'true', implying that the check succeeded. In actual fact we just failed to
> obtain the checksum and could perform no check.
>
>     try {
>       sourceChecksum = sourceChecksum != null ? sourceChecksum : sourceFS
>           .getFileChecksum(source);
>       targetChecksum = targetFS.getFileChecksum(target);
>     } catch (IOException e) {
>       LOG.error("Unable to retrieve checksum for " + source + " or " + target, e);
>     }
>     return (sourceChecksum == null || targetChecksum == null ||
>             sourceChecksum.equals(targetChecksum));
>
> Ideally I'd like to be able to configure a check where we require that
> both the source and target CRCs are retrieved and compared, and if for any
> reason either of the CRC retrievals fails, then an exception is thrown. I do
> appreciate that some FileSystems cannot return CRCs but these could still
> be handled correctly as they would simply return null and not throw an
> exception (I assume).
>
> I'd appreciate any thoughts on this matter.
>
> Elliot.
>
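
For illustration, here is a minimal, self-contained Java sketch contrasting the lenient behaviour quoted above with the stricter check being proposed. It deliberately avoids the Hadoop API so that it runs on its own: ChecksumSource, lenientEqual and strictEqual are hypothetical names, with ChecksumSource standing in for FileSystem#getFileChecksum(Path). It also follows the assumption stated in the message that a file system without checksum support returns null rather than throwing. Treat it as a sketch under those assumptions, not as a patch for DistCpUtils.

```java
import java.io.IOException;
import java.util.Arrays;

public class StrictChecksumCheck {

  /** Hypothetical stand-in for FileSystem#getFileChecksum(Path). */
  interface ChecksumSource {
    // Returns null when no checksum is available; throws when the lookup fails.
    byte[] checksumOf(String path) throws IOException;
  }

  /** Mirrors the quoted logic: a failed lookup or a null checksum counts as a match. */
  static boolean lenientEqual(ChecksumSource src, String source,
                              ChecksumSource dst, String target) {
    byte[] sourceChecksum = null;
    byte[] targetChecksum = null;
    try {
      sourceChecksum = src.checksumOf(source);
      targetChecksum = dst.checksumOf(target);
    } catch (IOException e) {
      // The original logs the error and carries on with null checksums.
    }
    return sourceChecksum == null || targetChecksum == null
        || Arrays.equals(sourceChecksum, targetChecksum);
  }

  /** Strict variant: a failed lookup propagates instead of passing as "equal". */
  static boolean strictEqual(ChecksumSource src, String source,
                             ChecksumSource dst, String target) throws IOException {
    byte[] sourceChecksum = src.checksumOf(source);
    byte[] targetChecksum = dst.checksumOf(target);
    if (sourceChecksum == null || targetChecksum == null) {
      // Assumption: null means "this file system has no checksums", not an error.
      return true;
    }
    return Arrays.equals(sourceChecksum, targetChecksum);
  }

  public static void main(String[] args) {
    ChecksumSource healthy = path -> new byte[] {1, 2, 3};
    ChecksumSource failing = path -> {
      throw new IOException("simulated checksum failure for " + path);
    };

    // The lenient check reports success although the target was never verified.
    System.out.println("lenient: " + lenientEqual(healthy, "/src/f", failing, "/dst/f"));

    try {
      strictEqual(healthy, "/src/f", failing, "/dst/f");
    } catch (IOException e) {
      System.out.println("strict: " + e.getMessage());
    }
  }
}
```

With the simulated failure on the target side, lenientEqual returns true while strictEqual throws, which is the gap described in the message. Whether a null checksum should also fail under a strict option is a separate design decision that this sketch leaves open.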
