hadoop-common-issues mailing list archives

From "Aaron Fabbri (JIRA)" <j...@apache.org>
Subject [jira] [Commented] (HADOOP-13282) S3 blob etags to be made visible in status/getFileChecksum() calls
Date Thu, 21 Dec 2017 00:24:00 GMT

    [ https://issues.apache.org/jira/browse/HADOOP-13282?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16299308#comment-16299308 ]

Aaron Fabbri commented on HADOOP-13282:

Nice.  This looks good.  I see that checksum equality depends on the algorithm being the same
as well as the checksum bytes.  It looks like this does the right thing with multiple filesystems.

Related question: I noticed you don't override {{FileChecksum#getChecksumOpt()}}.  The only
use I found was in {{hadoop.tools.mapred.RetriableFileCopyCommand}}.  It looks like it is
trying to preserve the checksum type as it writes copies to the destination.  Now that we
support checksums, do we need to implement the other create() call that takes ChecksumOpts?
 I'm not sure what the semantics are there, but we could throw an exception if options are
specified and the type doesn't match.  As is, we just ignore any options passed (the argument
is discarded in {{FileSystem#create()}}).  Should we at least document the behavior?

Other than that question... LGTM, +1.
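To illustrate the equality semantics noted above (equal only when both the algorithm and the checksum bytes match), here is a minimal sketch. {{EtagChecksum}} is a simplified, hypothetical stand-in for the patch's checksum class, not Hadoop's actual implementation:

```java
import java.util.Arrays;

// Hypothetical, simplified stand-in for a FileChecksum implementation:
// two checksums compare equal only when BOTH the algorithm name and the
// raw checksum bytes match, so checksums from different algorithms
// (or different filesystems) never compare equal by accident.
final class EtagChecksum {
    private final String algorithm;  // e.g. "etag" for an S3 blob
    private final byte[] bytes;

    EtagChecksum(String algorithm, byte[] bytes) {
        this.algorithm = algorithm;
        this.bytes = bytes.clone();
    }

    @Override
    public boolean equals(Object o) {
        if (this == o) return true;
        if (!(o instanceof EtagChecksum)) return false;
        EtagChecksum other = (EtagChecksum) o;
        // Algorithm must match first; only then are the bytes comparable.
        return algorithm.equals(other.algorithm)
            && Arrays.equals(bytes, other.bytes);
    }

    @Override
    public int hashCode() {
        return 31 * algorithm.hashCode() + Arrays.hashCode(bytes);
    }
}
```

Same bytes under different algorithm names are deliberately unequal, which is what makes a cross-filesystem comparison safe.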

> S3 blob etags to be made visible in status/getFileChecksum() calls
> ------------------------------------------------------------------
>                 Key: HADOOP-13282
>                 URL: https://issues.apache.org/jira/browse/HADOOP-13282
>             Project: Hadoop Common
>          Issue Type: Sub-task
>          Components: fs/s3
>    Affects Versions: 2.9.0
>            Reporter: Steve Loughran
>            Assignee: Steve Loughran
>            Priority: Minor
>         Attachments: HADOOP-13282-001.patch, HADOOP-13282-002.patch, HADOOP-13282-003.patch,
> If the etags of blobs were exported via {{getFileChecksum()}}, it'd be possible to probe
> for a blob being in sync with a local file. Distcp could use this to decide whether to skip
> a file or not.
> Now, there's a problem there: distcp needs source and dest filesystems to implement the
> same algorithm. It'd only work out of the box if you were copying between S3 instances. There
> are also quirks with encryption and multipart: [s3 docs|http://docs.aws.amazon.com/AmazonS3/latest/API/RESTCommonResponseHeaders.html].
> At the very least, it's something which could be used when indexing the FS, to check for changes.
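The distcp use case described above boils down to a skip decision: copy unless both sides report a checksum, the algorithms match (otherwise the bytes are not comparable), and the bytes are equal. A hypothetical sketch of that decision, with illustrative names rather than distcp's actual API:

```java
import java.util.Arrays;

// Hypothetical sketch of a distcp-style "can we skip this file?" check.
// A copy is skipped only when both stores expose a checksum, the checksum
// algorithms are the same (e.g. both sides are S3 etags), and the checksum
// bytes themselves are equal. Anything else forces a copy.
public final class SkipCheck {
    static boolean canSkip(String srcAlgorithm, byte[] srcBytes,
                           String dstAlgorithm, byte[] dstBytes) {
        if (srcAlgorithm == null || dstAlgorithm == null) {
            return false;  // one side has no checksum: must copy
        }
        return srcAlgorithm.equals(dstAlgorithm)
            && Arrays.equals(srcBytes, dstBytes);
    }
}
```

Defaulting to "copy" on any mismatch or missing checksum keeps the check safe: a false negative only costs an extra copy, never a stale destination.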

This message was sent by Atlassian JIRA

To unsubscribe, e-mail: common-issues-unsubscribe@hadoop.apache.org
For additional commands, e-mail: common-issues-help@hadoop.apache.org
