hadoop-common-issues mailing list archives

From "Steve Loughran (JIRA)" <j...@apache.org>
Subject [jira] [Updated] (HADOOP-13282) S3 blob etags to be made visible in S3A status/getFileChecksum() calls
Date Thu, 21 Dec 2017 15:05:02 GMT

     [ https://issues.apache.org/jira/browse/HADOOP-13282?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel

Steve Loughran updated HADOOP-13282:
       Resolution: Fixed
    Fix Version/s: 3.1.0
           Status: Resolved  (was: Patch Available)

Committed; the final regression test run was against S3A (S3 Ireland).

FWIW, checksum matching is used by distcp for incremental writes, which means it can't do incremental
copies between stores with different checksum algorithms. This patch doesn't address that, as even on
S3-to-S3 copies, multipart etags are not simple MD5 checksums. What we can rely on (hopefully!)
is that if two objects on the same store instance have the same etag, their data is equivalent.
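
As a rough illustration of why multipart etags can't be treated as plain MD5s, here is a
self-contained sketch. It assumes the commonly observed layout (hex MD5 of the concatenated
per-part MD5 digests, plus "-" and the part count); that layout is an assumption, not a
documented guarantee, and SSE-KMS or SSE-C encrypted objects don't carry MD5-based etags
at all. EtagSketch and its methods are hypothetical names, not part of S3A.

```java
import java.io.ByteArrayOutputStream;
import java.security.MessageDigest;
import java.security.NoSuchAlgorithmException;

/** Illustrative only: mimics the commonly observed S3 etag layouts. */
final class EtagSketch {

  /** MD5 digest of the byte range data[off, off + len). */
  static byte[] md5(byte[] data, int off, int len) {
    try {
      MessageDigest md = MessageDigest.getInstance("MD5");
      md.update(data, off, len);
      return md.digest();
    } catch (NoSuchAlgorithmException e) {
      // MD5 is one of the algorithms every Java platform must provide.
      throw new IllegalStateException(e);
    }
  }

  /** Lower-case hex encoding of a byte array. */
  static String hex(byte[] bytes) {
    StringBuilder sb = new StringBuilder(bytes.length * 2);
    for (byte b : bytes) {
      sb.append(String.format("%02x", b));
    }
    return sb.toString();
  }

  /** Single PUT without SSE-KMS/SSE-C: hex MD5 of the object bytes. */
  static String singlePartEtag(byte[] data) {
    return hex(md5(data, 0, data.length));
  }

  /** Assumed multipart layout: hex(MD5(md5(part1) || md5(part2) ...)) + "-" + partCount. */
  static String multipartEtag(byte[] data, int partSize) {
    if (partSize <= 0) {
      throw new IllegalArgumentException("partSize must be positive");
    }
    ByteArrayOutputStream partDigests = new ByteArrayOutputStream();
    int parts = 0;
    for (int off = 0; off < data.length; off += partSize) {
      int len = Math.min(partSize, data.length - off);
      byte[] digest = md5(data, off, len);
      partDigests.write(digest, 0, digest.length);
      parts++;
    }
    byte[] all = partDigests.toByteArray();
    return hex(md5(all, 0, all.length)) + "-" + parts;
  }
}
```

Under that assumption the same bytes uploaded with different part sizes get different etags,
so an etag mismatch doesn't prove the data differs; only a match on the same store is useful.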

What the patch does do is let anyone who tracks the checksums of (src, dest) check for a changed
destination artifact before attempting an upload.
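
A minimal sketch of that kind of tracking, under stated assumptions: UploadGuard is a
hypothetical helper, and the lookup function passed to it is a stand-in for whatever returns
the current etag of a path (in Hadoop that would be something built on getFileChecksum(); it
is deliberately left abstract here so the example has no Hadoop dependency).

```java
import java.util.HashMap;
import java.util.Map;
import java.util.Objects;
import java.util.Optional;
import java.util.function.Function;

/** Hypothetical helper: remembers the etag seen after each upload. */
final class UploadGuard {
  // Destination path -> etag recorded after our last successful upload.
  private final Map<String, String> lastSeen = new HashMap<>();
  // Stand-in for a store lookup; returns empty if the object does not exist.
  private final Function<String, Optional<String>> currentEtag;

  UploadGuard(Function<String, Optional<String>> currentEtag) {
    this.currentEtag = Objects.requireNonNull(currentEtag);
  }

  /** Call after a successful upload with the etag the store reported. */
  void recordUpload(String dest, String etag) {
    lastSeen.put(dest, etag);
  }

  /**
   * True if the destination is not in the state we last left it:
   * its etag differs from the recorded one, it appeared without us
   * uploading it, or it vanished after we uploaded it.
   */
  boolean destinationChanged(String dest) {
    String expected = lastSeen.get(dest);
    String actual = currentEtag.apply(dest).orElse(null);
    return !Objects.equals(expected, actual);
  }
}
```

This only compares etags from the same store instance, which is the one comparison the
paragraph above says can (hopefully) be relied on.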

> S3 blob etags to be made visible in S3A status/getFileChecksum() calls
> ----------------------------------------------------------------------
>                 Key: HADOOP-13282
>                 URL: https://issues.apache.org/jira/browse/HADOOP-13282
>             Project: Hadoop Common
>          Issue Type: Sub-task
>          Components: fs/s3
>    Affects Versions: 2.9.0
>            Reporter: Steve Loughran
>            Assignee: Steve Loughran
>            Priority: Minor
>             Fix For: 3.1.0
>         Attachments: HADOOP-13282-001.patch, HADOOP-13282-002.patch, HADOOP-13282-003.patch,
> If the etags of blobs were exported via {{getFileChecksum()}}, it'd be possible to probe
> for a blob being in sync with a local file. Distcp could use this to decide whether to skip
> a file or not.
> Now, there's a problem there: distcp needs source and dest filesystems to implement the
> same algorithm. It'd only work out of the box if you were copying between S3 instances. There
> are also quirks with encryption and multipart: [s3 docs|http://docs.aws.amazon.com/AmazonS3/latest/API/RESTCommonResponseHeaders.html].
> At the very least, it's something which could be used when indexing the FS, to check for changes
