hadoop-common-dev mailing list archives

From "Sean Mackrory (JIRA)" <j...@apache.org>
Subject [jira] [Created] (HADOOP-15999) [s3a] Better support for out-of-band operations
Date Wed, 12 Dec 2018 00:09:00 GMT
Sean Mackrory created HADOOP-15999:

             Summary: [s3a] Better support for out-of-band operations
                 Key: HADOOP-15999
                 URL: https://issues.apache.org/jira/browse/HADOOP-15999
             Project: Hadoop Common
          Issue Type: New Feature
            Reporter: Sean Mackrory

S3Guard was initially designed on the premise that a new MetadataStore would be the source of
truth, and that it would provide no guarantees if updates were made without going through S3Guard.

I've been seeing increased demand for better support for scenarios where the data is modified
by operations that can't reasonably go through S3Guard. For example:
* A file is deleted using S3Guard, and then replaced by some other tool. S3Guard can't tell the
difference between the new file and a delete / list inconsistency, and continues to treat the
file as deleted.
* An S3Guard-ed file is overwritten with a longer file by some other tool. When reading the
file, only as many bytes as the original file's length are read.

We could possibly have smarter behavior here by querying both S3 and the MetadataStore (even
in cases where we may currently only query the MetadataStore in getFileStatus) and using whichever
one has the later modification time.
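
As a rough illustration of that comparison, here is a minimal Python sketch. It is not
the S3A implementation: the FileStatus record, the is_deleted tombstone flag, the
reconcile function and the tie-breaking rule (ties go to the MetadataStore) are all
hypothetical names and assumptions made for this example.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class FileStatus:
    """Hypothetical, simplified status record (not Hadoop's FileStatus)."""
    path: str
    length: int
    mod_time: int             # assumed: milliseconds since the epoch, UTC
    is_deleted: bool = False  # assumed: tombstone marker kept by the MetadataStore


def reconcile(s3_status: Optional[FileStatus],
              ms_status: Optional[FileStatus]) -> Optional[FileStatus]:
    """Pick the status to trust; None means "file not found"."""
    if ms_status is None:
        # The MetadataStore knows nothing about the path: defer to S3.
        return s3_status
    if s3_status is None:
        # S3 shows nothing: a tombstone means deleted, otherwise keep the entry
        # (assumption: the object may simply not be visible in S3 yet).
        return None if ms_status.is_deleted else ms_status
    if s3_status.mod_time > ms_status.mod_time:
        # S3 is strictly newer: an out-of-band write happened after our record.
        return s3_status
    # The MetadataStore is newer or tied (assumption: ties favor the MetadataStore).
    return None if ms_status.is_deleted else ms_status


if __name__ == "__main__":
    tombstone = FileStatus("s3a://bucket/data.csv", 0, 1500, is_deleted=True)
    replaced = FileStatus("s3a://bucket/data.csv", 25, 2000)
    # The file was deleted through S3Guard, then re-created by another tool.
    print(reconcile(replaced, tombstone))
```

With this rule, an object written out-of-band after a tombstone was recorded is visible
again, while an older object that S3 still lists after the delete stays hidden.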

This would eliminate the performance boost we currently get in some workloads from the
short-circuited getFileStatus, but we could keep the short-circuit in authoritative mode, which
should give a larger performance boost. At least we'd get more correctness without authoritative
mode, and a clear declaration of when we can make the assumptions required to short-circuit the
process. If we can't consider S3Guard the source of truth, we need to defer to S3 more.
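
The trade-off can be sketched in Python as follows. This is only an illustration under
stated assumptions: get_file_status, the two lookup callables and the authoritative flag
are hypothetical stand-ins, not the S3A API, and a status is reduced to a
(length, mod_time) tuple.

```python
from typing import Callable, Optional, Tuple

# Assumed shape of a status: (length in bytes, modification time in ms, UTC).
Status = Tuple[int, int]
Lookup = Callable[[str], Optional[Status]]


def get_file_status(path: str,
                    metadata_lookup: Lookup,
                    s3_lookup: Lookup,
                    authoritative: bool) -> Optional[Status]:
    """Return the status to use for path, or None if neither store has it."""
    cached = metadata_lookup(path)
    if authoritative and cached is not None:
        # Authoritative mode: trust the MetadataStore and skip the S3 request.
        return cached
    # Otherwise always ask S3 as well and keep the later modification time.
    remote = s3_lookup(path)
    candidates = [s for s in (cached, remote) if s is not None]
    if not candidates:
        return None
    # max() returns the first maximal item, so ties favor the MetadataStore.
    return max(candidates, key=lambda s: s[1])


if __name__ == "__main__":
    store = {"s3a://bucket/f": (10, 1000)}   # what S3Guard recorded
    bucket = {"s3a://bucket/f": (25, 2000)}  # overwritten out-of-band
    for mode in (True, False):
        print(mode, get_file_status("s3a://bucket/f", store.get, bucket.get, mode))
```

In the usage example, the authoritative call returns the recorded status without touching
S3, while the non-authoritative call pays for the extra S3 request and picks up the
longer, newer file.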

We'd need to be extra careful about any locality / time zone issues if we start relying on
mod_time more directly, but we're already tracking the modification time as returned by S3 anyway.
