hadoop-common-dev mailing list archives

From "Steve Loughran (JIRA)" <j...@apache.org>
Subject [jira] [Created] (HADOOP-13654) S3A create() to support asynchronous check of dest & parent paths
Date Sun, 25 Sep 2016 15:24:20 GMT
Steve Loughran created HADOOP-13654:

             Summary: S3A create() to support asynchronous check of dest & parent paths
                 Key: HADOOP-13654
                 URL: https://issues.apache.org/jira/browse/HADOOP-13654
             Project: Hadoop Common
          Issue Type: Sub-task
          Components: fs/s3
    Affects Versions: 2.7.3
            Reporter: Steve Loughran

One source of delays in S3A is the need to check if a destination path exists in create; this
makes sure the operation isn't trying to overwrite a directory.

# This is slow: 1-4 HTTPS requests.
# The code doesn't seem to check the entire parent path to make sure there isn't a file as
a parent (which raises the question: shouldn't we have a contract test for this?)
# Even with the create overwrite=false check, the new object isn't created until the output
stream is close()'d, so the check has race conditions.

Instead of doing a synchronous check in create(), we could do an asynchronous check of the
parent directory tree. If any error surfaced, this could be cached and then thrown on the
next call to: write(), flush() or close(); that is, the failure of a create due to path problems
would not surface immediately on the create() call, *but before any writes were committed*.
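
The deferred-failure idea above could be sketched roughly as follows. This is only an illustration, not S3A code: the class name DeferredCheckOutputStream is hypothetical, the path check is modelled as a plain CompletableFuture, and the JDK's java.nio.file.FileAlreadyExistsException stands in for whatever exception the real check would raise.

```java
import java.io.IOException;
import java.io.OutputStream;
import java.util.concurrent.CancellationException;
import java.util.concurrent.CompletableFuture;
import java.util.concurrent.CompletionException;

/**
 * Hypothetical wrapper: create() returns immediately, and any failure of the
 * asynchronous path check is rethrown on a later write(), flush() or close().
 */
class DeferredCheckOutputStream extends OutputStream {
  private final OutputStream inner;
  private final CompletableFuture<Void> pathCheck;

  DeferredCheckOutputStream(OutputStream inner, CompletableFuture<Void> pathCheck) {
    this.inner = inner;
    this.pathCheck = pathCheck;
  }

  /** Rethrow a check failure; if block is false, return at once while the check is still running. */
  private void rethrowCheckFailure(boolean block) throws IOException {
    if (!block && !pathCheck.isDone()) {
      return;
    }
    try {
      pathCheck.join();
    } catch (CompletionException | CancellationException e) {
      Throwable cause = e.getCause() != null ? e.getCause() : e;
      if (cause instanceof IOException) {
        throw (IOException) cause;
      }
      throw new IOException("Path check failed", cause);
    }
  }

  @Override
  public void write(int b) throws IOException {
    rethrowCheckFailure(false);
    inner.write(b);
  }

  @Override
  public void flush() throws IOException {
    rethrowCheckFailure(false);
    inner.flush();
  }

  @Override
  public void close() throws IOException {
    // The check must have finished before anything is committed, so block here.
    // On failure the inner stream is deliberately not closed; a real
    // implementation would abort the pending upload instead.
    rethrowCheckFailure(true);
    inner.close();
  }
}
```

In this sketch write() and flush() never wait for the check, so a slow check costs nothing until close(); the price is that a caller may buffer data for a file that can never be committed.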

The full directory tree can/should be checked, and its results remembered. This would allow
for the post-commit cleanup to issue delete() requests purely for those paths (if any) which
referred to directories.
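
A minimal sketch of that scan, under stated assumptions: ParentPathScan, Kind and the probe function are hypothetical names, and the probe stands in for whatever HEAD/LIST call would actually classify a path. The scan rejects any parent that is a file and remembers only the parents that exist as directories, so cleanup can be limited to those.

```java
import java.io.IOException;
import java.util.ArrayList;
import java.util.List;
import java.util.function.Function;

/** Hypothetical helper: walks the parents of an object key and records which ones are directories. */
final class ParentPathScan {
  enum Kind { ABSENT, FILE, DIRECTORY }

  private ParentPathScan() {
  }

  /** Parents of "a/b/c/file" are returned nearest first: "a/b/c", "a/b", "a". */
  static List<String> parentsOf(String key) {
    List<String> parents = new ArrayList<>();
    int slash = key.lastIndexOf('/');
    while (slash > 0) {
      key = key.substring(0, slash);
      parents.add(key);
      slash = key.lastIndexOf('/');
    }
    return parents;
  }

  /**
   * Probe every parent. Fails if any parent is a file; otherwise returns the
   * parents that exist as directories, i.e. the only candidates for a
   * post-commit delete().
   */
  static List<String> scan(String key, Function<String, Kind> probe) throws IOException {
    List<String> directories = new ArrayList<>();
    for (String parent : parentsOf(key)) {
      Kind kind = probe.apply(parent);
      if (kind == Kind.FILE) {
        throw new IOException("Parent path is a file: " + parent);
      }
      if (kind == Kind.DIRECTORY) {
        directories.add(parent);
      }
    }
    return directories;
  }
}
```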

As well as the need to use the AWS thread pool, there's a bit of complexity with cancelling
multipart uploads: the output stream needs to know that the request failed, and that the multipart
should be aborted.
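
One way the abort handling might look, as a sketch only: MultipartUpload and UploadCommitter are hypothetical stand-ins rather than AWS SDK or S3A types, and the path check is again modelled as a CompletableFuture. The point is simply that the commit step waits for the check and aborts the upload rather than completing it if the check failed.

```java
import java.io.IOException;
import java.util.concurrent.CancellationException;
import java.util.concurrent.CompletableFuture;
import java.util.concurrent.CompletionException;

/** Hypothetical handle on an in-progress multipart upload. */
interface MultipartUpload {
  void complete() throws IOException;

  void abort();
}

/** Hypothetical commit step, called from the output stream's close(). */
final class UploadCommitter {
  private UploadCommitter() {
  }

  static void finish(MultipartUpload upload, CompletableFuture<Void> pathCheck)
      throws IOException {
    try {
      // Wait for the asynchronous path check before committing anything.
      pathCheck.join();
    } catch (CompletionException | CancellationException e) {
      // The check failed or was cancelled: discard the uploaded parts.
      upload.abort();
      Throwable cause = e.getCause() != null ? e.getCause() : e;
      if (cause instanceof IOException) {
        throw (IOException) cause;
      }
      throw new IOException("Path check failed", cause);
    }
    upload.complete();
  }
}
```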

If the complexity of the asynchronous calls can be coped with, *and client code is happy to
accept errors in any IO call to the output stream*, then the initial overhead at file creation
could be skipped.

