hadoop-common-issues mailing list archives

From "Steve Loughran (JIRA)" <j...@apache.org>
Subject [jira] [Resolved] (HADOOP-13654) S3A create() to support asynchronous check of dest & parent paths
Date Mon, 04 Sep 2017 14:35:03 GMT

     [ https://issues.apache.org/jira/browse/HADOOP-13654?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel

Steve Loughran resolved HADOOP-13654.
    Resolution: Won't Fix

> S3A create() to support asynchronous check of dest & parent paths
> -----------------------------------------------------------------
>                 Key: HADOOP-13654
>                 URL: https://issues.apache.org/jira/browse/HADOOP-13654
>             Project: Hadoop Common
>          Issue Type: Sub-task
>          Components: fs/s3
>    Affects Versions: 2.7.3
>            Reporter: Steve Loughran
> One source of delays in S3A is the need to check if a destination path exists in create;
this makes sure the operation isn't trying to overwrite a directory.
> # This is slow: 1-4 HTTPS requests
> # The code doesn't seem to check the entire parent path to make sure there isn't a file
as a parent (which raises the question: shouldn't we have a contract test for this?)
> # Even with the create overwrite=false check, the fact that the new object isn't created
until the output stream is close()'d means that the check has race conditions.
> Instead of doing a synchronous check in create(), we could do an asynchronous check of
the parent directory tree. If any error surfaced, this could be cached and then thrown on
the next call to: write(), flush() or close(); that is, the failure of a create due to path
problems would not surface immediately on the create() call, *but before any writes were committed*.
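
As a rough illustration of the deferred-failure idea described above, here is a minimal Java sketch. It is not S3A code: `DeferredCheckOutputStream`, `PathCheck`, and the in-memory buffer standing in for the upload are hypothetical names invented for this example. The path check runs on a supplied executor; write() and flush() rethrow a failure once it is known, and close() waits for the result before anything is committed.

```java
import java.io.ByteArrayOutputStream;
import java.io.IOException;
import java.io.OutputStream;
import java.util.concurrent.CompletableFuture;
import java.util.concurrent.CompletionException;
import java.util.concurrent.Executor;

// Hypothetical sketch of a deferred path check; not the S3A implementation.
class DeferredCheckOutputStream extends OutputStream {

  /** Hypothetical callback that probes the destination and its parents. */
  interface PathCheck {
    void run() throws IOException;
  }

  private final CompletableFuture<Void> check;
  // Stand-in for the real upload (assumption: data is only buffered here).
  private final ByteArrayOutputStream pending = new ByteArrayOutputStream();
  private boolean closed;
  private boolean committed;

  DeferredCheckOutputStream(PathCheck pathCheck, Executor pool) {
    // Start the check immediately; create() would return without waiting.
    this.check = CompletableFuture.runAsync(() -> {
      try {
        pathCheck.run();
      } catch (IOException e) {
        throw new CompletionException(e);
      }
    }, pool);
  }

  /** Rethrow a check failure; if block is false, only when already known. */
  private void rethrowFailure(boolean block) throws IOException {
    if (!block && !check.isDone()) {
      return;
    }
    try {
      check.join();
    } catch (CompletionException e) {
      Throwable cause = e.getCause();
      throw (cause instanceof IOException)
          ? (IOException) cause
          : new IOException(cause);
    }
  }

  @Override
  public void write(int b) throws IOException {
    rethrowFailure(false);
    pending.write(b);
  }

  @Override
  public void flush() throws IOException {
    rethrowFailure(false);
  }

  @Override
  public void close() throws IOException {
    if (closed) {
      return;
    }
    closed = true;
    try {
      // The check must have finished before any data is committed.
      rethrowFailure(true);
    } catch (IOException e) {
      pending.reset(); // stand-in for aborting an in-progress upload
      throw e;
    }
    committed = true; // stand-in for completing the upload
  }

  boolean isCommitted() {
    return committed;
  }
}
```

With a direct executor such as `Runnable::run` the check runs inline, which is convenient for testing; a real client would presumably hand the check to a shared thread pool instead.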
> The full directory tree can/should be checked, and its results remembered. This would
allow for the post-commit cleanup to issue delete() requests purely for those paths (if any)
which referred to directories.
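
The tree walk described above could look something like the following Java sketch. Everything here is an assumption made for illustration: `ParentTreeCheck`, `Kind`, and the `probe` function (a stand-in for whatever status lookup the store offers) are hypothetical, and paths are plain strings rather than Hadoop `Path` objects. The walk fails if any ancestor is a file and remembers which ancestors are directories, so a later cleanup only needs to touch those.

```java
import java.io.IOException;
import java.util.ArrayList;
import java.util.List;
import java.util.function.Function;

// Hypothetical sketch; names and the probe function are assumptions.
class ParentTreeCheck {

  enum Kind { ABSENT, FILE, DIRECTORY }

  /**
   * Walk from the immediate parent of dest up towards the root.
   * Returns the ancestors that exist as directories, nearest first.
   * Throws if any ancestor turns out to be a file.
   */
  static List<String> checkParents(String dest, Function<String, Kind> probe)
      throws IOException {
    List<String> directories = new ArrayList<>();
    int slash = dest.lastIndexOf('/');
    while (slash > 0) {
      String parent = dest.substring(0, slash);
      Kind kind = probe.apply(parent);
      if (kind == Kind.FILE) {
        throw new IOException("Parent path is a file: " + parent);
      }
      if (kind == Kind.DIRECTORY) {
        directories.add(parent); // remembered for post-commit cleanup
      }
      slash = parent.lastIndexOf('/');
    }
    return directories;
  }
}
```

Absent ancestors are skipped rather than recorded, so the returned list contains only the paths for which a delete() would be worth issuing.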
> As well as the need to use the AWS thread pool, there's a bit of complexity with cancelling
multipart uploads: the output stream needs to know that the request failed, and that the multipart
should be aborted.
> If the complexity of the asynchronous calls can be coped with, *and client code is happy
to accept errors in any IO call to the output stream*, then the initial overhead at file
creation could be skipped.
