hadoop-common-issues mailing list archives

From "ASF GitHub Bot (JIRA)" <j...@apache.org>
Subject [jira] [Commented] (HADOOP-14971) Merge S3A committers into trunk
Date Tue, 21 Nov 2017 00:30:01 GMT

    [ https://issues.apache.org/jira/browse/HADOOP-14971?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16260083#comment-16260083
] 

ASF GitHub Bot commented on HADOOP-14971:
-----------------------------------------

Github user ajfabbri commented on a diff in the pull request:

    https://github.com/apache/hadoop/pull/282#discussion_r152149693
  
    --- Diff: hadoop-tools/hadoop-aws/src/site/markdown/tools/hadoop-aws/index.md ---
    @@ -871,6 +871,166 @@ options are covered in [Testing](./testing.md).
     </property>
     ```
     
    +## <a name="retry_and_recovery"></a>Retry and Recovery
    +
    +The S3A client makes a best-effort attempt at recovering from network failures;
    +this section covers the details of what it does. 
    +
    +The S3A client divides exceptions returned by the AWS SDK into different
    +categories, and chooses a different retry policy based on their type and
    +whether or not the failing operation is idempotent.
    +
    + 
    +### Unrecoverable Problems: Fail Fast
    +
    +* No object/bucket store: `FileNotFoundException`
    +* No access permissions: `AccessDeniedException`
    +* Network errors considered unrecoverable (`UnknownHostException`,
    + `NoRouteToHostException`, `AWSRedirectException`).
    +* Interruptions: `InterruptedIOException`, `InterruptedException`.
    +* Rejected HTTP requests: `InvalidRequestException`
    +
    +These are all considered unrecoverable: S3A will make no attempt to recover
    +from them.
    +
    +### Possibly Recoverable Problems: Retry
    +
    +* Connection timeout: `ConnectTimeoutException`. Timeout before
    +setting up a connection to the S3 endpoint (or proxy).
    +* HTTP response status code 400, "Bad Request"
    +
    +The status code 400, Bad Request usually means that the request
    +is unrecoverable; it's the generic "No" response. Very rarely it
    +does recover, which is why it is in this category, rather than that
    +of unrecoverable failures. 
    +
    +These failures will be retried with a fixed sleep interval set in
    +`fs.s3a.retry.interval`, up to the limit set in `fs.s3a.retry.limit`.
    +
    +
    +### Only Retriable on Idempotent Operations
    +
    +Some network failures are considered to be retriable if they occur on
    +idempotent operations; there's no way to know if they happened
    +after the request was processed by S3.
    +
    +* `SocketTimeoutException`: general network failure.
    +* `EOFException` : the connection was broken while reading data
    +* "No response from Server" (443, 444) HTTP responses.
    +* Any other AWS client, service or S3 exception. 
    +
    +These failures will be retried with a fixed sleep interval set in
    +`fs.s3a.retry.interval`, up to the limit set in `fs.s3a.retry.limit`.
    +
    +*Important*: DELETE is considered idempotent, hence: `FileSystem.delete()`
    +and `FileSystem.rename()` will retry their delete requests on any
    +of these failures.
    +
    +The issue of whether delete should be idempotent has been a source
    +of historical controversy in Hadoop.
    +
    +1. In the absence of any other changes to the object store, a repeated
    +DELETE request will eventually result in the named object being deleted;
    +it's a no-op if reprocessed. As, indeed, is `FileSystem.delete()`.
    +1. If another client creates a file under the path, it will be deleted.
    +1. Any filesystem supporting an atomic `FileSystem.create(path, overwrite=false)`
    +operation to reject file creation if the path exists MUST NOT consider
    +delete to be idempotent, because a `create(path, false)` operation will
    +only succeed if the first `delete()` call has already succeeded.
    +1. And a second, retried `delete()` call could delete the new data.
    +
    +Because S3 is eventually consistent *and* doesn't support an
    +atomic create-no-overwrite operation, the choice is more ambiguous.
    +
    +Currently S3A considers delete to be
    +idempotent because it is convenient for many workflows, including the
    +commit protocols. Just be aware that in the presence of transient failures,
    +more things may be deleted than expected. (For anyone who considers this to
    +be the wrong decision: rebuild the `hadoop-aws` module with the constant
    +`S3AFileSystem.DELETE_CONSIDERED_IDEMPOTENT` set to `false`).
    +
    +
    +
    + 
    +
    +
    +### Throttled requests from S3 and Dynamo DB
    +
    +
    +When S3 or Dynamo DB returns a response indicating that requests
    +from the caller are being throttled, the request is retried with an
    +exponential back-off, starting from an initial interval and limited to
    +a maximum number of attempts.
    +
    +```xml
    +<property>
    +  <name>fs.s3a.retry.throttle.limit</name>
    +  <value>${fs.s3a.attempts.maximum}</value>
    +  <description>
    +    Number of times to retry any throttled request.
    +  </description>
    +</property>
    +
    +<property>
    +  <name>fs.s3a.retry.throttle.interval</name>
    +  <value>1000ms</value>
    +  <description>
    +    Interval between retry attempts on throttled requests.
    +  </description>
    +</property>
    +```
    +
    +Notes
    +
    +1. There is also throttling taking place inside the AWS SDK; this is managed
    +by the value `fs.s3a.attempts.maximum`.
    +1. Throttling events are tracked in the S3A filesystem metrics and statistics.
    +1. Amazon KMS may throttle a customer based on the total rate of uses of
    +KMS *across all user accounts and applications*.
    +
    +Throttling of S3 requests is all too common; it is caused by too many clients
    +trying to access the same shard of S3 Storage. This generatlly
    +happen if there are too many reads, those being the most common in Hadoop
    +applications. This problem is exacerbated by Hive's partitioning
    --- End diff --
    
    /generatlly happen/generally happens/
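
As a rough illustration of the retry rules quoted in the diff above, here is a minimal Java sketch. It is not the S3A implementation: `RetrySketch`, `Category`, `classify`, `shouldRetry` and `throttleDelayMs` are hypothetical names, only JDK exception types are classified (the AWS SDK and HTTP client types are left out), and plain doubling of the interval is an assumption, since the text only says "exponential back-off".

```java
import java.io.EOFException;
import java.io.FileNotFoundException;
import java.io.InterruptedIOException;
import java.net.NoRouteToHostException;
import java.net.SocketTimeoutException;
import java.net.UnknownHostException;

/** Hypothetical sketch of the retry categories; not the S3A code. */
class RetrySketch {

  /** The three categories described in the quoted documentation. */
  enum Category { FAIL_FAST, RETRY, RETRY_IF_IDEMPOTENT }

  /**
   * Classify JDK exception types only. Connection timeouts and HTTP 400
   * responses would map to RETRY, but their types are not in the JDK,
   * so this sketch never returns that value.
   */
  static Category classify(Exception e) {
    // SocketTimeoutException is a subclass of InterruptedIOException,
    // so it has to be tested before the fail-fast check below.
    if (e instanceof SocketTimeoutException || e instanceof EOFException) {
      return Category.RETRY_IF_IDEMPOTENT;
    }
    if (e instanceof FileNotFoundException
        || e instanceof UnknownHostException
        || e instanceof NoRouteToHostException
        || e instanceof InterruptedIOException
        || e instanceof InterruptedException) {
      return Category.FAIL_FAST;
    }
    // "Any other AWS client, service or S3 exception."
    return Category.RETRY_IF_IDEMPOTENT;
  }

  /** Decide whether another attempt is allowed after `retries` retries. */
  static boolean shouldRetry(Category c, boolean idempotent,
      int retries, int limit) {
    if (retries >= limit) {
      return false;
    }
    switch (c) {
      case RETRY:
        return true;
      case RETRY_IF_IDEMPOTENT:
        return idempotent;
      default:
        return false;
    }
  }

  /** Assumed back-off for throttled requests: initial interval, doubled per retry. */
  static long throttleDelayMs(long initialMs, int retries) {
    int shift = Math.min(retries, 20); // cap the exponent to avoid overflow
    return initialMs << shift;
  }

  public static void main(String[] args) {
    Category c = classify(new SocketTimeoutException("timed out"));
    System.out.println(c + " retry on idempotent op: "
        + shouldRetry(c, true, 0, 3));
    System.out.println("throttle delay after 3 retries: "
        + throttleDelayMs(1000L, 3) + "ms");
  }
}
```

In this sketch a `delete()` would be passed in with `idempotent` set to `true`, which is the behaviour the quoted text describes for `S3AFileSystem.DELETE_CONSIDERED_IDEMPOTENT`.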


> Merge S3A committers into trunk
> -------------------------------
>
>                 Key: HADOOP-14971
>                 URL: https://issues.apache.org/jira/browse/HADOOP-14971
>             Project: Hadoop Common
>          Issue Type: Sub-task
>          Components: fs/s3
>    Affects Versions: 3.0.0
>            Reporter: Steve Loughran
>            Assignee: Steve Loughran
>         Attachments: HADOOP-13786-040.patch, HADOOP-13786-041.patch
>
>
> Merge the HADOOP-13786 committer into trunk. This branch is being set up as a github PR for review there & to keep it out of the mailboxes of the watchers on the main JIRA



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)

---------------------------------------------------------------------
To unsubscribe, e-mail: common-issues-unsubscribe@hadoop.apache.org
For additional commands, e-mail: common-issues-help@hadoop.apache.org

