hadoop-common-issues mailing list archives

From "Andrew Hitchcock (JIRA)" <j...@apache.org>
Subject [jira] Commented: (HADOOP-6254) s3n fails with SocketTimeoutException
Date Fri, 18 Sep 2009 00:58:57 GMT

    [ https://issues.apache.org/jira/browse/HADOOP-6254?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12756889#action_12756889

Andrew Hitchcock commented on HADOOP-6254:

Hi Tom,

The problem with changing the socket read timeout is that Hadoop tasks can process data at an
arbitrary rate, which means that mapper input data from Amazon S3 may be read arbitrarily slowly.
There are two timeouts you can hit with Amazon S3 if you leave a socket open for long enough
without pulling any data from it:

* You can hit a client-side timeout, which is configurable, and appears as a SocketTimeoutException.
* You can hit an Amazon S3 server-side timeout, which is not configurable, and appears as
a SocketException("Connection reset by peer").
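The client-side timeout above is just the plain socket read timeout. As a generic illustration (not the s3n configuration mechanism itself), it could be set like this:

```java
import java.net.Socket;
import java.net.SocketException;

public class TimeoutDemo {
    public static void main(String[] args) throws SocketException {
        Socket s = new Socket();  // unconnected socket, used only for illustration
        s.setSoTimeout(60000);    // client-side read timeout: 60 seconds;
                                  // a read that waits longer than this throws
                                  // SocketTimeoutException
        System.out.println(s.getSoTimeout());
    }
}
```

The server-side timeout, by contrast, is enforced by Amazon S3 itself and cannot be tuned from the client.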

Just increasing the client-side timeout has four problems:

1. Increasing the timeout keeps the connection open longer, whereas what we're trying to
do is give up the connection after a reasonable timeout and then reopen it when we need it
again. That way we play more nicely with various system resources.
2. No matter what value we choose, one can imagine a task pulling data even more slowly and
so still hitting this exception.
3. There is some value of the client-side timeout above which all that happens is that we
get a server-side timeout instead.
4. As a general rule, you don't want client socket timeouts to be too big, because it is always
possible for a server to get "stuck" and stop sending data, in which case you want to recognize
the failure in a timely manner via the timeout. (Not that Amazon S3 is known to have any
such issues, but it's best to be defensive in error handling.)
Thus I now think the best solution is:

* Catch all IOExceptions and then retry once
* Keep the socket timeout at 60 seconds, as that seems a reasonable trade-off between the cost
of holding a connection open and the cost of reestablishing the connection.
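The retry-once approach could be sketched roughly as follows. This is a hypothetical helper, not the actual HADOOP-6254 patch; the `Callable` passed in stands in for the read-and-reopen logic against S3:

```java
import java.io.IOException;
import java.util.concurrent.Callable;

public class RetryOnce {
    // Run the given action; on any IOException, retry exactly once.
    // The action is expected to reopen the connection itself (e.g. a
    // fresh S3 GET with a Range header starting at the current offset).
    static <T> T withOneRetry(Callable<T> action) throws Exception {
        try {
            return action.call();
        } catch (IOException first) {
            // The server likely closed an idle connection, or the
            // client-side socket timeout fired; try again once.
            return action.call();
        }
    }

    public static void main(String[] args) throws Exception {
        final int[] attempts = {0};
        int result = withOneRetry(() -> {
            attempts[0]++;
            if (attempts[0] == 1) {
                throw new IOException("Connection reset by peer");
            }
            return 42;
        });
        System.out.println(result + " after " + attempts[0] + " attempts");
    }
}
```

Retrying only once keeps the behavior simple: a single dropped idle connection is recovered transparently, while a persistently failing server still surfaces its IOException to the caller.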

I'll prepare a new patch.

> s3n fails with SocketTimeoutException
> -------------------------------------
>                 Key: HADOOP-6254
>                 URL: https://issues.apache.org/jira/browse/HADOOP-6254
>             Project: Hadoop Common
>          Issue Type: Bug
>          Components: fs/s3
>    Affects Versions: 0.18.3, 0.19.2, 0.20.1
>            Reporter: Andrew Hitchcock
>            Assignee: Andrew Hitchcock
>         Attachments: HADOOP-6254.diff
> If a user's map function is CPU intensive and doesn't read from the input very quickly,
> compounded by the buffering of input, then S3 might think the connection has been lost and
> will close the connection. Then when the user attempts to read from the input again, they'll
> receive a SocketTimeoutException and the task will fail.

This message is automatically generated by JIRA.
You can reply to this email to add a comment to the issue online.
