hadoop-common-issues mailing list archives

From "Steve Loughran (JIRA)" <j...@apache.org>
Subject [jira] [Commented] (HADOOP-15541) AWS SDK can mistake stream timeouts for EOF and throw SdkClientExceptions
Date Fri, 15 Jun 2018 09:41:00 GMT

    [ https://issues.apache.org/jira/browse/HADOOP-15541?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16513608#comment-16513608 ]

Steve Loughran commented on HADOOP-15541:
-----------------------------------------

I've worried about something related to this for a while, precisely because we are using close()
rather than abort(). Assuming the error on read() is due to a network problem, breaking the whole
TCP connection is the only way to guarantee that your follow-up GET isn't issued on the same
HTTP/1.1 connection.

I wasn't too worried, on the basis that nobody had complained...clearly that's no longer
true. And I'd expected the failure mode to be worse than this.

Here's one possible strategy:

# {{S3AInputStream.reopen()}} adds a {{boolean forceAbort}} param and passes it down to {{closeStream}};

# {{S3AInputStream.onReadFailure()}} forces that abort.

Like you say, no real point in not aborting here.
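A rough sketch of what that strategy could look like. To be clear, this is a hedged mock, not the real {{S3AInputStream}}: {{S3AInputStreamSketch}}, {{FakeHttpStream}} and the field names are invented for illustration, and the real code talks to the AWS SDK's request/connection machinery rather than these stand-ins.

```java
// Hypothetical sketch of the proposed forceAbort plumbing. All class and
// field names here are invented stand-ins, not Hadoop/AWS SDK types.
class FakeHttpStream {
    boolean aborted;
    boolean closed;

    void abort() { aborted = true; }          // stand-in for aborting the HTTP request
    void drainAndClose() { closed = true; }   // stand-in for drain-then-close
}

class S3AInputStreamSketch {
    FakeHttpStream wrappedStream = new FakeHttpStream();
    boolean lastCloseWasAbort;

    /** On a read failure the connection may be mid-response: always abort. */
    void onReadFailure(java.io.IOException e, long targetPos) throws java.io.IOException {
        reopen(targetPos, true);
    }

    /** reopen() gains the proposed forceAbort param and passes it down. */
    void reopen(long targetPos, boolean forceAbort) throws java.io.IOException {
        closeStream(forceAbort);
        wrappedStream = new FakeHttpStream();  // stand-in for issuing a fresh GET
    }

    void closeStream(boolean forceAbort) {
        if (forceAbort) {
            wrappedStream.abort();         // drop the TCP connection outright
        } else {
            wrappedStream.drainAndClose(); // normal path: connection can be pooled
        }
        lastCloseWasAbort = forceAbort;
    }
}
```

The point of the flag is that only the failure path pays the abort cost; the normal seek/close path still drains and recycles the pooled connection.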

Happy for a patch, I don't think we can test this easily so not expecting any tests in the
patch...

> AWS SDK can mistake stream timeouts for EOF and throw SdkClientExceptions
> -------------------------------------------------------------------------
>
>                 Key: HADOOP-15541
>                 URL: https://issues.apache.org/jira/browse/HADOOP-15541
>             Project: Hadoop Common
>          Issue Type: Bug
>          Components: fs/s3
>    Affects Versions: 2.9.1, 2.8.4, 3.0.2, 3.1.1
>            Reporter: Sean Mackrory
>            Assignee: Sean Mackrory
>            Priority: Major
>
> I've gotten a few reports of read timeouts not being handled properly in some Impala
> workloads. What happens is the following sequence of events (credit to Sailesh Mukil for
> figuring this out):
>  * S3AInputStream.read() gets a SocketTimeoutException when it calls wrappedStream.read()
>  * This is handled by onReadFailure -> reopen -> closeStream. When we try to drain
> the stream, SdkFilterInputStream.read() in the AWS SDK fails because of checkLength. The
> underlying Apache Commons stream returns -1 both when it times out and at EOF.
>  * The SDK assumes the -1 signifies an EOF, so it expects the bytes read to equal the
> expected bytes; because they don't (it's a timeout, not an EOF), it throws an SdkClientException.
> This is tricky to test for without a ton of mocking of AWS SDK internals, because you
> have to get into this conflicting state where the SDK has only read a subset of the
> expected bytes and gets a -1.
> closeStream will abort the stream in the event of an IOException when draining. We could
> simply also abort in the event of an SdkClientException. I'm testing that this results in
> correct functionality in the workloads that seem to hit these timeouts a lot, and all the
> s3a tests continue to work with that change. I'm going to open an issue on the AWS SDK
> GitHub as well, but I'm not sure what the ideal outcome would be unless there's a good way
> to distinguish between a stream that has timed out and a stream that has read all the data,
> short of huge rewrites.
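The abort-on-SdkClientException change described in the issue could be sketched roughly as below. This is a hedged, self-contained mock: {{SdkClientException}} and {{DrainingCloser}} here are simplified stand-ins, not the real AWS SDK or {{S3AInputStream}} types.

```java
// Stand-in for com.amazonaws.SdkClientException (an unchecked exception).
class SdkClientException extends RuntimeException {
    SdkClientException(String msg) { super(msg); }
}

// Hypothetical simplified drain-or-abort logic. The real closeStream() lives
// in S3AInputStream and calls the SDK's request abort; here a boolean records it.
class DrainingCloser {
    boolean aborted;

    void closeStream(java.io.InputStream in, boolean shouldAbort) {
        boolean abort = shouldAbort;
        if (!abort) {
            try {
                // Drain the remaining bytes so the pooled connection can be reused.
                while (in.read() >= 0) {
                    // discard
                }
            } catch (java.io.IOException | SdkClientException e) {
                // A timed-out stream can surface either exception from the SDK's
                // length check; either way, recycling the connection is unsafe,
                // so fall back to aborting it.
                abort = true;
            }
        }
        if (abort) {
            aborted = true;  // real code would abort the underlying HTTP request
        }
    }
}
```

The essence of the fix is only the widened catch: an SdkClientException raised while draining is treated exactly like an IOException, so the connection is aborted instead of the exception propagating to the caller.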



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

