hadoop-common-issues mailing list archives

From "Steve Loughran (JIRA)" <j...@apache.org>
Subject [jira] [Commented] (HADOOP-13203) S3a: Consider reducing the number of connection aborts by setting correct length in s3 request
Date Mon, 06 Jun 2016 19:50:21 GMT

    [ https://issues.apache.org/jira/browse/HADOOP-13203?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15317086#comment-15317086 ]

Steve Loughran commented on HADOOP-13203:

It looks like, as people note, the change may make forward seeking, or a mix of seek() + read()
calls, more expensive. More specifically, it could well accelerate a sequence of readFully()
offset calls, but not handle so well sequences such as seek(pos) + read(pos, n) + seek(pos + n + n2),
which the forward skipping could handle.

Even for readFully() calls, it won't cope well with any mix of read() + readFully(),
as the first read() will have triggered a to-end-of-file read.
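
To make that cost concrete, here is a minimal Java sketch of the close-versus-abort decision a stream faces when it has to move away from an open request. The class, the method names and the 4096-byte drain threshold are assumptions made for illustration; this is not the S3AInputStream code.

```java
/** Illustrative model only: names and threshold are hypothetical. */
final class ReopenCostSketch {
    /** Hypothetical limit: at most this many unread bytes are cheap enough to drain. */
    static final long DRAIN_THRESHOLD = 4096;

    /** Bytes still unread in the current GET, whose range ends (exclusive) at rangeEnd. */
    static long unreadBytes(long streamPos, long rangeEnd) {
        return Math.max(0, rangeEnd - streamPos);
    }

    /** True if leaving this request means aborting it, which discards the HTTP connection. */
    static boolean mustAbort(long streamPos, long rangeEnd) {
        return unreadBytes(streamPos, rangeEnd) > DRAIN_THRESHOLD;
    }

    public static void main(String[] args) {
        long fileLength = 1L << 30; // a 1 GiB object
        // read() at offset 0 opened a GET to end-of-file; one byte has been consumed.
        System.out.println(mustAbort(1, fileLength)); // true: almost 1 GiB is unread
        // The same read, but with the GET bounded to the first 2 KiB.
        System.out.println(mustAbort(1, 2048));       // false: 2047 bytes can be drained
    }
}
```

In this model, a readFully() at a distant offset that follows the first read() always lands in the abort case when the request ran to end-of-file.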

It seems to me that one could actually get something of both if every read specified a block
length, such as 64KB. On sustained forward reads, the stream would read on into the next block
whenever the boundary was reached. On mixed seek/read operations, where the range of the read
is unknown, this would significantly optimise any random access use, rather than only those
which exclusively used one read operation.
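
One way to picture the block-length idea is as a small function that picks the end of the range to request. This is a sketch under assumptions: the class and method names are hypothetical, and the 64KB block size is only the example figure from the paragraph above.

```java
/** Illustrative sketch only: not part of any patch attached to this issue. */
final class BlockRangeSketch {
    /** Example block size; 64KB is the figure suggested above, not a tuned value. */
    static final long BLOCK_SIZE = 64 * 1024;

    /**
     * Returns the exclusive end offset of the range to request when (re)opening at pos.
     * At least one block is requested, more if the caller asked for more, and the
     * range never extends past the end of the file.
     */
    static long rangeEnd(long pos, long requestedLength, long contentLength) {
        if (pos < 0 || requestedLength < 0 || pos > contentLength) {
            throw new IllegalArgumentException("invalid position or length");
        }
        long wanted = Math.max(requestedLength, BLOCK_SIZE);
        long available = contentLength - pos;
        return pos + Math.min(wanted, available);
    }

    public static void main(String[] args) {
        long len = 1_000_000;
        System.out.println(rangeEnd(0, 1, len));        // 65536: a read() of unknown range gets one block
        System.out.println(rangeEnd(0, 200_000, len));  // 200000: a readFully() of known length gets all of it
        System.out.println(rangeEnd(990_000, 1, len));  // 1000000: clipped at end of file
    }
}
```

With bounded ranges like these, a seek away from the current request leaves at most one block unread, so the connection is more likely to be reusable.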

And here's the problem: right now we don't know which API and file-access patterns are in
widespread use against S3. We don't have the data. I can see what you're highlighting: the current
mechanism is very expensive for backwards seeks. But we have only just optimised forward seeking
*and* instrumented the code to collect detail on what's actually going on.

# I don't want to rush into a change which has the potential to make some existing codepaths
worse, especially as we don't know how the FS gets used.
# I'd really like to see collected statistics on FS usage across a broad dataset. Anyone here
is welcome to contribute to this; it should include statistics gathered in downstream use.
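
As a rough idea of the kind of statistics that would help here, the following Java sketch counts forward and backward seeks and the bytes they skip. The class and its counters are hypothetical, and are not the instrumentation already added to S3A.

```java
import java.util.concurrent.atomic.AtomicLong;

/** Illustrative sketch only: hypothetical counters for seek patterns on one stream. */
final class SeekStatsSketch {
    private final AtomicLong forwardSeeks = new AtomicLong();
    private final AtomicLong backwardSeeks = new AtomicLong();
    private final AtomicLong bytesSkippedForward = new AtomicLong();
    private final AtomicLong bytesSkippedBackward = new AtomicLong();

    /** Records one seek from fromPos to toPos; a seek to the current position is ignored. */
    void recordSeek(long fromPos, long toPos) {
        long delta = toPos - fromPos;
        if (delta > 0) {
            forwardSeeks.incrementAndGet();
            bytesSkippedForward.addAndGet(delta);
        } else if (delta < 0) {
            backwardSeeks.incrementAndGet();
            bytesSkippedBackward.addAndGet(-delta);
        }
    }

    long forwardSeeks() { return forwardSeeks.get(); }
    long backwardSeeks() { return backwardSeeks.get(); }
    long bytesSkippedForward() { return bytesSkippedForward.get(); }
    long bytesSkippedBackward() { return bytesSkippedBackward.get(); }

    /** One-line summary suitable for a log message. */
    String summary() {
        return "forwardSeeks=" + forwardSeeks() + " bytesForward=" + bytesSkippedForward()
            + " backwardSeeks=" + backwardSeeks() + " bytesBackward=" + bytesSkippedBackward();
    }
}
```

Summaries like this, gathered from real workloads, would show whether forward skips, backward seeks or positioned reads dominate.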

I'm very tempted to argue this should be an S3a phase III improvement: it has ramifications,
and we should do it well. With the metrics, we are in a position to understand those ramifications
and, if we don't rush, to implement something which works well for a broad set of uses.

> S3a: Consider reducing the number of connection aborts by setting correct length in s3 request
> ----------------------------------------------------------------------------------------------
>                 Key: HADOOP-13203
>                 URL: https://issues.apache.org/jira/browse/HADOOP-13203
>             Project: Hadoop Common
>          Issue Type: Sub-task
>          Components: fs/s3
>            Reporter: Rajesh Balamohan
>            Assignee: Rajesh Balamohan
>            Priority: Minor
>         Attachments: HADOOP-13203-branch-2-001.patch, HADOOP-13203-branch-2-002.patch,
> Currently the file's "contentLength" is set as the "requestedStreamLen" when invoking S3AInputStream::reopen().
> As a part of lazySeek(), the stream sometimes has to be closed and reopened. But many
> times the stream is closed with abort(), making the underlying HTTP connection unusable.
> This incurs a lot of connection establishment cost in some jobs. It would be good to set the
> correct value for the stream length to avoid connection aborts.
> I will post the patch once the AWS tests pass on my machine.

