hadoop-common-issues mailing list archives

From "Rajesh Balamohan (JIRA)" <j...@apache.org>
Subject [jira] [Updated] (HADOOP-13203) S3a: Consider reducing the number of connection aborts by setting correct length in s3 request
Date Wed, 25 May 2016 10:13:12 GMT

     [ https://issues.apache.org/jira/browse/HADOOP-13203?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel

Rajesh Balamohan updated HADOOP-13203:
    Attachment: HADOOP-13203-branch-2-001.patch

Yes [~steve_l]. In workloads like Hive there are many random seeks, and the internal
connection often had to be aborted. With this patch it was a lot cheaper to reuse the
connection. The amount of data to ask for in each request can be determined by
"Math.max(targetPos + readahead, (targetPos + length))".
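A minimal sketch of that range calculation, for illustration only. The class and method names are hypothetical and are not taken from the patch; capping the result at the file's content length is an added assumption, since a request cannot usefully extend past the end of the object.

```java
public final class RequestLimit {
    private RequestLimit() {}

    /**
     * Exclusive end offset of the range to request when (re)opening the stream.
     * Hypothetical helper: only the Math.max(...) expression comes from the comment above.
     *
     * @param targetPos     offset the reader is seeking to
     * @param length        number of bytes the caller asked to read
     * @param readahead     configured readahead, in bytes
     * @param contentLength total length of the object (the cap is an assumption)
     */
    public static long requestLimit(long targetPos, long length,
                                    long readahead, long contentLength) {
        long wanted = Math.max(targetPos + readahead, targetPos + length);
        // Assumption: never ask for bytes beyond the end of the file.
        return Math.min(contentLength, wanted);
    }

    public static void main(String[] args) {
        // Small read: readahead dominates, so the request ends at 1,065,536.
        System.out.println(requestLimit(1_000_000L, 4_096L, 65_536L, 10_000_000L));
        // Large read: the read length dominates, so the request ends at 2,048,576.
        System.out.println(requestLimit(1_000_000L, 1_048_576L, 65_536L, 10_000_000L));
        // Near the end of the file: capped at the content length, 10,000,000.
        System.out.println(requestLimit(9_990_000L, 4_096L, 65_536L, 10_000_000L));
    }
}
```

With a range this short, little or nothing is left unread when the next seek arrives, which is what makes reusing the connection cheap.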

From the unit test perspective for aws, there were the following issues:

Test timeout failures:
- TestS3ADeleteManyFiles.testBulkRenameAndDelete
- org.apache.hadoop.fs.contract.s3a.TestS3AContractDistCp.largeFilesToRemote, largeFilesFromRemote
- org.apache.hadoop.fs.s3a.scale.TestS3ADeleteManyFiles.testBulkRenameAndDelete

Other failures:
- org.apache.hadoop.fs.contract.s3a.TestS3AContractRootDir (Root directory operation rejected)
- This is already tracked in another jira.

- org.apache.hadoop.fs.s3a.scale.TestS3AInputStreamPerformance.testReadAheadDefault/testReadBigBlocksBigReadahead
(Earlier this expected 1 open, but now there can be multiple, since requestedStreamLen is no
longer the file's length. At most, we would save a single readahead call; for the rest, the
stream has to be opened multiple times.
But this is OK compared with the connection re-establishments in real workloads, e.g. Hive,
where a completely random set of ranges can be requested.) I have not updated the
patch to fix this failure. Based on inputs, I can revise the patch.
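To illustrate why the test's open count changes, here is a toy model of a purely sequential read. It is a sketch under stated assumptions, not the test or the S3AInputStream code: the class, the method, and the model itself (each open covers a fixed range, and a reopen happens only when the reader reaches the end of that range) are hypothetical.

```java
public final class OpenCounter {
    private OpenCounter() {}

    /**
     * Counts how many times a sequentially read stream is opened in a simplified model.
     * readSize must be positive.
     *
     * @param wholeFile true: each open requests up to the file length (old behaviour);
     *                  false: each open requests max(pos + readahead, pos + readSize)
     */
    public static int countOpens(long fileLen, long readSize,
                                 long readahead, boolean wholeFile) {
        int opens = 0;
        long pos = 0;
        long limit = 0; // exclusive end of the currently open request
        while (pos < fileLen) {
            long n = Math.min(readSize, fileLen - pos);
            if (pos >= limit) {
                // The open request is exhausted (or nothing is open yet): open again.
                limit = wholeFile
                        ? fileLen
                        : Math.min(fileLen, Math.max(pos + readahead, pos + n));
                opens++;
            }
            // A read never crosses the end of the open request.
            pos = Math.min(pos + n, limit);
        }
        return opens;
    }

    public static void main(String[] args) {
        long fileLen = 1_048_576L;  // 1 MiB file
        long readSize = 16_384L;    // 16 KiB per read call
        long readahead = 65_536L;   // 64 KiB readahead
        System.out.println(countOpens(fileLen, readSize, readahead, true));  // 1
        System.out.println(countOpens(fileLen, readSize, readahead, false)); // 16
    }
}
```

In this model a sequential scan pays one extra open per readahead-sized chunk, which matches the failing assertion; the trade-off is accepted because random-access workloads avoid aborting connections.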

> S3a: Consider reducing the number of connection aborts by setting correct length in s3
> ----------------------------------------------------------------------------------------------
>                 Key: HADOOP-13203
>                 URL: https://issues.apache.org/jira/browse/HADOOP-13203
>             Project: Hadoop Common
>          Issue Type: Sub-task
>          Components: fs/s3
>            Reporter: Rajesh Balamohan
>            Priority: Minor
>         Attachments: HADOOP-13203-branch-2-001.patch
> Currently the file's "contentLength" is set as the "requestedStreamLen" when invoking S3AInputStream::reopen().
> As part of lazySeek(), the stream sometimes has to be closed and reopened. But the stream
> was often closed with abort(), leaving the internal HTTP connection unusable.
> This incurs a lot of connection establishment cost in some jobs. It would be good to set the
> correct value for the stream length to avoid connection aborts.
> I will post the patch once the aws tests pass on my machine.
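The effect of the requested length on the close path can be sketched as follows. This is an illustrative model only: the class, the method, and the idea of a drain threshold are assumptions and are not quoted from S3AInputStream. The assumption is that a stream with few unread bytes can be drained and its connection returned for reuse, while a stream with many unread bytes is aborted.

```java
public final class CloseDecision {
    private CloseDecision() {}

    /**
     * Decides whether closing a stream would abort the connection.
     * Hypothetical rule: abort when more bytes remain in the request than
     * it is worth draining.
     *
     * @param pos                current position in the object
     * @param requestedStreamLen exclusive end offset of the open request
     * @param drainThreshold     assumed maximum number of bytes worth draining
     */
    public static boolean shouldAbort(long pos, long requestedStreamLen,
                                      long drainThreshold) {
        long remaining = requestedStreamLen - pos;
        return remaining > drainThreshold;
    }

    public static void main(String[] args) {
        long contentLength = 1L << 30;   // 1 GiB object
        long pos = 1L << 20;             // reader is 1 MiB into the object
        long threshold = 64L * 1024;     // assumed 64 KiB drain threshold

        // Request covers the whole file: almost 1 GiB is unread, so abort.
        System.out.println(shouldAbort(pos, contentLength, threshold));     // true
        // Request ends 64 KiB past the position: cheap to drain, so reuse.
        System.out.println(shouldAbort(pos, pos + 64L * 1024, threshold));  // false
    }
}
```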
