hadoop-common-issues mailing list archives

From "Rajesh Balamohan (JIRA)" <j...@apache.org>
Subject [jira] [Updated] (HADOOP-13203) S3a: Consider reducing the number of connection aborts by setting correct length in s3 request
Date Wed, 15 Jun 2016 10:22:09 GMT

     [ https://issues.apache.org/jira/browse/HADOOP-13203?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
]

Rajesh Balamohan updated HADOOP-13203:
--------------------------------------
    Attachment: stream_stats.tar.gz
                HADOOP-13203-branch-2-004.patch

There is a corner case in which closing the stream should use {{requestedStreamLen}}
instead of {{contentLength}} to avoid a connection abort. It would show up when long-running
services in the cluster hit this codepath. Fixed in the latest patch.
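
To make the corner case concrete, here is a minimal sketch of the close-time decision. It is not the code from the patch: the class name, the method names, the threshold value and the treatment of {{requestedStreamLen}} as the end offset of the ranged request are all assumptions made for illustration.

```java
// Sketch only: names mirror the message above; structure and values are assumptions.
final class StreamCloseSketch {
    // Assumed threshold for illustration; the real constant lives in S3AInputStream.
    static final long CLOSE_THRESHOLD = 4096;

    /** Bytes still unread between the current position and the given stream end. */
    static long remaining(long streamEnd, long pos) {
        return streamEnd - pos;
    }

    /** True when too many bytes remain to drain, so the connection would be aborted. */
    static boolean shouldAbort(long streamEnd, long pos) {
        return remaining(streamEnd, pos) > CLOSE_THRESHOLD;
    }

    public static void main(String[] args) {
        long contentLength = 7_630_589L;      // size of the whole object
        long requestedStreamLen = 4_162_453L; // hypothetical end of the ranged GET
        long pos = 4_162_453L;                // reader has consumed the whole range

        // Comparing against the object size suggests an abort even though
        // nothing is left to read on the open request.
        System.out.println("contentLength: abort=" + shouldAbort(contentLength, pos));
        // Comparing against the requested range end allows a normal close.
        System.out.println("requestedStreamLen: abort=" + shouldAbort(requestedStreamLen, pos));
    }
}
```

With these hypothetical numbers the first check reports an abort and the second does not, which is the difference the corner-case fix is after.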

I also collected stream access profiles for a couple of TPC-DS and TPC-H queries by printing
the stream statistics during close on the cluster where I tested the patch. Those logs are
attached. Please note that this was done with the ORC data format, whose reader reads the footer
first and then starts reading the stripe information.

1. In TPC-DS most of the files are small, so they end up with a single backward seek during
file reading. That is, the reader reads the postscript/footer/metadata as the first operation
and then seeks backwards to read the data portion of the file. Without the patch, the connection
would be aborted, because the difference between the file length and the current position
would be much higher than CLOSE_THRESHOLD.

Example log:
{noformat}
2016-06-15 09:00:31,546 [INFO] [TezChild] |s3a.S3AFileSystem|: S3AInputStream{s3a://xyz/tpcds_bin_partitioned_orc_200.db/store_sales/ss_sold_date_sk=2450967/000456_0
pos=4162453 nextReadPos=4162453 contentLength=7630589 StreamStatistics{OpenOperations=4, CloseOperations=4,
Closed=4, Aborted=0, SeekOperations=3, ReadExceptions=0, ForwardSeekOperations=2, BackwardSeekOperations=1,
BytesSkippedOnSeek=5963, BytesBackwardsOnSeek=7629525, BytesRead=740946, BytesRead excluding
skipped=734983, ReadOperations=91, ReadFullyOperations=0, ReadsIncomplete=85}}
{noformat}

There are also file accesses without any backward seeks, in which the reader reads the standard
16 KB to get the footer details and closes the file without any additional reads.
Example log:
{noformat}
2016-06-15 09:00:28,590 [INFO] [TezChild] |s3a.S3AFileSystem|: S3AInputStream{s3a://xyz/tpcds_bin_partitioned_orc_200.db/store_sales/ss_sold_date_sk=2450993/000213_0
pos=7549954 nextReadPos=7549954 contentLength=7549954 StreamStatistics{OpenOperations=1, CloseOperations=1,
Closed=1, Aborted=0, SeekOperations=0, ReadExceptions=0, ForwardSeekOperations=0, BackwardSeekOperations=0,
BytesSkippedOnSeek=0, BytesBackwardsOnSeek=0, BytesRead=16384, BytesRead excluding skipped=16384,
ReadOperations=1, ReadFullyOperations=0, ReadsIncomplete=0}}
{noformat}

2. The TPC-H dataset contains relatively large files (e.g. each file in the lineitem dataset
is around 1 GB in the overall 1 TB TPC-H dataset). In such cases, an equal number
of forward seeks and backward seeks happen (e.g. around 24 of each per file in the log). The patch
avoids connection aborts on backward seeks.
Example log:
{noformat}
2016-06-15 09:26:26,671 [INFO] [TezChild] |s3a.S3AFileSystem|: S3AInputStream{s3a://xyz/tpch_flat_orc_1000.db/lineitem/000041_0
pos=728756230 nextReadPos=728756230 contentLength=739566852 StreamStatistics{OpenOperations=72,
CloseOperations=72, Closed=72, Aborted=0, SeekOperations=48, ReadExceptions=0, ForwardSeekOperations=24,
BackwardSeekOperations=24, BytesSkippedOnSeek=167662, BytesBackwardsOnSeek=737556392, BytesRead=244894978,
BytesRead excluding skipped=244727316, ReadOperations=28457, ReadFullyOperations=0, ReadsIncomplete=28217}}
{noformat}
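
The statistics lines above can also be checked mechanically. The following is a rough sketch of a helper that pulls the counters out of one such line; the class and method names are hypothetical, and it assumes the line breaks added by the mail archive have been joined back into a single line. In the three logs shown here, "BytesRead excluding skipped" equals BytesRead minus BytesSkippedOnSeek, which the example uses as a sanity check.

```java
import java.util.LinkedHashMap;
import java.util.Map;

// Sketch only: a hypothetical helper, not part of Hadoop.
final class StreamStatsSketch {
    private static final String MARKER = "StreamStatistics{";

    /** Parses the "Name=value" pairs inside StreamStatistics{...} into a map. */
    static Map<String, Long> parse(String logLine) {
        Map<String, Long> stats = new LinkedHashMap<>();
        int start = logLine.indexOf(MARKER);
        if (start < 0) {
            return stats; // no statistics block on this line
        }
        int bodyStart = start + MARKER.length();
        int bodyEnd = logLine.indexOf('}', bodyStart);
        if (bodyEnd < 0) {
            bodyEnd = logLine.length(); // tolerate a truncated line
        }
        for (String pair : logLine.substring(bodyStart, bodyEnd).split(",")) {
            String[] kv = pair.split("=", 2);
            if (kv.length == 2) {
                stats.put(kv[0].trim(), Long.parseLong(kv[1].trim()));
            }
        }
        return stats;
    }

    public static void main(String[] args) {
        // Counters copied from the first TPC-DS log, joined onto one line.
        String line = "S3AInputStream{... StreamStatistics{OpenOperations=4, CloseOperations=4, "
            + "Closed=4, Aborted=0, SeekOperations=3, ReadExceptions=0, ForwardSeekOperations=2, "
            + "BackwardSeekOperations=1, BytesSkippedOnSeek=5963, BytesBackwardsOnSeek=7629525, "
            + "BytesRead=740946, BytesRead excluding skipped=734983, ReadOperations=91, "
            + "ReadFullyOperations=0, ReadsIncomplete=85}}";
        Map<String, Long> s = parse(line);
        long derived = s.get("BytesRead") - s.get("BytesSkippedOnSeek");
        System.out.println("Aborted=" + s.get("Aborted"));
        System.out.println("consistent=" + (derived == s.get("BytesRead excluding skipped")));
    }
}
```

Run over every line in the attached logs, a check like this would make it easy to confirm that the Aborted counter stays at zero with the patch applied.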


> S3a: Consider reducing the number of connection aborts by setting correct length in s3 request
> ----------------------------------------------------------------------------------------------
>
>                 Key: HADOOP-13203
>                 URL: https://issues.apache.org/jira/browse/HADOOP-13203
>             Project: Hadoop Common
>          Issue Type: Sub-task
>          Components: fs/s3
>            Reporter: Rajesh Balamohan
>            Assignee: Rajesh Balamohan
>            Priority: Minor
>         Attachments: HADOOP-13203-branch-2-001.patch, HADOOP-13203-branch-2-002.patch, HADOOP-13203-branch-2-003.patch, HADOOP-13203-branch-2-004.patch, stream_stats.tar.gz
>
>
> Currently the file's "contentLength" is set as the "requestedStreamLen" when invoking S3AInputStream::reopen().
> As a part of lazySeek(), the stream sometimes has to be closed and reopened. But often
> the stream is closed with abort(), leaving the internal HTTP connection unusable.
> This incurs a lot of connection establishment cost in some jobs. It would be good to set the
> correct value for the stream length to avoid connection aborts.
> I will post the patch once the AWS tests pass on my machine.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

---------------------------------------------------------------------
To unsubscribe, e-mail: common-issues-unsubscribe@hadoop.apache.org
For additional commands, e-mail: common-issues-help@hadoop.apache.org

