hadoop-common-issues mailing list archives

From "ASF GitHub Bot (JIRA)" <j...@apache.org>
Subject [jira] [Commented] (HADOOP-14965) s3a input stream "normal" fadvise mode to be adaptive
Date Mon, 23 Oct 2017 20:35:00 GMT

    [ https://issues.apache.org/jira/browse/HADOOP-14965?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16215788#comment-16215788

ASF GitHub Bot commented on HADOOP-14965:

GitHub user steveloughran opened a pull request:


    HADOOP-14965 s3a input stream "normal" fadvise mode to be adaptive

    This makes the {{S3AInputStream.inputPolicy}} field non-final and, on the first backward seek
on a Normal input, switches it to Random (logging at info level in the process). If a seek is forward,
it just skips forward, as sequential input does.
    The input stream instrumentation counts the number of times the policy was changed (including
the first) and records the current value, where it is picked up in tests (so there's no need to add
a test accessor to the input stream itself).
    The test {{ITestS3AInputStreamPerformance.testRandomIONormalPolicy}} broke because the instrumentation
showed only 1 TCP abort, not 4. This is a success, as it shows the policy is adapting.
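
    As a rough illustration of the behaviour described above, here is a minimal, self-contained
sketch. It is not the S3AInputStream code from the patch: the class AdaptiveSeekSketch, its
counters, and the seek offsets are all hypothetical, and it assumes a simplified cost model in
which every backward seek on a non-random stream aborts one connection and random reads never
abort.

```java
/** Hypothetical sketch of the adaptive "normal" policy; not the real S3AInputStream. */
class AdaptiveSeekSketch {

    /** Stand-ins for the three fadvise modes named in the issue. */
    enum InputPolicy { NORMAL, SEQUENTIAL, RANDOM }

    private InputPolicy policy;
    private long pos;
    private int policyChanges; // every time the policy is set, including the first
    private int aborts;        // simulated aborted HTTP connections

    AdaptiveSeekSketch(InputPolicy initial) {
        setPolicy(initial);
    }

    private void setPolicy(InputPolicy newPolicy) {
        policy = newPolicy;
        policyChanges++;
    }

    void seek(long target) {
        if (target < 0) {
            throw new IllegalArgumentException("negative seek: " + target);
        }
        if (target < pos) {
            // Assumption: a backward seek on a sequential-style stream
            // aborts the open connection; random reads do not.
            if (policy != InputPolicy.RANDOM) {
                aborts++;
            }
            // The adaptive step: only NORMAL switches; SEQUENTIAL stays as it is.
            if (policy == InputPolicy.NORMAL) {
                setPolicy(InputPolicy.RANDOM);
            }
        }
        // Forward seeks are modelled as a plain skip in every mode.
        pos = target;
    }

    InputPolicy getPolicy() { return policy; }
    int getPolicyChanges() { return policyChanges; }
    int getAborts() { return aborts; }

    /** Replays a made-up trace: one jump towards the end, then four backward seeks. */
    static AdaptiveSeekSketch replayTrace(InputPolicy initial) {
        AdaptiveSeekSketch stream = new AdaptiveSeekSketch(initial);
        long[] offsets = {9_000, 1_000, 3_000, 2_000, 5_000, 4_000, 7_000, 6_000};
        for (long offset : offsets) {
            stream.seek(offset);
        }
        return stream;
    }

    public static void main(String[] args) {
        for (InputPolicy p : InputPolicy.values()) {
            AdaptiveSeekSketch s = replayTrace(p);
            System.out.println(p + ": aborts=" + s.getAborts()
                + ", policy changes=" + s.getPolicyChanges()
                + ", final policy=" + s.getPolicy());
        }
    }
}
```

    Under these assumptions, the NORMAL stream pays for one abort on the first backward seek and
then behaves as RANDOM, while the SEQUENTIAL stream pays on each of the four backward seeks.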

You can merge this pull request into a Git repository by running:

    $ git pull https://github.com/steveloughran/hadoop s3/HADOOP-14965-adaptive-seek

Alternatively you can review and apply these changes as the patch at:


To close this pull request, make a commit to your master/trunk branch
with (at least) the following in the commit message:

    This closes #283

commit c581840a800dd22372c4a2b78c3ce5c2da2fd3fe
Author: Steve Loughran <stevel@apache.org>
Date:   2017-10-23T20:27:00Z

    HADOOP-14965 patch 001: the "normal" input policy switches from sequential to random IO
    Change-Id: I95459f063b5da973619334bacae7fd89953e1bec


> s3a input stream "normal" fadvise mode to be adaptive
> -----------------------------------------------------
>                 Key: HADOOP-14965
>                 URL: https://issues.apache.org/jira/browse/HADOOP-14965
>             Project: Hadoop Common
>          Issue Type: Sub-task
>          Components: fs/s3
>            Reporter: Steve Loughran
>            Assignee: Steve Loughran
> HADOOP-14535 added seek optimisation to wasb, but rather than require the caller to declare
sequential vs random, it works it out for itself:
> # defaults to sequential, lazy seek
> # if the caller ever seeks backwards, switches to random IO.
> This means that the use pattern of columnar stores (go to end of file, read summary,
then go to columns and work forwards) will switch to random IO after that first seek back
(cost: one aborted HTTP connection).
> Where this should benefit the most is in downstream apps where you are working with different
data sources in the same object store, running off the same app config, but with different read
patterns. I'm seeing exactly this in some of my Spark tests, where it's near impossible to
set things up so that .gz files are read sequentially but ORC data is read with random IO.
> I propose the "normal" fadvise => adaptive, sequential => sequential always, random
=> random from the outset.

This message was sent by Atlassian JIRA

To unsubscribe, e-mail: common-issues-unsubscribe@hadoop.apache.org
For additional commands, e-mail: common-issues-help@hadoop.apache.org
