Date: Mon, 23 Oct 2017 13:23:00 +0000 (UTC)
From: "Steve Loughran (JIRA)"
To: common-issues@hadoop.apache.org
Subject: [jira] [Commented] (HADOOP-14965) s3a input stream "normal" fadvise mode to be adaptive

    [ https://issues.apache.org/jira/browse/HADOOP-14965?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16215123#comment-16215123 ]

Steve Loughran commented on HADOOP-14965:
-----------------------------------------

It won't be very adaptive. What I'm thinking: sequential & random do as expected. "normal" will behave as sequential until the first backwards seek, at which point it switches to random IO, on the basis that the caller is clearly not doing sequential access. Once in random mode, the stream stays there.

I'm also thinking we need to make seek policies something which can be passed all the way down from, say, Spark queries to the FS client. That is not impossible, given that there is already the ability to set options on a read and have them passed down to the reader factory. Those options just need to (somehow) be passed down to the FS.

Presumably we'd need to add an FSDataInputStreamBuilder alongside the new output stream builder, define a standard set of key values for seek policy (hints), and implement them in those store clients where the cost of a seek is high (the object stores). This is not on my immediate TODO list, though.

> s3a input stream "normal" fadvise mode to be adaptive
> -----------------------------------------------------
>
>                 Key: HADOOP-14965
>                 URL: https://issues.apache.org/jira/browse/HADOOP-14965
>             Project: Hadoop Common
>          Issue Type: Sub-task
>          Components: fs/s3
>            Reporter: Steve Loughran
>
> HADOOP-14535 added seek optimisation to wasb, but rather than requiring the caller to declare sequential vs random, it works it out for itself:
> # defaults to sequential, lazy seek
> # if the caller ever seeks backwards, switches to random IO.
> This means that the use pattern of columnar stores (go to the end of the file, read the summary, then go to the columns and work forwards) will switch to random IO after that first seek back (cost: one aborted HTTP connection).
> Where this should benefit the most is in downstream apps where you are working with different data sources in the same object store, running off the same app config, but with different read patterns. I'm seeing exactly this in some of my Spark tests, where it's near impossible to set things up so that .gz files are read sequentially but ORC data is read with random IO.
> I propose "normal" fadvise => adaptive, sequential => sequential always, random => random from the outset.
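
The behaviour proposed for "normal" can be sketched as a small state machine. What follows is a minimal illustration in Java, not the S3A implementation: the class AdaptiveSeekPolicy, its Mode enum, and the methods seek, advance and isRandomIO are hypothetical names invented for this sketch, and it assumes the policy needs nothing more than the current position and the seek target to make its decision.

```java
/**
 * Hypothetical sketch of the proposed fadvise behaviour.
 * None of these names come from the Hadoop codebase.
 */
final class AdaptiveSeekPolicy {

    /** Policy requested by the caller; mirrors the three modes in the issue. */
    enum Mode { SEQUENTIAL, NORMAL, RANDOM }

    private final Mode requested;
    private boolean randomIO; // true once the stream is doing random IO
    private long pos;         // current position in the stream

    AdaptiveSeekPolicy(Mode requested) {
        this.requested = requested;
        // RANDOM is random from the outset; the other modes start sequential.
        this.randomIO = (requested == Mode.RANDOM);
    }

    /**
     * Record a seek to targetPos.
     * Returns true only if this seek caused the switch to random IO.
     */
    boolean seek(long targetPos) {
        if (targetPos < 0) {
            throw new IllegalArgumentException("negative seek: " + targetPos);
        }
        boolean switched = false;
        // Only NORMAL adapts, and only on the first backwards seek.
        if (requested == Mode.NORMAL && !randomIO && targetPos < pos) {
            randomIO = true; // one-way transition: never back to sequential
            switched = true;
        }
        pos = targetPos;
        return switched;
    }

    /** Record that bytes were read, moving the position forwards. */
    void advance(long bytes) {
        pos += bytes;
    }

    boolean isRandomIO() {
        return randomIO;
    }

    public static void main(String[] args) {
        // Columnar pattern from the issue: footer first, then the columns.
        AdaptiveSeekPolicy policy = new AdaptiveSeekPolicy(Mode.NORMAL);
        policy.seek(1_000_000);               // forward seek to the footer
        policy.advance(4096);                 // read the summary
        boolean switched = policy.seek(0);    // first seek back
        System.out.println("switched=" + switched
            + " randomIO=" + policy.isRandomIO());
    }
}
```

With Mode.NORMAL, the forward seek to the footer leaves the stream sequential, and the first seek back is the only point where the mode changes. Mode.SEQUENTIAL and Mode.RANDOM never change, matching "sequential always" and "random from the outset".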