Date: Sun, 25 Mar 2018 21:59:06 -0700
Subject: Re: ORC double encoding optimization proposal
From: Gopal Vijayaraghavan
To: Xiening Dai, "dev@orc.apache.org", "user@orc.apache.org"

Hi,

> Since Split creates two separated streams, reading one data batch will need an additional seek in order to reconstruct the column data

If you are seeing a seek like that, we've messed up something else higher up in the pipeline & that can be fixed.

ORC columnar reads only do random IO at the column level, not the stream level (except for non-column streams like the bloom filters) - adjacent streams are read together as a single IO op. DiskRangeList produces a merged read plan before firing off any read, so the actual IO layer will (or should) never issue a seek between adjacent streams.

There's a possibility that someone will add an extra byte or something to a stream which they never read, which might be a problem.

In early 2016 Rajesh & I went through each read IOP and tuned ORC for S3, which performs very poorly if you add irrelevant seeks.

If you do find a similar case in Apache ORC (not Hive-orc), I'll file a corresponding ticket to this

https://issues.apache.org/jira/browse/HIVE-13161

That was actually about reading 2 columns with an entirely NULL column in the middle, not exactly about splitting streams.

The next giant leap of IO performance for seeks is expected from a new HDFS API, which allows for the scatter-gather to be pushed down further into the IO layer.

https://issues.apache.org/jira/browse/HADOOP-11867

This is mainly intended for reading ORC files from Erasure coded streams, where the IO layer can reorganize and align the reads along the Erasure Coding boundaries (not so much about actual IOPs), instead of assuming normal read-ahead for the block reader.

Cheers,
Gopal
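
P.S. A rough sketch of the merged read plan idea, in case it helps. This is not the actual DiskRangeList code; the function name, the (start, end) tuples and the offsets are all made up for illustration. It only shows why adjacent streams collapse into one read, and why a single unread byte between them would split the plan.

```python
from typing import List, Tuple

# Hypothetical representation: (start, end) byte offsets, end exclusive.
Range = Tuple[int, int]


def merge_ranges(ranges: List[Range], max_gap: int = 0) -> List[Range]:
    """Coalesce byte ranges whose gap is at most max_gap bytes."""
    merged: List[Range] = []
    for start, end in sorted(ranges):
        if merged and start - merged[-1][1] <= max_gap:
            # Touches or overlaps the previous range: extend it.
            prev_start, prev_end = merged[-1]
            merged[-1] = (prev_start, max(prev_end, end))
        else:
            # Gap is too large: this range becomes a separate read.
            merged.append((start, end))
    return merged


# Two adjacent streams of one column, then a column further into the file.
adjacent = [(1000, 1400), (1400, 1900), (5000, 5200)]
print(merge_ranges(adjacent))  # [(1000, 1900), (5000, 5200)]

# Same layout, but with one unread byte between the two streams.
gapped = [(1000, 1400), (1401, 1900), (5000, 5200)]
print(merge_ranges(gapped))  # [(1000, 1400), (1401, 1900), (5000, 5200)]
```

In the first case the two streams of the split column become a single read. In the second, the stray byte leaves them as two reads unless the planner tolerates a small gap (max_gap here, which is my own assumption and not an ORC setting).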