Subject: Re: ORC double encoding optimization proposal
From: Dain Sundstrom
Date: Mon, 26 Mar 2018 22:39:08 -0700
To: dev@orc.apache.org
Cc: "user@orc.apache.org"
Message-Id: <36552536-FF2A-4731-ABB6-92C10C37F5A5@iq80.com>
References: <17B91B6B0D9BBC44A1682DABC201C53552055763@SHSMSX104.ccr.corp.intel.com>
 <17B91B6B0D9BBC44A1682DABC201C5355205EF6D@SHSMSX104.ccr.corp.intel.com>
 <8E7B3A58-2FFD-4336-A6FA-B79C3E3E851D@hortonworks.com>
 <6D664A64-6755-40B8-A389-E3D1BD876EB6@iq80.com>

On Mar 26, 2018, at 8:19 PM, Xiening Dai wrote:
>
> But that approach still doesn’t help when one column has multiple
> large streams. Let’s say we have two streams and each one is 50M in
> size. With current reader implementation, we read 4M chunk every time
> from each stream, and requires a seek since the chunks are 50M apart.
> Alternatively we can read both streams with sequential IO, but we
> would end up holding the 100M compressed data in memory, which is not
> an effective use of reader memory. Note that this problem exists even
> without predicate pushdown.

I recently tuned the IO strategy in our implementations, and when you
work out the math the performance advantage of large IOs falls off very
quickly once you get to a couple of megabytes. This is because the
transfer time starts to dominate over the seeks, so we also put a max
size on read sizes to keep buffer memory lower.

For us two sequential large streams take twice the buffer memory, but
the IO cost is effectively the same.
Where we would run into problems is with small streams/columns between
large columns, since there is no potential for shared reads.

-dain
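
The arithmetic behind the point about large IOs can be sketched roughly as follows. This is only an illustration of the reasoning, not the tuning that was actually done: the seek time and bandwidth are assumed round numbers that do not come from this thread, and the function names are hypothetical. The model charges one seek per read plus the transfer time for the bytes read.

```python
# Illustrative assumptions only; neither number comes from the thread.
SEEK_SECONDS = 0.010                         # assumed cost of one seek
BANDWIDTH_BYTES_PER_SEC = 100 * 1024 * 1024  # assumed sequential throughput

MIB = 1024 * 1024


def read_seconds(read_bytes: int) -> float:
    """Time for one read: one seek plus the transfer."""
    return SEEK_SECONDS + read_bytes / BANDWIDTH_BYTES_PER_SEC


def transfer_fraction(read_bytes: int) -> float:
    """Share of one read's time spent transferring data rather than seeking."""
    transfer = read_bytes / BANDWIDTH_BYTES_PER_SEC
    return transfer / (SEEK_SECONDS + transfer)


def stream_seconds(stream_bytes: int, read_bytes: int) -> float:
    """Time to read a whole stream in fixed-size reads, one seek per read."""
    full_reads, remainder = divmod(stream_bytes, read_bytes)
    total = full_reads * read_seconds(read_bytes)
    if remainder:
        # The last, shorter read still pays for a seek.
        total += read_seconds(remainder)
    return total


if __name__ == "__main__":
    for size_mib in (0.25, 1, 2, 4, 8, 16, 50):
        size = int(size_mib * MIB)
        print(f"{size_mib:>5} MiB reads: "
              f"{transfer_fraction(size):.0%} of time is transfer, "
              f"50 MiB stream takes {stream_seconds(50 * MIB, size):.3f}s")
```

Under these assumed numbers, going from 1 MiB to 4 MiB reads raises the transfer share from 50% to 80%, while going from 4 MiB to 16 MiB only raises it to about 94%. Reading a 50 MiB stream in 4 MiB reads takes about 0.63s against 0.51s for a single 50 MiB read, which is why capping the read size costs little IO time while keeping buffer memory low.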