From: "Owen O'Malley"
Date: Tue, 27 Mar 2018 13:32:24 -0700
Subject: Re: ORC double encoding optimization proposal
To: user@orc.apache.org
Cc: "dev@orc.apache.org"

Going back to the point of double split encoding, it would make sense to try a variant where we combine the sign and the mantissa. That should remove the sign stream at the relatively small cost of making the mantissa stream signed.
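
To make the idea concrete, here is a minimal Python sketch of one way the sign could be folded into the mantissa. It is an illustration under assumptions, not the ORC format or an existing ORC API: the names `split_double` and `join_double` are hypothetical, and using a bitwise complement for negative values is just one possible choice. Plain negation would not work, because 1.0 and -1.0 both have a zero mantissa field and the sign would be lost.

```python
import struct
from typing import Tuple

MANTISSA_BITS = 52
MANTISSA_MASK = (1 << MANTISSA_BITS) - 1
EXPONENT_MASK = 0x7FF


def split_double(value: float) -> Tuple[int, int]:
    """Split an IEEE 754 double into (exponent, signed_mantissa).

    Hypothetical helper: the sign bit is folded into the mantissa by
    storing the bitwise complement of the mantissa for negative values.
    """
    bits = struct.unpack(">Q", struct.pack(">d", value))[0]
    sign = bits >> 63
    exponent = (bits >> MANTISSA_BITS) & EXPONENT_MASK
    mantissa = bits & MANTISSA_MASK
    # ~m == -m - 1, so negative doubles map to strictly negative integers
    # and a zero mantissa keeps its sign (0 vs. -1).
    signed_mantissa = ~mantissa if sign else mantissa
    return exponent, signed_mantissa


def join_double(exponent: int, signed_mantissa: int) -> float:
    """Inverse of split_double: rebuild the original double bit for bit."""
    sign = 1 if signed_mantissa < 0 else 0
    mantissa = ~signed_mantissa if sign else signed_mantissa
    bits = (sign << 63) | (exponent << MANTISSA_BITS) | mantissa
    return struct.unpack(">d", struct.pack(">Q", bits))[0]
```

With this mapping the exponent stream stays unsigned and the mantissa stream carries values between -2^52 and 2^52 - 1, so no separate sign stream is needed.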

Thinking more about the layout options...

Another consideration is that we'd be better off not splitting the compression chunks between ranges and yet I'm worried about the overhead of closing all of the compression chunks and rle runs early.

So we could modify my #2 proposal to be sensitive to rle and compression chunks: at the end of the row group, we wait until the rle and compression chunks close and then interleave the streams. That means that for a column with three streams and two row groups, we could have something like:

stream1.1, stream2.1, stream3.1, stream1.2, stream2.2, stream3.2

Stream x.y contains a whole number of compression chunks, and the majority of the data for row group y is in the streams *.y. This significantly improves the current state of affairs because now we know that if we read the streams *.1, we'll have the entire first row group and can start decompression and processing while we read the other "stripelets". By not forcing the closure of the rle and compression, we have preserved the compression and yet gained the ability to have async io in the reader.
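
The interleaving can be sketched as follows. This is a simplified model, not ORC writer code: `plan_layout` and its inputs are hypothetical, each compression chunk is described only by the number of rows it covers, and a chunk is assigned to the row group in which it starts, so no chunk is ever closed early.

```python
from typing import Dict, List, Tuple


def plan_layout(
    chunk_rows: Dict[str, List[int]], rows_per_group: int
) -> List[Tuple[str, List[int]]]:
    """Return the file order as [(section label, chunk indices)].

    chunk_rows maps a stream name to the row counts of its compression
    chunks (an assumed, simplified description of a stream).
    """
    sections: Dict[str, Dict[int, List[int]]] = {}
    num_groups = 0
    for name, sizes in chunk_rows.items():
        by_group: Dict[int, List[int]] = {}
        first_row = 0
        for index, rows in enumerate(sizes):
            # A chunk belongs to the row group that contains its first row,
            # even if it runs past the end of that row group.
            by_group.setdefault(first_row // rows_per_group, []).append(index)
            first_row += rows
        sections[name] = by_group
        # Ceiling division: number of row groups covered by this stream.
        num_groups = max(num_groups, -(-first_row // rows_per_group))

    layout = []
    for group in range(num_groups):
        for name in chunk_rows:
            label = f"{name}.{group + 1}"
            layout.append((label, sections[name].get(group, [])))
    return layout


# Three streams, two row groups of 10 rows each (made-up chunk sizes).
example = plan_layout(
    {"stream1": [6, 6, 8], "stream2": [20], "stream3": [4, 4, 4, 4, 4]},
    rows_per_group=10,
)
for label, chunks in example:
    print(label, chunks)
```

In the example, the second chunk of `stream1` straddles the row group boundary and stays in `stream1.1`, so reading the `*.1` sections yields the whole first row group, and only the tail of that chunk holds rows of the second row group.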

.. Owen


On Sun, Mar 25, 2018 at 11:47 PM, Gopal Vijayaraghavan <gopalv@apache.org> wrote:

> > 2. Under seek or predicate pushdown scenario, there’s no need to load the entire stream.
>
> Yes, that is a valid scenario where the reader reads partial-streams & causes random IO.
>
> The current double encoding is actually 2 streams today & will continue to use 2 streams for the FLIP implementation.
>
> The SPLIT implementation will go from the current 2 streams to 4 streams (i.e 1+1->1+3 streams) & the total data IO will drop by ~2x or so. More so if one of the streams can be suppressed (like in my IoT data-set, where the sign-bit is always +ve for my electric meter data).
>
> The trade-offs seem to be working out on regular HDDs with locality & for LLAP SSD caches - if your use-cases are different, I'd like to hear more about it.
>
> The only significant random IO delays expected seem to be entirely within the HDFS API network hops (which offers 0% locality when data is erasure coded or for cloud-storage), which I hope to fix in the Hadoop-3.x branch with a new API.
>
> Cheers,
> Gopal


