From user-return-213-archive-asf-public=cust-asf.ponee.io@orc.apache.org  Tue Mar 27 22:32:31 2018
MIME-Version: 1.0
In-Reply-To: <2BE544BB-09A7-4323-9895-02F40C6FDFC6@hortonworks.com>
References: <17B91B6B0D9BBC44A1682DABC201C53552055763@SHSMSX104.ccr.corp.intel.com>
 <D220CF55-A229-4A61-AFD3-A799E3997E90@hortonworks.com> <CY1PR05MB24282204DAEF8BEFBFCB1C068DAD0@CY1PR05MB2428.namprd05.prod.outlook.com>
 <A434E2C3-5386-4A4A-A17B-F0EE979047E5@hortonworks.com> <CY1PR05MB2428F2798068321CE7B956088DAD0@CY1PR05MB2428.namprd05.prod.outlook.com>
 <2BE544BB-09A7-4323-9895-02F40C6FDFC6@hortonworks.com>
From: "Owen O'Malley" <owen.omalley@gmail.com>
Date: Tue, 27 Mar 2018 13:32:24 -0700
Message-ID: <CAHfHakF-fp_nutkU5pSsG7L8c-SvGwcs5pXLaRv-KLqkjWxF_w@mail.gmail.com>
Subject: Re: ORC double encoding optimization proposal
To: user@orc.apache.org
Cc: "dev@orc.apache.org" <dev@orc.apache.org>
Content-Type: multipart/alternative; boundary="000000000000d0072f05686ac9c4"

--000000000000d0072f05686ac9c4
Content-Type: text/plain; charset="UTF-8"
Content-Transfer-Encoding: quoted-printable

Going back to the point of double split encoding, it would make sense to
try a variant where we combine the sign and the mantissa. That should
remove the sign stream at the relatively small cost of making the mantissa
stream signed.
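
Here is a minimal sketch of that variant in Python, just to make the idea
concrete. It is an illustration only, not ORC code: split_double and
join_double are hypothetical names, and the way the sign is folded in
(storing ~mantissa for negative values, so that a zero mantissa as in -1.0
keeps its sign) is an assumption; the handling of that case has not been
settled in this thread.

```python
import struct

MANTISSA_BITS = 52
MANTISSA_MASK = (1 << MANTISSA_BITS) - 1
EXPONENT_MASK = 0x7FF


def split_double(value):
    """Return (exponent, signed_mantissa) for one IEEE 754 double."""
    bits = struct.unpack(">Q", struct.pack(">d", value))[0]
    sign = bits >> 63
    exponent = (bits >> MANTISSA_BITS) & EXPONENT_MASK
    mantissa = bits & MANTISSA_MASK
    # Assumption: fold the sign in as ~m (== -m - 1), so a negative value
    # with a zero mantissa (e.g. -1.0) stays distinct from its positive twin.
    signed_mantissa = ~mantissa if sign else mantissa
    return exponent, signed_mantissa


def join_double(exponent, signed_mantissa):
    """Inverse of split_double."""
    sign = 1 if signed_mantissa < 0 else 0
    mantissa = ~signed_mantissa if sign else signed_mantissa
    bits = (sign << 63) | (exponent << MANTISSA_BITS) | mantissa
    return struct.unpack(">d", struct.pack(">Q", bits))[0]
```

With this mapping, 1.0 becomes (1023, 0) and -1.0 becomes (1023, -1), so
only the exponent stream and a signed mantissa stream remain.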

Thinking more about the layout options...

Another consideration is that we'd be better off not splitting the
compression chunks between ranges, and yet I'm worried about the overhead
of closing all of the compression chunks and RLE runs early.

So we could modify my #2 proposal to be sensitive to RLE and compression
chunks. At the end of each row group, we would wait until the RLE runs and
compression chunks close and then interleave the streams. That means that
for a column with three streams and two row groups, we could write
something like:

stream1.1, stream2.1, stream3.1, stream1.2, stream2.2, stream3.2

Stream x.y contains a whole number of compression chunks, and the majority
of the data for row group y is in the streams *.y. This significantly
improves on the current state of affairs, because now we know that once we
have read streams *.1, we'll have the entire first row group and can start
decompression and processing while we read the other "stripelets".
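
To make the layout rule concrete, here is a rough Python sketch. It is a
toy model, not ORC writer code: stripelet_layout, the stream names, and the
chunk boundaries in the example are all made up for illustration. Each
stream is described only by the row at which each of its compression chunks
closes, and a stripelet takes chunks until one closes at or past the end of
the row group.

```python
def stripelet_layout(chunk_end_rows, row_group_ends):
    """Return [(label, chunk_indices)] in the order written to the file.

    chunk_end_rows: {stream_name: [row count covered once each chunk closes]}
    row_group_ends: [row count covered once each row group ends]
    """
    next_chunk = {name: 0 for name in chunk_end_rows}
    layout = []
    for group_no, group_end in enumerate(row_group_ends, start=1):
        for name, ends in chunk_end_rows.items():
            first = next_chunk[name]
            # Rows already covered by chunks placed in earlier stripelets.
            covered = ends[first - 1] if first > 0 else 0
            last = first
            # Never cut a chunk short: keep taking whole chunks until the
            # row group is fully covered.
            while last < len(ends) and covered < group_end:
                covered = ends[last]
                last += 1
            next_chunk[name] = last
            layout.append((f"{name}.{group_no}", list(range(first, last))))
    return layout
```

For example, with row groups ending at rows 10000 and 20000 and stream1
closing its chunks at rows 6000, 12000, and 20000, stream1.1 would hold
chunks 0 and 1 (so it also carries the first 2000 rows of row group 2) and
stream1.2 would hold chunk 2.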

By not forcing the closure of the RLE runs and compression chunks, we
preserve the compression ratio and yet gain the ability to do async IO in
the reader.
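
As a rough illustration of the reader side, the Python sketch below
overlaps the read of the next stripelet with the decoding of the current
one. The fetch and decode callables are hypothetical placeholders supplied
by the caller; this is not the ORC reader API.

```python
from concurrent.futures import ThreadPoolExecutor


def read_pipelined(stripelet_ids, fetch, decode):
    """Decode stripelet i while the fetch for stripelet i+1 is in flight."""
    results = []
    if not stripelet_ids:
        return results
    with ThreadPoolExecutor(max_workers=1) as pool:
        pending = pool.submit(fetch, stripelet_ids[0])
        for next_id in stripelet_ids[1:]:
            data = pending.result()
            pending = pool.submit(fetch, next_id)  # start the next read
            results.append(decode(data))           # overlaps with that read
        results.append(decode(pending.result()))
    return results
```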

.. Owen


On Sun, Mar 25, 2018 at 11:47 PM, Gopal Vijayaraghavan <gopalv@apache.org>
wrote:

>
> >    2. Under seek or predicate pushdown scenario, there's no need to
> >    load the entire stream.
>
> Yes, that is a valid scenario where the reader reads partial-streams &
> causes random IO.
>
> The current double encoding is actually 2 streams today & will continue
> to
> use 2 streams for the FLIP implementation.
>
> The SPLIT implementation will go from the current 2 streams to 4 streams
> (i.e. 1+1->1+3 streams) & the total data IO will drop by ~2x or so. More
> so if one of the streams can be suppressed (like in my IoT data-set,
> where the sign-bit is always +ve for my electric meter data).
>
> The trade-offs seem to be working out on regular HDDs with locality & for
> LLAP SSD caches - if your use-cases are different, I'd like to hear more
> about it.
>
> The only significant random IO delays expected seem to be entirely within
> the HDFS API network hops (which offers 0% locality when data is erasure
> coded or for cloud-storage), which I hope to fix in the Hadoop-3.x branch
> with a new API.
>
> Cheers,
> Gopal
>
>
>


--000000000000d0072f05686ac9c4--