From: "Owen O'Malley"
Date: Mon, 26 Mar 2018 16:23:10 -0700
Subject: Re: ORC double encoding optimization proposal
To: dev@orc.apache.org
Cc: user@orc.apache.org

This is a really interesting conversation. Of course, the original use case for ORC was that you were never reading less than a stripe. So putting all of the data streams for a column back to back, which isn't in the spec but should be, was optimal in terms of seeks.

There are two cases that violate this assumption:
* you are using predicate push down and thus only need to read a few row groups (see the sketch after this list).
* you are extending the reader to interleave the compression and I/O.
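
To put a rough number on the first case, here is a minimal sketch (not ORC code; the stream names, offsets, and helper are made up for illustration) of what a predicate-pushdown read of one row group costs when a column's streams sit back to back for the whole stripe:

import java.util.ArrayList;
import java.util.List;

// Illustrative only, not ORC code: with a column's streams laid out back to
// back for the whole stripe, reading a single row group for predicate
// pushdown still needs one ranged read (one seek) per stream.
public class RowGroupSeekSketch {

  // A byte range [offset, offset + length) within the stripe.
  record Range(long offset, long length) {}

  // streamStart[s] = where stream s begins (streams back to back);
  // rgOffset[s][g] = offset of row group g inside stream s (what a row index records);
  // rgLength[s][g] = bytes of row group g inside stream s.
  static List<Range> rangesForRowGroup(long[] streamStart, long[][] rgOffset,
                                       long[][] rgLength, int g) {
    List<Range> ranges = new ArrayList<>();
    for (int s = 0; s < streamStart.length; s++) {
      ranges.add(new Range(streamStart[s] + rgOffset[s][g], rgLength[s][g]));
    }
    return ranges;  // one discontiguous range per stream
  }

  public static void main(String[] args) {
    // A column with 3 streams (say PRESENT, DATA, LENGTH), two row groups each.
    long[] streamStart = {0, 4_000, 200_000};
    long[][] rgOffset  = {{0, 2_000}, {0, 100_000}, {0, 20_000}};
    long[][] rgLength  = {{2_000, 2_000}, {100_000, 96_000}, {20_000, 18_000}};
    // Reading only row group 1 touches 3 widely separated byte ranges: 3 seeks.
    System.out.println(rangesForRowGroup(streamStart, rgOffset, rgLength, 1));
  }
}

Every extra stream is an extra seek per row group, which is what either of the layouts below is trying to avoid.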

So a couple of layouts come to mind:

* Finish the compression chunks at the row group (10k rows) and interleave the streams for the column for each row group.
  This would help with both predicate pushdown and the async I/O reader.
  We would lose some compression by closing the compression chunks early and have additional overhead to track the sizes for the row group.
  On the plus side, we could simplify the indexes because the compression chunks would always align with row groups.

* Divide each 256k (larger?) block among the streams in proportion to their sizes. Thus if the column has 3 streams and they were 50%, 30%, and 20%, we would take that much data from each 256k.
  This wouldn't reduce the compression or require any additional metadata, since the reader could determine the number of bytes of each stream per "page".
  This wouldn't help very much for PPD, but would help for the async I/O reader.
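
To make the second layout concrete, here is a minimal sketch of the per-page arithmetic (not ORC code; the rounding rule and all names are assumptions). Since the stripe footer already records the total size of every stream, writer and reader can both derive each stream's share of a fixed-size page:

// Illustrative only, not ORC code: each fixed-size "page" carries a slice of
// every stream in proportion to the streams' total sizes for the stripe, so
// the reader can recompute the split without extra metadata. The rounding
// rule (floor, remainder to the last stream) is an assumption made so the
// example is deterministic.
public class ProportionalPageSketch {

  static long[] bytesPerStreamInPage(long[] streamTotals, long pageSize) {
    long stripeTotal = 0;
    for (long total : streamTotals) {
      stripeTotal += total;
    }
    long[] slice = new long[streamTotals.length];
    long used = 0;
    for (int i = 0; i < streamTotals.length - 1; i++) {
      slice[i] = pageSize * streamTotals[i] / stripeTotal;  // floor of the proportional share
      used += slice[i];
    }
    slice[streamTotals.length - 1] = pageSize - used;       // remainder goes to the last stream
    return slice;
  }

  public static void main(String[] args) {
    // Owen's example: 3 streams holding 50%, 30%, and 20% of the column's bytes.
    long[] totals = {5_000_000, 3_000_000, 2_000_000};
    long[] slice = bytesPerStreamInPage(totals, 256 * 1024);
    // Prints 131072 + 78643 + 52429 = 262144, i.e. every 256k page splits 50/30/20.
    System.out.printf("%d + %d + %d = %d%n",
        slice[0], slice[1], slice[2], slice[0] + slice[1] + slice[2]);
  }
}

On the read side the same arithmetic tells an async I/O reader where each stream's bytes sit inside every page, which is why no additional metadata is needed.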

So which use case matters the most? What other layouts would be interesting?
.. Owen

On Mon, Mar 26, 2018 at 12:33 PM, Gopal Vijayaraghavan <gopalv@apache.org> wrote:

> the bad thing is that we still have TWO encodings to discuss.

Two is exactly what we need, not five - from the existing ORC configs

hive.exec.orc.encoding.strategy=[SPEED, COMPRESSION];
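
A minimal sketch of flipping that knob, assuming the setting travels on a plain Hadoop Configuration (the property name is the one above; everything else is illustrative):

import org.apache.hadoop.conf.Configuration;

// Illustrative only: the property name is the one quoted above; how a writer
// maps its two values onto the two double encodings is exactly the proposal.
public class EncodingStrategySketch {
  public static void main(String[] args) {
    Configuration conf = new Configuration();
    conf.set("hive.exec.orc.encoding.strategy", "COMPRESSION");
    System.out.println(conf.get("hive.exec.orc.encoding.strategy", "SPEED"));
  }
}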

FLIP8 was my original suggestion to Teddy from the byteuniq UDF runs, though the regressions in compression over the PlainV2 are still bothering me (which is why I went digging into the Zlib dictionary builder impl with infgen).

All comparisons below are for Size & against PlainV2

For Zlib, this is pretty bad for FLIP.

ZLIB:HIGGS Regressing on FLIP by 6 points
ZLIB:DISCOUNT_AMT Regressing on FLIP by 10 points
ZLIB:IOT_METER Regressing on FLIP by 32 points
ZLIB:LIST_PRICE Regressing on FLIP by 36 points
ZLIB:PHONE Regressing on FLIP by 50 points

SPLIT has no size regressions.

With ZSTD, SPLIT has a couple of regressions in size

ZSTD:DISCOUNT_AMT Regressing on FLIP by 5 points
ZSTD:IOT_METER Regressing on FLIP by 17 points
ZSTD:HIGGS Regressing on FLIP by 18 points
ZSTD:LIST_PRICE Regressing on FLIP by 30 points
ZSTD:PHONE Regressing on FLIP by 55 points

ZSTD:HIGGS Regressing on SPLIT by 10 points
ZSTD:PHONE Regressing on SPLIT by 3 points

but FLIP still has more size regressions & big ones there.

I'm continuing to mess with both algorithms, but I have wider problems to fix in FLIP & at a lower algorithm level than in SPLIT.

Cheers,
Gopal


