From dev-return-2594-archive-asf-public=cust-asf.ponee.io@orc.apache.org  Fri Sep 21 02:16:12 2018
Return-Path: <dev-return-2594-archive-asf-public=cust-asf.ponee.io@orc.apache.org>
X-Original-To: archive-asf-public@cust-asf.ponee.io
Delivered-To: archive-asf-public@cust-asf.ponee.io
Received: from mail.apache.org (hermes.apache.org [140.211.11.3])
	by mx-eu-01.ponee.io (Postfix) with SMTP id 3CDE1180671
	for <archive-asf-public@cust-asf.ponee.io>; Fri, 21 Sep 2018 02:16:12 +0200 (CEST)
Received: (qmail 85071 invoked by uid 500); 21 Sep 2018 00:16:11 -0000
Mailing-List: contact dev-help@orc.apache.org; run by ezmlm
Precedence: bulk
List-Help: <mailto:dev-help@orc.apache.org>
List-Unsubscribe: <mailto:dev-unsubscribe@orc.apache.org>
List-Post: <mailto:dev@orc.apache.org>
List-Id: <dev.orc.apache.org>
Reply-To: dev@orc.apache.org
Delivered-To: mailing list dev@orc.apache.org
Received: (qmail 85052 invoked by uid 99); 21 Sep 2018 00:16:10 -0000
Received: from pnap-us-west-generic-nat.apache.org (HELO spamd4-us-west.apache.org) (209.188.14.142)
    by apache.org (qpsmtpd/0.29) with ESMTP; Fri, 21 Sep 2018 00:16:10 +0000
Received: from localhost (localhost [127.0.0.1])
	by spamd4-us-west.apache.org (ASF Mail Server at spamd4-us-west.apache.org) with ESMTP id 47DC8C1AE0
	for <dev@orc.apache.org>; Fri, 21 Sep 2018 00:16:10 +0000 (UTC)
X-Virus-Scanned: Debian amavisd-new at spamd4-us-west.apache.org
X-Spam-Flag: NO
X-Spam-Score: 1.889
X-Spam-Level: *
X-Spam-Status: No, score=1.889 tagged_above=-999 required=6.31
	tests=[DKIM_SIGNED=0.1, DKIM_VALID=-0.1, DKIM_VALID_AU=-0.1,
	HTML_MESSAGE=2, SPF_PASS=-0.001, T_DKIMWL_WL_MED=-0.01]
	autolearn=disabled
Authentication-Results: spamd4-us-west.apache.org (amavisd-new);
	dkim=pass (2048-bit key) header.d=gmail.com
Received: from mx1-lw-us.apache.org ([10.40.0.8])
	by localhost (spamd4-us-west.apache.org [10.40.0.11]) (amavisd-new, port 10024)
	with ESMTP id BPrriOnL6Z8U for <dev@orc.apache.org>;
	Fri, 21 Sep 2018 00:16:06 +0000 (UTC)
Received: from mail-it1-f177.google.com (mail-it1-f177.google.com [209.85.166.177])
	by mx1-lw-us.apache.org (ASF Mail Server at mx1-lw-us.apache.org) with ESMTPS id 6AE215F17B
	for <dev@orc.apache.org>; Fri, 21 Sep 2018 00:16:06 +0000 (UTC)
Received: by mail-it1-f177.google.com with SMTP id h1-v6so184667itj.4
        for <dev@orc.apache.org>; Thu, 20 Sep 2018 17:16:06 -0700 (PDT)
DKIM-Signature: v=1; a=rsa-sha256; c=relaxed/relaxed;
        d=gmail.com; s=20161025;
        h=mime-version:references:in-reply-to:from:date:message-id:subject:to;
        bh=YnuxNQ3CO98HXXwx4oHpTMWcAeAHd3BA6sfW6QOtc5M=;
        b=EBU/gFmC530h4KjwGmdoHuOVqm70ZP7FBysTAY2h/mRwEuqDJU4zAWbSLfKSgGZthS
         UuGtuzud0AGUkT/v6C7SJqYC9qFaUTUma560HR4QQX1PSvwsT7LG8gMBk+9oHfMKX4P4
         G7RAYEIYTa+hsEaeAZUwiibWNloI4pcVO6kIAqZxonVraiQmmkyg4jrisBLYIqBh/W/u
         vrtJWXEujvrNTmGwyedtx9mqCDBMnWlzrPDOYfxEuPvfwGxEiiYR+kk6Rmh2Gitau94g
         iWi4wqrqhsrRo65uv7M1YaSI9yjoK9rNCTifC34KaTTF6VMyvod5bTp5nt1e+s4DHKZo
         VY0w==
X-Google-DKIM-Signature: v=1; a=rsa-sha256; c=relaxed/relaxed;
        d=1e100.net; s=20161025;
        h=x-gm-message-state:mime-version:references:in-reply-to:from:date
         :message-id:subject:to;
        bh=YnuxNQ3CO98HXXwx4oHpTMWcAeAHd3BA6sfW6QOtc5M=;
        b=X4Tnns13c4ExYQSc0dcx8Mw4Xl23iP6slBT4Kbo+45DfHOtLwKkS/sF8FGZbFXM1dX
         +3423wDGSOfRAGl7z2DwoX40traZyg9JN5gCtNh7JLq+/CsJKZm5rw7Kay5rvH4FfYo4
         ka8/+D9wZDQCbfXSKyT+JPgvfCdIHIl7rlV3lVCcEAcK/ITaRg1HKYdkq19Lw0eB6kOI
         Ht+z7swKKXZHvJPA8CWNXb1ysh1CwEtOnvFnb2qyINGUEM/S4aGOTYbqxZOo72HeDnjj
         Wx78EJ2F10AI0kETBOKX+KlfrDiG3b+AEJuLZWlAJYl4nTWzBo9p+vvag+N6WiaPWgJi
         fk4Q==
X-Gm-Message-State: APzg51BIZR1cf1PQ3lSh1LLKEJrIqyo1CiW1diAestldXpIzejx6GZOb
	PmY29jBDQq1JSqpIm3qWQtvbgTtz/oVulfN46dtXZo7H
X-Google-Smtp-Source: ANB0VdaHN/Xr+I2p3zq1ewIVTzntYdX/QfYc+FaXPYIRkoYt2Gr8RuozYKlkd/angJZlrWXFCcjJsUOpI0sJWE8ZTgY=
X-Received: by 2002:a24:1355:: with SMTP id 82-v6mr4464649itz.74.1537488960475;
 Thu, 20 Sep 2018 17:16:00 -0700 (PDT)
MIME-Version: 1.0
References: <CAEokuX_vRy3q8bMRavpNxsT3PS1DVJXKxcwX7q4T31=k0uRFjQ@mail.gmail.com>
 <C1C082A9-B7CD-4FF7-A9E0-9DC3D5A286DD@hortonworks.com> <CAHfHakHBR-5dA_9UEmPL9jk9Ewu=Vj4yvxsF8QEeSUoaQSgDfg@mail.gmail.com>
 <CAEokuX_FMY84K5FXL3XV0W1LzKmmC2Nwm5tAhOpn1J6AGWH1uA@mail.gmail.com> <CAHfHakGrn=eDNgjm9qGUYDitq-5xj6GsBdQ3tHJPEuAXJGQMpw@mail.gmail.com>
In-Reply-To: <CAHfHakGrn=eDNgjm9qGUYDitq-5xj6GsBdQ3tHJPEuAXJGQMpw@mail.gmail.com>
From: Gang Wu <ustcwg@gmail.com>
Date: Thu, 20 Sep 2018 17:15:49 -0700
Message-ID: <CAEokuX8vrSZCbFztWSEvjUF1r6-p7MBJCCCqdOk-iu15Vb5MzA@mail.gmail.com>
Subject: Re: [Discussion] Base 128 variable integer encoding is not always good
To: dev@orc.apache.org
Content-Type: multipart/alternative; boundary="00000000000055d5f60576568b32"

--00000000000055d5f60576568b32
Content-Type: text/plain; charset="UTF-8"
Content-Transfer-Encoding: quoted-printable

Owen,

Yes, you are correct. I misunderstood RLEv2, which does not use LEB128
outside of its header.

To answer your question:
1. In my experiment, "RLEv1 + fixed 8 byte" means that we skip LEB128
encoding for RLE literals and write each literal directly as a fixed 8
bytes in little endian.
2. The data comes from our production data. It is the primary key of a table
and is naturally sorted.
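
To make point 1 concrete, here is a minimal Python sketch of the two literal
layouts being compared. It is not the ORC writer code; the function names
`leb128_encode` and `fixed64_le_encode` are made up for illustration, and the
sample values are the min and max of the 512-value sample quoted further down
in this thread.

```python
import struct


def leb128_encode(value: int) -> bytes:
    """Unsigned base-128 varint: 7 payload bits per byte, high bit set when more bytes follow."""
    if value < 0:
        raise ValueError("expects an unsigned value (zigzag-encode signed values first)")
    out = bytearray()
    while True:
        byte = value & 0x7F
        value >>= 7
        if value:
            out.append(byte | 0x80)  # continuation bit: more bytes follow
        else:
            out.append(byte)         # last byte: continuation bit clear
            return bytes(out)


def fixed64_le_encode(value: int) -> bytes:
    """Always 8 bytes, least significant byte first."""
    return struct.pack("<Q", value)


if __name__ == "__main__":
    for v in (16430, 2403786):
        print(v, len(leb128_encode(v)), len(fixed64_le_encode(v)))
```

The LEB128 output length depends on the magnitude of each value, while the
fixed layout keeps every literal at the same byte offset within an 8-byte
stride.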

To summarize my investigation: we have some datasets (which are not good
for RLE) that indicate that disabling RLE (directly writing 64-bit integers
in little endian) is better than either RLEv1 or RLEv2 when the compressor
is zstd. The one exception is dataset 1, where RLEv2 is still the smallest.
Please see the table below:
dataset     NO RLE      RLEv1   RLEv1 + fixed 64bit      RLEv2
1          638,920  1,188,651               685,522    533,710
2          801,544    985,595               871,753    929,763
3          928,168  1,271,394             1,024,290  1,282,509
4          856,264    987,859               895,738  1,000,487
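
The sizes are easier to compare as ratios against the `NO RLE` column. The
following Python sketch uses only the numbers reported above; the dictionary
layout and the helper name `relative_to_no_rle` are illustrative, not part of
any ORC tool.

```python
# Sizes copied from the table above, keyed by dataset number.
sizes = {
    1: {"NO RLE": 638920, "RLEv1": 1188651, "RLEv1 + fixed 64bit": 685522, "RLEv2": 533710},
    2: {"NO RLE": 801544, "RLEv1": 985595, "RLEv1 + fixed 64bit": 871753, "RLEv2": 929763},
    3: {"NO RLE": 928168, "RLEv1": 1271394, "RLEv1 + fixed 64bit": 1024290, "RLEv2": 1282509},
    4: {"NO RLE": 856264, "RLEv1": 987859, "RLEv1 + fixed 64bit": 895738, "RLEv2": 1000487},
}


def relative_to_no_rle(row: dict) -> dict:
    """Return each encoding's size divided by the NO RLE size of the same dataset."""
    baseline = row["NO RLE"]
    return {name: size / baseline for name, size in row.items()}


if __name__ == "__main__":
    for dataset, row in sizes.items():
        ratios = relative_to_no_rle(row)
        print(dataset, {name: round(r, 3) for name, r in ratios.items()})
```

A ratio above 1.0 means the encoding produced a larger result than writing
the plain 64-bit integers. The `RLEv1 + fixed 64bit` ratio minus 1.0 is a
rough estimate of the RLE header overhead.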

I agree that we shouldn't break compression+encoding in ORC1, and this is
exactly a good opportunity for RLEv3 in ORC2 to improve. IMO, the overhead
comes from three factors:
1. Zigzag encoding of the sign. Its overhead is fairly low, about 2%
according to my experiment. It cannot be dropped, as we need some way to
encode the sign.
2. RLE headers. Comparing the `NO RLE` and `RLEv1 + fixed 64bit` columns in
the above table gives roughly the overhead of the RLE headers after
compression.
3. The largest overhead comes from LEB128 in RLEv1 and bit packing in RLEv2
DIRECT mode, both of which break the regularity that zstd relies on.
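
For reference, the zigzag step in point 1 maps a signed 64-bit value to an
unsigned one with `(val << 1) ^ (val >> 63)`. Below is a minimal Python
sketch of that mapping and its inverse. The function names are hypothetical,
and the explicit 64-bit mask is only needed because Python integers are
unbounded.

```python
MASK64 = (1 << 64) - 1


def zigzag_encode(value: int) -> int:
    """Map a signed 64-bit integer to unsigned: 0, -1, 1, -2, ... -> 0, 1, 2, 3, ..."""
    if not -(1 << 63) <= value < (1 << 63):
        raise ValueError("value does not fit in a signed 64-bit integer")
    # Python's >> on a negative int is an arithmetic shift, so value >> 63 is 0 or -1.
    return ((value << 1) ^ (value >> 63)) & MASK64


def zigzag_decode(encoded: int) -> int:
    """Inverse of zigzag_encode for values in the unsigned 64-bit range."""
    return (encoded >> 1) ^ -(encoded & 1)


if __name__ == "__main__":
    for v in (0, -1, 1, -2, 2):
        print(v, zigzag_encode(v))
```

Small magnitudes of either sign stay small after encoding, which is why the
sign handling costs little on its own.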

I will investigate further and try out the benchmark tools. I will post
here if I have any new findings.

Thanks,
Gang

On Wed, Sep 19, 2018 at 3:25 PM Owen O'Malley <owen.omalley@gmail.com>
wrote:

> Thanks for the sample data.
>
> Just out of curiosity, is the natural data actually sorted like that?
>
> I think you have a misunderstanding of RLEv2. It doesn't use LEB128 except
> for the values in the header. What does RLEv1 + fixed 8 byte mean?
>
> Based on the 512 values that you posted, I see:
>
> 512 values, min = 16430, max = 2403786, minDelta = 6, maxDelta = 78612
> order = incr, bitLen = 22, deltaBitLen = 17
>
> so the RLEv2 should have used the delta encoding. RLEv2 should have used 24
> bit for each of the values in the encoding. Although with the bitLen and
> deltaBitLen both between 16 and 24 bits, the delta encoding doesn't help
> much. Anyways looking at what those 512 sample numbers will look like in
> RLEv2:
>
> header: 2 bytes
> base: 3 bytes
> firstDelta: 3 bytes
> rest: 510 * 3 bytes
> total: 1538 bytes
>
> compared to the direct encoding of 512 * 8 bytes = 4096 bytes. The RLEv2 is at
> 38% of the direct 8 byte encoding. Even if the data wasn't sorted, it
> should end up with patched base with a similar size (~3 bytes/value).
>
> Part of the reason that we don't use the odd bit sizes that are defined for
> RLEv2 was precisely because zlib didn't compress well with the non-byte
> aligned data. Have you tried extending the java/benchmarks with zstd to see
> what happens with other data sets? I guess you could add an ORC0 option to
> the benchmark to compare RLEv1 to RLEv2 under each of the compression
> codecs.
>
> This is a great conversation and I think it will be great for the RLEv3 and
> ORC 2 format. In ORC 1, I don't think we should use the compression codec
> as a factor to disable or change the RLE, especially based on a single data
> set. I'd be more tempted to use zstd with the level set high enough that it
> is useful on RLE data, since that doesn't break any old readers.
>
> .. Owen
>
> On Tue, Sep 18, 2018 at 8:48 PM Gang Wu <ustcwg@gmail.com> wrote:
>
> > Owen
> >   I have put the example data to reproduce the issue in
> > https://github.com/facebook/zstd/issues/1325. It contains 512 unsigned
> > numbers which are already zigzag-encoded using (val << 1) ^ (val >> 63). The
> > low overhead representation of literals is exactly what we need for RLEv3.
> > We should also pay attention that zstd does not work well with LEB128 but
> > zlib can get better compression ratio with LEB128. There is no one-for-all
> > solution and we may come up with several optimal combinations of encoding
> > and compression settings.
> >
> > Gopal
> >   DIRECT_V2 is RLEv2 which can alleviate the issue but not resolve it. I
> > will take a look at the orc.encoding.strategy setting.
> >
> > Thanks!
> > Gang
> >
> > On Tue, Sep 18, 2018 at 4:08 PM Owen O'Malley <owen.omalley@gmail.com>
> > wrote:
> >
> > > Gang,
> > >    As you correctly point out, some columns don't work well with RLE.
> > > Unfortunately, without being able to look at the data it is hard for me to
> > > guess what the right compression strategies are. Based on your description,
> > > I would guess that the data doesn't have a lot of patterns to it and covers
> > > the majority of the 64 bit integer space. I think the best approach would
> > > be to make sure that RLEv3 has a low overhead representation of literals.
> > > So a literal mode something like:
> > >
> > > header: 2 bytes (literal, 512 values, size 64bit)
> > > data: 512 * 8 bytes
> > >
> > > So the overhead would be roughly 2/4096 = 0.0005.
> > >
> > > Thoughts?
> > >
> > > On Tue, Sep 18, 2018 at 3:38 PM Gopal Vijayaraghavan <gopalv@apache.org>
> > > wrote:
> > >
> > > > Hi,
> > > >
> > > > >  From above observation, we find that it is better to disable LEB128
> > > > > encoding while zstd is used.
> > > >
> > > > You can enable file size optimizations (automatically recommend better
> > > > layouts for compression) when
> > > >
> > > > "orc.encoding.strategy"="COMPRESSION"
> > > >
> > > > There are a bunch of bitpacking loops that are controlled by that flag
> > > > already.
> > > >
> > > > >     https://github.com/facebook/zstd/issues/1325.
> > > >
> > > > If I understand that correctly, a DIRECT_V2 would also work fine for the
> > > > numeric sequences in Zstd instead?
> > > >
> > > > Cheers,
> > > > Gopal

--00000000000055d5f60576568b32--