Date: Tue, 09 Oct 2018 13:18:08 -0700
Subject: Re: Orc v2 Ideas
From: Gopal Vijayaraghavan <gopalv@apache.org>
To: "dev@orc.apache.org" <dev@orc.apache.org>


> How small are you trying to make the stripes?  I ask because all of the above should be small, so if they are dominating, I would assume the stripe is tiny or the compression really worked well.

I'm not in favour of stripelets, for seek reasons: reading a single column from a remote store is hit by the extra skipping over stripelet boundaries (or I have to read through the boundaries).

Flushing at fixed offsets across all columns would not suffer from that and would not change the underlying read patterns.
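
To make the seek argument concrete, here is a toy model in Python. It is only a sketch under my own assumptions: the layouts, the column names, the chunk size and the helper `column_ranges` are all hypothetical and are not taken from the ORC format or any ORC reader. It counts how many separate byte ranges a reader has to fetch to read one column.

```python
def column_ranges(layout, column):
    """Return the contiguous (start, end) byte ranges that hold `column`.

    `layout` is a list of (column_name, size_in_bytes) chunks in file order.
    Adjacent chunks of the same column are merged into a single range.
    """
    ranges = []
    offset = 0
    for col, size in layout:
        if col == column:
            if ranges and ranges[-1][1] == offset:
                # Chunk directly follows the previous one: extend the range.
                ranges[-1] = (ranges[-1][0], offset + size)
            else:
                ranges.append((offset, offset + size))
        offset += size
    return ranges


columns = ["a", "b", "c"]
groups = 4    # hypothetical number of flush points in one stripe
chunk = 100   # hypothetical chunk size in bytes

# Stripelets (assumed layout): each group stores all of its columns together.
stripelet_layout = [(c, chunk) for _ in range(groups) for c in columns]

# Fixed-offset flushes (assumed layout): each column stays contiguous
# and is merely flushed at the same points as every other column.
contiguous_layout = [(c, chunk) for c in columns for _ in range(groups)]

stripelet_reads = column_ranges(stripelet_layout, "b")
contiguous_reads = column_ranges(contiguous_layout, "b")

print(len(stripelet_reads), "ranges with stripelets")
print(len(contiguous_reads), "range with fixed-offset flushes")
```

In this model the stripelet layout needs one range per group for a single column, while the column-contiguous layout needs one range in total, which is the read pattern a remote store prefers.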

There's already an "ORC gap cache" in LLAP to hack around the lack of these boundaries, but that's something I'd rather not keep around forever.

> The ORC spec currently calls for sorted dictionaries, so if they are not sorted, they are not valid ORC files.
> I find that most dictionaries are relatively small compared to the row count, so the cost of testing each entry isn’t a big deal.

I agree, moving that out of the spec would be a good thing.

The format could add a future optional stream, "sort-order-index", which contains the dictionary transform from unsorted to sorted (i.e. dict-ids in byte-sorted order), so that the reader can remap it into a sorted list.
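
A rough sketch of how such a stream could work, in Python. This is an illustration under assumptions, not a proposal for the actual encoding: the function names `build_sort_order_index` and `remap_to_sorted` are hypothetical, and the dictionary is modelled as a plain list of byte strings in insertion order.

```python
def build_sort_order_index(dictionary):
    """Writer side: list the dict-ids in byte-sorted order of their entries.

    The dictionary itself is left in insertion order, so the writer
    never has to re-sort it or rewrite ids it has already emitted.
    """
    return sorted(range(len(dictionary)), key=lambda i: dictionary[i])


def remap_to_sorted(dictionary, sort_order_index, ids):
    """Reader side: rebuild a sorted dictionary and translate the row ids."""
    sorted_dict = [dictionary[i] for i in sort_order_index]
    # rank[old_id] is the position of that entry in the sorted dictionary.
    rank = [0] * len(dictionary)
    for new_id, old_id in enumerate(sort_order_index):
        rank[old_id] = new_id
    return sorted_dict, [rank[i] for i in ids]


# Unsorted dictionary in the order the writer first saw each value.
dictionary = [b"pear", b"apple", b"fig"]
ids = [0, 1, 1, 2, 0]  # per-row dict-ids: pear, apple, apple, fig, pear

index = build_sort_order_index(dictionary)
sorted_dict, sorted_ids = remap_to_sorted(dictionary, index, ids)

print(index)        # [1, 2, 0]
print(sorted_dict)  # [b'apple', b'fig', b'pear']
print(sorted_ids)   # [2, 0, 0, 1, 2]
```

A reader that does not need sorted order can ignore the index entirely and decode the rows with the unsorted dictionary as written.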

But removing the "always sort" requirement for dictionaries would be a good thing for writer throughput and memory consumption.

Cheers,
Gopal