From: Gopal Vijayaraghavan
To: "dev@orc.apache.org"
Date: Tue, 09 Oct 2018 13:18:08 -0700
Subject: Re: Orc v2 Ideas
Message-ID: <2C0C3010-8BEE-4456-B836-D0D7234B2A26@hortonworks.com>

> How small are you trying to make the stripes? I ask because all of the above should be small, so if they are dominating, I would assume the stripe is tiny or the compression really worked well.

I'm not in favour of stripelets for seek reasons, because reading a single column from a remote store is hit by the extra skipping over stripelet boundaries (or I read through the boundaries).

Flushing at fixed offsets across all columns would not suffer from that and would not change the underlying read patterns.
There's already an "ORC gap cache" in LLAP to hack around the lack of these boundaries, but it's something I'd rather not keep around forever.

> The ORC spec currently calls for sorted dictionaries, so if the they are not sorted, they are not valid ORC files.
>
> I find that most dictionary are a relatively small size compared to the row count, so the cost of testing each entry isn’t a big deal.

I agree, moving that out of the spec would be a good thing.

The format could add a future optional stream, "sort-order-index", which contains the dictionary transform from unsorted to sorted (i.e. the dict-ids in byte-sorted order), so that the reader can remap the dictionary into a sorted list.

But removing the "always sort" requirement for dictionaries would be a good thing for writer throughput and memory consumption.

Cheers,
Gopal
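
For illustration only, here is a minimal Python sketch of the "sort-order-index" idea described above. The function names and the in-memory layout are assumptions made for this example; they are not part of the ORC spec or of any ORC library. It assumes the writer keeps dictionary entries in first-seen order and emits one extra list of dict-ids in byte-sorted order, which a reader can use to rebuild a sorted dictionary when it needs one.

```python
# Hypothetical sketch: names and layout are illustrative, not ORC APIs.

def build_sort_order_index(dictionary):
    """Writer side: return the dict-ids ordered by the byte order of their entries."""
    return sorted(range(len(dictionary)), key=lambda i: dictionary[i])


def remap_to_sorted(dictionary, sort_order_index, row_ids):
    """Reader side: rebuild a sorted dictionary and rewrite the row ids to match."""
    sorted_dictionary = [dictionary[i] for i in sort_order_index]
    # old_to_new[old_id] is the position of that entry in the sorted dictionary.
    old_to_new = [0] * len(dictionary)
    for new_id, old_id in enumerate(sort_order_index):
        old_to_new[old_id] = new_id
    return sorted_dictionary, [old_to_new[i] for i in row_ids]


# Entries are stored in first-seen order (assumed writer behaviour).
dictionary = [b"pear", b"apple", b"fig"]
row_ids = [0, 1, 1, 2, 0]

index = build_sort_order_index(dictionary)
sorted_dictionary, sorted_row_ids = remap_to_sorted(dictionary, index, row_ids)
```

A reader that does not need sorted order can ignore the index and decode rows straight from the unsorted dictionary; the remap only changes the ids, not the decoded values.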