From dev-return-2070-archive-asf-public=cust-asf.ponee.io@orc.apache.org Thu Apr 12 21:54:48 2018
Return-Path:
X-Original-To: archive-asf-public@cust-asf.ponee.io
Delivered-To: archive-asf-public@cust-asf.ponee.io
Received: from mail.apache.org (hermes.apache.org [140.211.11.3]) by mx-eu-01.ponee.io (Postfix) with SMTP id E430F180634 for ; Thu, 12 Apr 2018 21:54:47 +0200 (CEST)
Received: (qmail 98713 invoked by uid 500); 12 Apr 2018 19:54:47 -0000
Mailing-List: contact dev-help@orc.apache.org; run by ezmlm
Precedence: bulk
List-Help:
List-Unsubscribe:
List-Post:
List-Id:
Reply-To: dev@orc.apache.org
Delivered-To: mailing list dev@orc.apache.org
Received: (qmail 98671 invoked by uid 99); 12 Apr 2018 19:54:46 -0000
Received: from git1-us-west.apache.org (HELO git1-us-west.apache.org) (140.211.11.23) by apache.org (qpsmtpd/0.29) with ESMTP; Thu, 12 Apr 2018 19:54:46 +0000
Received: by git1-us-west.apache.org (ASF Mail Server at git1-us-west.apache.org, from userid 33) id 3CABFE09E2; Thu, 12 Apr 2018 19:54:46 +0000 (UTC)
From: t3rmin4t0r
To: dev@orc.apache.org
Reply-To: dev@orc.apache.org
References:
In-Reply-To:
Subject: [GitHub] orc pull request #245: ORC-161: Proposal for new decimal encodings and stati...
Content-Type: text/plain
Message-Id: <20180412195446.3CABFE09E2@git1-us-west.apache.org>
Date: Thu, 12 Apr 2018 19:54:46 +0000 (UTC)

Github user t3rmin4t0r commented on a diff in the pull request:

    https://github.com/apache/orc/pull/245#discussion_r181202668

    --- Diff: site/_docs/encodings.md ---
    @@ -109,10 +109,20 @@ DIRECT_V2 | PRESENT | Yes | Boolean RLE
     Decimal was introduced in Hive 0.11 with infinite precision (the total
     number of digits). In Hive 0.13, the definition was change to limit
     the precision to a maximum of 38 digits, which conveniently uses 127
    -bits plus a sign bit. The current encoding of decimal columns stores
    -the integer representation of the value as an unbounded length zigzag
    -encoded base 128 varint. The scale is stored in the SECONDARY stream
    -as an signed integer.
    +bits plus a sign bit.
    +
    +DIRECT and DIRECT_V2 encodings of decimal columns stores the integer
    +representation of the value as an unbounded length zigzag encoded base
    +128 varint. The scale is stored in the SECONDARY stream as an signed
    +integer.
    +
    +In ORC 2.0, DECIMAL encoding is introduced and totally remove scale
    +stream as all decimal values use the same scale. When precision is
    +no greater than 18, decimal values can be fully represented by DATA
    +stream which stores 64-bit signed integers. When precision is greater
    +than 18, we use a 128-bit signed integer to store the decimal value.
    +DATA stream stores the higher 64 bits and SECONDARY stream holds the
    +lower 64 bits. Both streams use signed integer RLE v2.
    --- End diff --

    The multiple-stream + row-group stride problems for IO were discussed by Owen. The disk layout is what matters for IO, not the logical stream separation.

---
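
For readers following the diff, here is a rough Python sketch of the two value layouts it describes: the zigzag base-128 varint used by DIRECT/DIRECT_V2, and the high/low 64-bit split proposed for precision above 18. It is an illustration only, not ORC's actual writer code. The function names (`zigzag_varint`, `split_int128`, `join_int128`) are hypothetical, the varint is assumed to emit the low 7-bit group first, and the step that would feed the low half through signed integer RLE v2 is not shown.

```python
from typing import Tuple

MASK64 = (1 << 64) - 1


def zigzag_varint(value: int) -> bytes:
    """Zigzag-encode a signed integer of any size, then emit it as a
    base-128 varint (assumed here: low 7-bit group first)."""
    # Zigzag for unbounded integers: n >= 0 maps to 2n, n < 0 maps to -2n - 1,
    # so small magnitudes of either sign get short encodings.
    z = value * 2 if value >= 0 else -value * 2 - 1
    out = bytearray()
    while True:
        byte = z & 0x7F
        z >>= 7
        if z:
            out.append(byte | 0x80)  # high bit set: more bytes follow
        else:
            out.append(byte)
            return bytes(out)


def split_int128(value: int) -> Tuple[int, int]:
    """Split a signed 128-bit unscaled decimal value into
    (high signed 64 bits, low 64 bits as a raw unsigned bit pattern)."""
    if not -(1 << 127) <= value < (1 << 127):
        raise ValueError("value does not fit in a signed 128-bit integer")
    high = value >> 64    # arithmetic shift keeps the sign in the high half
    low = value & MASK64  # raw low bits; signed reinterpretation not shown
    return high, low


def join_int128(high: int, low: int) -> int:
    """Inverse of split_int128."""
    return (high << 64) | low
```

With this sketch, `split_int128(10**38 - 1)` yields the two halves that would go to the DATA and SECONDARY streams respectively, and `join_int128` recovers the original unscaled value. The scale is not encoded per value in the proposed layout, since the diff states that all values in the column share one scale.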