From: omalley
To: dev@orc.apache.org
Reply-To: dev@orc.apache.org
Subject: [GitHub] orc pull request #245: ORC-161: Proposal for new decimal encodings and stati...
Message-Id: <20180413225109.18BFCF32B1@git1-us-west.apache.org>
Date: Fri, 13 Apr 2018 22:51:09 +0000 (UTC)

Github user omalley commented on a diff in the pull request:

    https://github.com/apache/orc/pull/245#discussion_r181456234

    --- Diff: site/_docs/file-tail.md ---
    @@ -249,12 +249,25 @@ For booleans, the statistics include the count of false and true values.
     }
     ```

    -For decimals, the minimum, maximum, and sum are stored.
    +For decimals, the minimum, maximum, and sum are stored. In ORC 2.0,
    +string representation is deprecated and DecimalStatistics uses integers
    +which have better performance.

     ```message DecimalStatistics {
      optional string minimum = 1;
      optional string maximum = 2;
      optional string sum = 3;
    +  message Int128 {
    --- End diff --

    Let's pull the Int128 out of DecimalStatistics. We will likely use it in other places.

    One concern with this representation is that -1 is pretty painful. You'll get highBits = -1, lowBits = -1, which will only take 1 byte for highBits, but 9 bytes for lowBits (+ the 4 bytes of field identifiers & message length) = 14 bytes total.

    Another alternative is to use the zigzag encoding for the combined 128 bit value:

        optional uint64 minLow = 4;
        optional uint64 minHigh = 5;

        p <= 18:
          minLow = zigzag(min)
          minHigh = 0
        p > 18:
          minLow = low bits of zigzag(min)
          minHigh = high bits of zigzag(min)

    That would have a representation of 1 byte each for minLow and minHigh + 2 bytes for field identifier = 4. If we leave the Int128 level, that would add an additional 2 bytes.

---
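
For reference, the zigzag alternative described above can be sketched in Python. This is only an illustration under stated assumptions, not code from the PR: the helper names (`zigzag128`, `unzigzag128`, `split_zigzag`, `varint_size`, `encoded_size`) are hypothetical, and the size estimate assumes protobuf varint encoding with a one-byte tag for each of the proposed fields 4 and 5, both written explicitly. Because any decimal with p <= 18 zigzags to a value below 2^64, a single 128-bit routine covers both precision cases and yields minHigh = 0 for the small one.

```python
MASK64 = (1 << 64) - 1
MASK128 = (1 << 128) - 1


def zigzag128(value: int) -> int:
    """Map a signed 128-bit integer to unsigned: 0, -1, 1, -2 -> 0, 1, 2, 3."""
    if not -(1 << 127) <= value < (1 << 127):
        raise ValueError("value does not fit in a signed 128-bit integer")
    # Python's >> on a negative int is an arithmetic shift, so the right
    # operand is -1 (all ones) for negatives and 0 otherwise.
    return ((value << 1) ^ (value >> 127)) & MASK128


def unzigzag128(encoded: int) -> int:
    """Inverse of zigzag128."""
    return (encoded >> 1) ^ -(encoded & 1)


def split_zigzag(value: int) -> tuple[int, int]:
    """Return (low, high) 64-bit halves of the zigzag-encoded value."""
    encoded = zigzag128(value)
    return encoded & MASK64, encoded >> 64


def varint_size(unsigned: int) -> int:
    """Bytes needed for a protobuf base-128 varint."""
    size = 1
    while unsigned >= 0x80:
        unsigned >>= 7
        size += 1
    return size


def encoded_size(value: int) -> int:
    """Estimated bytes for minLow + minHigh, assuming one-byte field tags."""
    low, high = split_zigzag(value)
    return (1 + varint_size(low)) + (1 + varint_size(high))


print(split_zigzag(-1))   # (1, 0)
print(encoded_size(-1))   # 4
```

For -1 this gives the 4 bytes quoted in the comment: one byte each for the two values plus one tag byte each.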