From dev-return-2084-archive-asf-public=cust-asf.ponee.io@orc.apache.org  Sat Apr 14 00:51:11 2018
Return-Path: <dev-return-2084-archive-asf-public=cust-asf.ponee.io@orc.apache.org>
X-Original-To: archive-asf-public@cust-asf.ponee.io
Delivered-To: archive-asf-public@cust-asf.ponee.io
Received: from mail.apache.org (hermes.apache.org [140.211.11.3])
	by mx-eu-01.ponee.io (Postfix) with SMTP id 1F775180718
	for <archive-asf-public@cust-asf.ponee.io>; Sat, 14 Apr 2018 00:51:10 +0200 (CEST)
Received: (qmail 87396 invoked by uid 500); 13 Apr 2018 22:51:10 -0000
Mailing-List: contact dev-help@orc.apache.org; run by ezmlm
Precedence: bulk
List-Help: <mailto:dev-help@orc.apache.org>
List-Unsubscribe: <mailto:dev-unsubscribe@orc.apache.org>
List-Post: <mailto:dev@orc.apache.org>
List-Id: <dev.orc.apache.org>
Reply-To: dev@orc.apache.org
Delivered-To: mailing list dev@orc.apache.org
Received: (qmail 87276 invoked by uid 99); 13 Apr 2018 22:51:09 -0000
Received: from git1-us-west.apache.org (HELO git1-us-west.apache.org) (140.211.11.23)
    by apache.org (qpsmtpd/0.29) with ESMTP; Fri, 13 Apr 2018 22:51:09 +0000
Received: by git1-us-west.apache.org (ASF Mail Server at git1-us-west.apache.org, from userid 33)
	id 18BFCF32B1; Fri, 13 Apr 2018 22:51:09 +0000 (UTC)
From: omalley <git@git.apache.org>
To: dev@orc.apache.org
Reply-To: dev@orc.apache.org
References: <git-pr-245-orc@git.apache.org>
In-Reply-To: <git-pr-245-orc@git.apache.org>
Subject: [GitHub] orc pull request #245: ORC-161: Proposal for new decimal encodings and stati...
Content-Type: text/plain
Message-Id: <20180413225109.18BFCF32B1@git1-us-west.apache.org>
Date: Fri, 13 Apr 2018 22:51:09 +0000 (UTC)

Github user omalley commented on a diff in the pull request:

    https://github.com/apache/orc/pull/245#discussion_r181456234
  
    --- Diff: site/_docs/file-tail.md ---
    @@ -249,12 +249,25 @@ For booleans, the statistics include the count of false and true values.
     }
     ```
     
    -For decimals, the minimum, maximum, and sum are stored.
    +For decimals, the minimum, maximum, and sum are stored. In ORC 2.0,
    +string representation is deprecated and DecimalStatistics uses integers
    +which have better performance.
     
     ```message DecimalStatistics {
      optional string minimum = 1;
      optional string maximum = 2;
      optional string sum = 3;
    +  message Int128 {
    --- End diff --
    
    Let's pull the Int128 out of DecimalStatistics. We will likely use it in other places.
    
    One concern with this representation is that -1 is pretty painful. You'll get highBits = -1, lowBits = -1, which will only take 1 byte for highBits, but 9 bytes for lowBits (+ the 4 bytes of field identifiers & message length) = 14 bytes total. Another alternative is to use the zigzag encoding for the combined 128 bit value:
    
      optional uint64 minLow = 4;
      optional uint64 minHigh = 5;
    
    p <= 18:
      minLow = zigzag(min)
      minHigh = 0
    
    p > 18:
      minLow = low bits of zigzag(min)
      minHigh = high bits of zigzag(min)
    
    For -1, that would take 1 byte each for minLow and minHigh + 2 bytes for field identifiers = 4 bytes. If we keep the Int128 message level, that would add another 2 bytes.
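
    As a rough sketch of the zigzag alternative described above (not code from the PR), the following Python computes minLow and minHigh for a 128 bit value and estimates the encoded size. The function names are hypothetical, and the size estimate assumes protobuf varint encoding with one-byte tags (field numbers below 16) and both fields written explicitly.

```python
MASK64 = (1 << 64) - 1
MASK128 = (1 << 128) - 1


def zigzag128(value: int) -> int:
    """Map a signed 128-bit integer to an unsigned one so small magnitudes stay small."""
    if not -(1 << 127) <= value < (1 << 127):
        raise ValueError("value does not fit in a signed 128-bit integer")
    # Same formula protobuf uses for sint64, widened to 128 bits.
    return ((value << 1) ^ (value >> 127)) & MASK128


def unzigzag128(encoded: int) -> int:
    """Inverse of zigzag128."""
    return (encoded >> 1) ^ -(encoded & 1)


def split_zigzag(value: int) -> tuple[int, int]:
    """Return (minLow, minHigh): the low and high 64 bits of zigzag(value)."""
    z = zigzag128(value)
    return z & MASK64, z >> 64


def join_zigzag(low: int, high: int) -> int:
    """Rebuild the signed value from the two uint64 halves."""
    return unzigzag128((high << 64) | low)


def varint_size(n: int) -> int:
    """Number of bytes a protobuf varint needs for the unsigned integer n."""
    size = 1
    while n >= 0x80:
        n >>= 7
        size += 1
    return size


def encoded_size(value: int) -> int:
    """Bytes for both fields: one assumed tag byte each plus the varint payloads."""
    low, high = split_zigzag(value)
    return 1 + varint_size(low) + 1 + varint_size(high)


if __name__ == "__main__":
    low, high = split_zigzag(-1)
    print(low, high, encoded_size(-1))
```

    With these assumptions, -1 maps to minLow = 1 and minHigh = 0, which matches the 4 byte estimate. For p <= 18 the zigzag value always fits in 64 bits, so minHigh comes out as 0 without a separate code path.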


---