From: "Owen O'Malley"
Date: Wed, 3 Apr 2019 00:47:31 +0200
Subject: Re: Type length, scale, and precision?
To: dev@orc.apache.org

Sorry, I managed to miss this message.

On Tue, Mar 19, 2019 at 9:31 PM Dain Sundstrom wrote:

> For the types in the ORC footer, we have the following:
>
>   // the maximum length of the type for varchar or char in UTF-8 characters
>   optional uint32 maximumLength = 4;
>   // the precision and scale for decimal
>   optional uint32 precision = 5;
>   optional uint32 scale = 6;
>
> If maximumLength is set to N, can I be confident that no value for
> that column in the file will contain more than N UTF-8 characters? Is this
> still true for concatenated ORC files?

Yes. The merger should insist that the schemas are the same for all merged
files. We could consider loosening that restriction, but in all cases the
length of the values must not exceed the declared length in the footer.

Until recently we had a bug that truncated to N bytes instead of N UTF-8
characters. That was a mistake.

> I have a similar question about DECIMAL. Decimal encoding currently uses
> the SECONDARY stream to encode the "scale". Is this scale guaranteed to be
> the same scale as the type scale in the footer?

In Hive 0.11 the decimal values didn't have a declared scale. That is why
the scale is encoded per value. For short decimals (p <= 18) in recent
Hive/ORC versions, you'll have that guarantee. Otherwise, it still uses the
HiveDecimalWritable code, which removes trailing zeros, so the scale for a
value may be less than the declared scale.

> Thanks,
>
> -dain
>
> ----
> Dain Sundstrom
> Co-founder @ Presto Software Foundation, Co-creator of Presto (https://prestosql.io)
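
To make the two points in the reply concrete, here is a minimal Python sketch. It is not ORC or Hive code. The function names (truncate_to_chars, truncate_to_bytes, rescale) are hypothetical, "UTF-8 characters" is assumed to mean Unicode code points, and rescale only handles the case described above, where a value's stored scale is less than or equal to the declared scale.

```python
def truncate_to_chars(value: str, max_length: int) -> str:
    """Intended behaviour: keep at most max_length code points."""
    return value[:max_length]


def truncate_to_bytes(value: str, max_length: int) -> str:
    """The buggy behaviour: cut the UTF-8 encoding at max_length bytes.

    A multi-byte character split by the cut is dropped entirely.
    """
    cut = value.encode("utf-8")[:max_length]
    return cut.decode("utf-8", errors="ignore")


def rescale(unscaled: int, value_scale: int, declared_scale: int) -> int:
    """Re-express an unscaled decimal integer at the declared scale.

    Assumes value_scale <= declared_scale, as happens when trailing
    zeros have been removed from the stored value.
    """
    if value_scale > declared_scale:
        raise ValueError("value scale exceeds declared scale")
    return unscaled * 10 ** (declared_scale - value_scale)


# "héllo" is 5 characters but 6 bytes in UTF-8, so the two
# truncations disagree for a declared length of 5.
by_chars = truncate_to_chars("héllo", 5)
by_bytes = truncate_to_bytes("héllo", 5)

# 1.50 with declared scale 2, stored as 15 at scale 1 once the
# trailing zero is removed, becomes 150 at scale 2 again.
restored = rescale(15, 1, 2)
```

A reader that wants every decimal at the footer scale would apply something like rescale to each (value, scale) pair it decodes. For short decimals in recent versions the stored scale already matches, so the multiplication is by 1.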