From: Tim Armstrong (Code Review)
To: Bikramjeet Vig, impala-cr@cloudera.com, reviews@impala.incubator.apache.org
CC: Matthew Jacobs, Lars Volker, Dan Hecht
Date: Wed, 1 Nov 2017 20:53:08 +0000
Subject: [Impala-ASF-CR] IMPALA-2494: Support for byte array encoded decimals in Parquet scanner

Tim Armstrong has posted comments on this change. ( http://gerrit.cloudera.org:8080/7822 )

Change subject: IMPALA-2494: Support for byte array encoded decimals in Parquet scanner
......................................................................

Patch Set 7:

(2 comments)

I think we'll need to do some more work on testing for the int32/64 patches - we don't have pre-existing test files from parquet-mr. I think we'll have to generate some more test files with parquet-mr for the other cases, and we could consider turning that code into a data generator to generate more test files. From what I could tell Hive doesn't have a knob to generate some of these alternative output encodings.

I feel ok with the coverage since we have end-to-end tests then more exhaustive unit tests for the various ways of encoding it.

http://gerrit.cloudera.org:8080/#/c/7822/7/be/src/exec/parquet-common.h
File be/src/exec/parquet-common.h:

http://gerrit.cloudera.org:8080/#/c/7822/7/be/src/exec/parquet-common.h@391
PS7, Line 391: fixed_len_size
Looked again. The variable name (and recycling the argument storage) is confusing. Maybe 'encoded_byte_size'?

http://gerrit.cloudera.org:8080/#/c/7822/7/be/src/exec/parquet-plain-test.cc
File be/src/exec/parquet-plain-test.cc:

http://gerrit.cloudera.org:8080/#/c/7822/7/be/src/exec/parquet-plain-test.cc@33
PS7, Line 33: int EncodeVarLenDecimal(const DECIMAL_TYPE& t, int fixed_len_size, uint8_t* buffer){
I took another look at the standard and it says that the minimum number of bytes required to store the unscaled value should be used: https://github.com/apache/parquet-format/blob/master/LogicalTypes.md#decimal

I think this means that we should not be including any preceding "0" bytes. I.e. we should not have a fixed_len_size argument and instead determine the size based on the number of leading zero bytes in the value.

--
To view, visit http://gerrit.cloudera.org:8080/7822
To unsubscribe, visit http://gerrit.cloudera.org:8080/settings

Gerrit-Project: Impala-ASF
Gerrit-Branch: master
Gerrit-MessageType: comment
Gerrit-Change-Id: I2c0e881045109f337fecba53fec21f9cfb9e619e
Gerrit-Change-Number: 7822
Gerrit-PatchSet: 7
Gerrit-Owner: Bikramjeet Vig
Gerrit-Reviewer: Bikramjeet Vig
Gerrit-Reviewer: Dan Hecht
Gerrit-Reviewer: Lars Volker
Gerrit-Reviewer: Matthew Jacobs
Gerrit-Reviewer: Tim Armstrong
Gerrit-Comment-Date: Wed, 01 Nov 2017 20:53:08 +0000
Gerrit-HasComments: Yes
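
To illustrate the comment on EncodeVarLenDecimal above, here is a rough sketch of sizing the encoding from the value itself rather than from a fixed_len_size argument. It is not the code in the patch: it uses a plain int64_t as a stand-in for the unscaled decimal value, and the names MinEncodedByteSize and EncodeMinimalDecimal are made up for this example. It assumes the big-endian two's-complement layout described in the Parquet spec, so for negative values the redundant leading bytes are 0xFF sign-extension bytes rather than zero bytes.

```cpp
#include <cassert>
#include <cstdint>

// Hypothetical helper: number of bytes in the shortest big-endian
// two's-complement encoding of v (always at least 1).
inline int MinEncodedByteSize(int64_t v) {
  const uint64_t bits = static_cast<uint64_t>(v);
  int size = sizeof(v);
  while (size > 1) {
    const uint8_t top = static_cast<uint8_t>(bits >> (8 * (size - 1)));
    const uint8_t next = static_cast<uint8_t>(bits >> (8 * (size - 2)));
    // The top byte is redundant only if it merely repeats the sign bit
    // of the byte that follows it.
    const bool redundant_zero = top == 0x00 && (next & 0x80) == 0;
    const bool redundant_ones = top == 0xFF && (next & 0x80) != 0;
    if (!redundant_zero && !redundant_ones) break;
    --size;
  }
  return size;
}

// Hypothetical encoder: writes the minimal encoding of 'unscaled' into
// 'buffer' (caller must provide at least 8 bytes) and returns the number
// of bytes written.
inline int EncodeMinimalDecimal(int64_t unscaled, uint8_t* buffer) {
  assert(buffer != nullptr);
  const uint64_t bits = static_cast<uint64_t>(unscaled);
  const int size = MinEncodedByteSize(unscaled);
  for (int i = 0; i < size; ++i) {
    // Most significant remaining byte first (big-endian).
    buffer[i] = static_cast<uint8_t>(bits >> (8 * (size - 1 - i)));
  }
  return size;
}
```

Note that 128 still needs two bytes (0x00 0x80) because the leading zero byte carries the sign, so "strip all leading zero bytes" is only safe while the next byte has its high bit clear.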