Date: Sun, 29 Jan 2017 22:19:21 +0000
From: "Marcel Kornacker (Code Review)"
To: Lars Volker, impala-cr@cloudera.com, reviews@impala.incubator.apache.org
CC: Zoltan Ivanfi, Mostafa Mokhtar, Michael Brown, Tim Armstrong
Reply-To: marcel@cloudera.com
Subject: [Impala-ASF-CR] IMPALA-3909: Populate min/max statistics in Parquet writer

Marcel Kornacker has posted comments on this change.

Change subject: IMPALA-3909: Populate min/max statistics in Parquet writer
......................................................................


Patch Set 9:

(16 comments)

http://gerrit.cloudera.org:8080/#/c/5611/9/be/src/exec/hdfs-parquet-table-writer.cc
File be/src/exec/hdfs-parquet-table-writer.cc:

Line 134: void EncodeColumnStats(ColumnMetaData* meta_data) {
find a better name. 'column stats' is not a thrift concept. these are specifically row group stats.

Line 236: // Created and set by the derived class.
owner? same for the other pointer members.

Line 339: int64_t encoded_value_size_;
this seems to be the plain encoding size. even for dict-encoded cols?

Line 347: // Tracks statistics per row group. This gets reset when starting a new file.
hopefully when starting a new row group

Line 643: DCHECK(page_stats_base_ != nullptr);
how does this handle unsupported types?
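To make the Line 347 concern concrete, here is a minimal sketch of the lifecycle the comment is asking for: page stats are folded into row group stats when a page is finalized, and row group stats are cleared when a row group (not a file) is finalized. All names here (`MinMaxStats`, `ColumnWriterSketch`, `finalize_page`, `finalize_row_group`) are hypothetical and written in Python only for brevity; this is not the patch's C++ code.

```python
from dataclasses import dataclass
from typing import Optional, Tuple


@dataclass
class MinMaxStats:
    """Hypothetical min/max tracker; not the patch's stats class."""
    min_value: Optional[int] = None
    max_value: Optional[int] = None

    def update(self, value: int) -> None:
        # Widen the [min, max] range to include value.
        if self.min_value is None or value < self.min_value:
            self.min_value = value
        if self.max_value is None or value > self.max_value:
            self.max_value = value

    def merge(self, other: "MinMaxStats") -> None:
        # Fold another tracker's range into this one; empty trackers are ignored.
        if other.min_value is not None:
            self.update(other.min_value)
        if other.max_value is not None:
            self.update(other.max_value)

    def reset(self) -> None:
        self.min_value = None
        self.max_value = None


class ColumnWriterSketch:
    """Hypothetical column writer that keeps page and row group stats apart."""

    def __init__(self) -> None:
        self.page_stats = MinMaxStats()
        self.row_group_stats = MinMaxStats()

    def append(self, value: int) -> None:
        self.page_stats.update(value)

    def finalize_page(self) -> None:
        # Page stats roll up into the row group, then start fresh.
        self.row_group_stats.merge(self.page_stats)
        self.page_stats.reset()

    def finalize_row_group(self) -> Tuple[Optional[int], Optional[int]]:
        # Flush the open page, report the row group range, and reset it here
        # so the next row group does not inherit stale values.
        self.finalize_page()
        result = (self.row_group_stats.min_value, self.row_group_stats.max_value)
        self.row_group_stats.reset()
        return result
```

With the reset inside `finalize_row_group`, a second row group written to the same file reports only its own range.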
Line 1028: columns_[i]->EncodeColumnStats(&current_row_group_->columns[i].meta_data);
where do the row group stats get reset?


http://gerrit.cloudera.org:8080/#/c/5611/9/be/src/exec/hdfs-parquet-table-writer.h
File be/src/exec/hdfs-parquet-table-writer.h:

Line 103: /// Maximum statistics size. If the combined size of the min and max values of
does this refer to a single thrift Statistics struct? if so, spell that out.


http://gerrit.cloudera.org:8080/#/c/5611/9/be/src/exec/parquet-column-stats.h
File be/src/exec/parquet-column-stats.h:

Line 65: void EncodeToThrift(T* parent) const {
this feels more convoluted than it needs to be. i think it would be better for this class only to deal with thrift::Statistics and let the caller make the appropriate __set_xxx call (which means you won't need a templatized function).

Line 88: // We explicitly require types to be listed here in order to support column statistics.
i don't understand, i thought those listed types are specifically not supported. what exactly does this do?

Line 90: // follow the ordering semantics of parquet's min/max statistics for the new type.
what are the ordering semantics? (that order as byte sequence == value order?)

Line 97: T>::type;
i find the formatting hard to decipher. please reformat by hand (for instance, by moving the first is_arithmetic to a new line, which would make the argument grouping clearer).

Line 127: // statistics behavior from any implicit behavior of the types?
but shouldn't the stats reflect the behavior of the underlying types? ie, why should the stats '<' be any different than the '<' of the underlying type?

Line 148: /// Encodes a single value into an output string using parquet's plain encoding.
'an output string' makes it sound like this gets converted into a string type, ie, byte_array in parquet parlance. but plain encoding requires int32, int64, etc., parquet types. you're encoding as 'plain', stored in a binary string. best to make that clear in the comment.
(also, what does 'output' mean here?)

Line 159: return encoded_value_size_ < 0 ? ParquetPlainEncoder::ByteSize(v) :
reformat by hand


http://gerrit.cloudera.org:8080/#/c/5611/9/be/src/exec/parquet-common.h
File be/src/exec/parquet-common.h:

Line 89: static int ByteSize(const T& v) { return sizeof(T); }
does this function make sense at all? why not simply call sizeof()?


http://gerrit.cloudera.org:8080/#/c/5611/9/tests/util/get_parquet_metadata.py
File tests/util/get_parquet_metadata.py:

Line 90: """Decode parquet statistics values that are encoded with PLAIN encoding."""
"that are encoded": do you mean "expects 'value' to be plain encoded"? also, why is this specific to stats (as opposed to any plain-encoded value)?


--
To view, visit http://gerrit.cloudera.org:8080/5611
To unsubscribe, visit http://gerrit.cloudera.org:8080/settings

Gerrit-MessageType: comment
Gerrit-Change-Id: I8368ee58daa50c07a3b8ef65be70203eb941f619
Gerrit-PatchSet: 9
Gerrit-Project: Impala-ASF
Gerrit-Branch: master
Gerrit-Owner: Lars Volker
Gerrit-Reviewer: Lars Volker
Gerrit-Reviewer: Marcel Kornacker
Gerrit-Reviewer: Michael Brown
Gerrit-Reviewer: Mostafa Mokhtar
Gerrit-Reviewer: Tim Armstrong
Gerrit-Reviewer: Zoltan Ivanfi
Gerrit-HasComments: Yes
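As an illustration for the get_parquet_metadata.py comment and the ordering question on parquet-column-stats.h Line 90, here is a small sketch of a generic plain decoder for fixed-width numeric types. It assumes PLAIN stores these values little-endian at their native width. The names `encode_plain`, `decode_plain` and `_PLAIN_FORMATS` are hypothetical, not the functions in the patch, and byte arrays and booleans are deliberately left out.

```python
import struct

# Assumed mapping from Parquet physical type names to struct format codes
# (little-endian, fixed width). Hypothetical helper table, not from the patch.
_PLAIN_FORMATS = {
    "INT32": "<i",
    "INT64": "<q",
    "FLOAT": "<f",
    "DOUBLE": "<d",
}


def encode_plain(physical_type: str, value) -> bytes:
    """Encode one value as PLAIN bytes held in a binary string."""
    return struct.pack(_PLAIN_FORMATS[physical_type], value)


def decode_plain(physical_type: str, data: bytes):
    """Decode one value; expects `data` to be exactly one PLAIN-encoded value."""
    fmt = _PLAIN_FORMATS[physical_type]
    if len(data) != struct.calcsize(fmt):
        raise ValueError("expected %d bytes for %s, got %d"
                         % (struct.calcsize(fmt), physical_type, len(data)))
    return struct.unpack(fmt, data)[0]
```

Nothing in this decoder is specific to statistics. It also suggests the answer to the ordering question is "no" for integers: because the bytes are little-endian, `encode_plain("INT32", 1)` compares greater than `encode_plain("INT32", 256)` as a byte sequence, so min/max have to be compared as typed values rather than as encoded bytes.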