Date: Mon, 10 Apr 2017 15:33:01 +0000
From: "Marcel Kornacker (Code Review)"
To: Lars Volker, impala-cr@cloudera.com, reviews@impala.incubator.apache.org
Reply-To: marcel@cloudera.com
Subject: [Impala-ASF-CR] IMPALA-4817: Populate Parquet Statistics for Strings

Marcel Kornacker has posted comments on this change.

Change subject: IMPALA-4817: Populate Parquet Statistics for Strings
......................................................................

Patch Set 2:

(15 comments)

http://gerrit.cloudera.org:8080/#/c/6563/2/be/src/exec/hdfs-parquet-scanner.cc
File be/src/exec/hdfs-parquet-scanner.cc:

Line 541: const string* thrift_stats = nullptr;
bad variable name: this is a plain-encoded value. thrift_stats sounds like it's a struct (it's definitely not something that requires a plural).

Line 546: thrift_stats = ParquetMetadataUtils::GetThriftStats(
why did you break this up instead of having ReadFromThrift do the extra work? do you need GetThriftStats anywhere else? the old control flow was easier to follow.

Line 556: if (!thrift_stats) continue;
explicit comparison

http://gerrit.cloudera.org:8080/#/c/6563/2/be/src/exec/parquet-column-stats.h
File be/src/exec/parquet-column-stats.h:

Line 31: /// This class, together with its derivatives, is used to track column statistics when
track is really not a meaningful term here, it generally just means 'follow'. revise description.

Line 36: /// We currently support tracking 'min_value' and 'max_value' values for statistics. The
hopefully also min and max, no? tracking means reading/decoding.

Line 46: /// We currently don't write statistics for DECIMAL values and TIMESTAMP values due to
why is that still not the case?

Line 66: const string& thrift_stats, const ColumnType& col_type, void* slot);
you changed the type and meaning of a parameter, but you didn't change the name.

Line 72: /// Creates a copy of the contents of this object. Some data types (e.g. StringValue)
unclear what this does, because copies are usually returned or passed into something.

Line 146: /// This class contains further type-specific behavior that is common only to a subset of
why can't this be collapsed into a 2-level hierarchy?

Line 151: protected:
protected follows public section

http://gerrit.cloudera.org:8080/#/c/6563/2/be/src/exec/parquet-column-stats.inline.h
File be/src/exec/parquet-column-stats.inline.h:

Line 34: inline int64_t TypedColumnStatsBase::BytesNeeded() const {
hard to compare, please move the functions back to where they were. feel free to reorder at the end of the review cycle.

http://gerrit.cloudera.org:8080/#/c/6563/2/be/src/exec/parquet-metadata-utils.cc
File be/src/exec/parquet-metadata-utils.cc:

Line 243: int col_idx, const StatsField& stats_field, const ColumnType& col_type) {
col_idx is unused. instead of passing in both the columnchunk and columntype, why not use col_chunk.meta_data.type?

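A minimal sketch of the Line 243 suggestion, assuming simplified stand-ins for the Thrift-generated structs. `PhysicalType`, `Statistics`, `has_min`, and `GetEncodedMin` are hypothetical names used only for illustration, and the rule that INT96 gets no statistics is likewise an assumption, not something taken from the patch. It shows a lookup that takes only the column chunk and reads the type from `col_chunk.meta_data.type`, so neither `col_idx` nor a separate type argument is passed in.

```cpp
#include <cassert>
#include <string>

// Hypothetical stand-ins for the Thrift-generated Parquet structs. The real
// definitions are generated from parquet.thrift and carry many more fields.
enum class PhysicalType { INT32, BYTE_ARRAY, INT96 };

struct Statistics {
  bool has_min = false;  // stand-in for Thrift's "is set" bookkeeping
  std::string min;       // plain-encoded minimum value
};

struct ColumnMetaData {
  PhysicalType type = PhysicalType::INT32;
  Statistics statistics;
};

struct ColumnChunk {
  ColumnMetaData meta_data;
};

// Returns a pointer to the plain-encoded minimum, or nullptr if no minimum is
// set or the type is not handled. The type comes from the chunk's own
// metadata, so the caller supplies nothing but the chunk.
const std::string* GetEncodedMin(const ColumnChunk& col_chunk) {
  const ColumnMetaData& md = col_chunk.meta_data;
  if (!md.statistics.has_min) return nullptr;
  switch (md.type) {
    case PhysicalType::INT32:
    case PhysicalType::BYTE_ARRAY:
      return &md.statistics.min;
    case PhysicalType::INT96:
      // Assumed policy for this sketch only: no statistics for INT96.
      return nullptr;
  }
  return nullptr;
}
```

A caller would then compare the result against nullptr explicitly, for example `if (encoded_min == nullptr) continue;`.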
http://gerrit.cloudera.org:8080/#/c/6563/2/be/src/exec/parquet-metadata-utils.h
File be/src/exec/parquet-metadata-utils.h:

Line 60: static bool ReadOldStats(const ColumnType& col_type);
this sounds like it's doing some reading. also, 'deprecated' instead of old.

http://gerrit.cloudera.org:8080/#/c/6563/2/common/thrift/parquet.thrift
File common/thrift/parquet.thrift:

Line 344: BROTLI = 4;
do we return an error when we see that codec?

Line 567: /** Union containing the order used for min, max, and sorting values in a column
why isn't this implied by the logical type of that column? if this is simply to mark "legacy" ordered columns, then why not simply have a bool here?

--
To view, visit http://gerrit.cloudera.org:8080/6563
To unsubscribe, visit http://gerrit.cloudera.org:8080/settings

Gerrit-MessageType: comment
Gerrit-Change-Id: I3ef4a5d25a57c82577fd498d6d1c4297ecf39312
Gerrit-PatchSet: 2
Gerrit-Project: Impala-ASF
Gerrit-Branch: master
Gerrit-Owner: Lars Volker
Gerrit-Reviewer: Lars Volker
Gerrit-Reviewer: Marcel Kornacker
Gerrit-HasComments: Yes