Date: Tue, 17 Jan 2017 23:04:42 +0000
From: "Lars Volker (Code Review)"
To: impala-cr@cloudera.com, reviews@impala.incubator.apache.org
CC: Tim Armstrong
Reply-To: lv@cloudera.com
Subject: [Impala-ASF-CR] IMPALA-3909: Populate min/max statistics in Parquet writer
X-Gerrit-MessageType: comment
X-Gerrit-Change-Id: I8368ee58daa50c07a3b8ef65be70203eb941f619
X-Gerrit-Commit: 15ece15ad27f454b2bd51d7dbc10ad33ca267bc4

Lars Volker has posted comments on this change.

Change subject: IMPALA-3909: Populate min/max statistics in Parquet writer
......................................................................


Patch Set 2:

(15 comments)

Thanks for the review, and apologies for the long delay. I ran into several issues when comparing statistics with Hive, until I found PARQUET-251, which seems to cause invalid stats to be written by Hive.

http://gerrit.cloudera.org:8080/#/c/5611/1//COMMIT_MSG
Commit Message:

Line 9: Change-Id: I8368ee58daa50c07a3b8ef65be70203eb941f619
> Do we have end-to-end testing that the column stats are correct? We really
There is a simple test in test_insert_parquet.py, but it only checks row group statistics, not page statistics. I think it will be easier to write more extensive tests once the read path has been implemented.


http://gerrit.cloudera.org:8080/#/c/5611/1/be/src/exec/hdfs-parquet-table-writer.cc
File be/src/exec/hdfs-parquet-table-writer.cc:

Line 90: BaseColumnWriter(HdfsParquetTableWriter* parent, ExprContext* expr_ctx,
> I feel like this file is getting pretty large. Can we pull these classes ou
Done


Line 97: num_values_(0),
> I think we call this Merge() in other places like aggregate functions.
Done


Line 101: def_levels_(NULL),
> I think the memory management here needs a bit more explanation. I.e. when
Done


Line 206: // levels data and the encoded values. If compression is enabled, this is the
> Do we know what ordering the parquet specification defines for each type? T
I added a comment to the class header. To me it looks safe to rely on the overloaded operators. Should we try to explicitly stamp out the template class only for the supported types, to prevent new types from being added without manual intervention? Are there any particular classes that you think may break in the future?


Line 224: scoped_ptr compressor_;
> I guess this is the explanation of Copy() I was looking for. I think we nee
Done


Line 288: new ColumnStats(parent_->per_file_mem_pool_.get(), encoded_value_size_));
> This can potentially use a lot of memory since the old strings are never fr
Done


PS1, Line 349: value to h
> Maybe this shouldn't be CopyValues, rather CopyMemory? or CopyReferencedMem
Done


Line 485:
> Can't we initialise most of these using the initialiser list? We should do
I moved the initialization to Reset() to pass the correct memory pool, similar to dict_encoder_.


Line 598:
> Can't we initialise most of these using the initialiser list? We should do
I made the *_stats_ inline members, so their default initialization is sufficient. The *_stats_base members are stored in the base class. I'm reluctant to move them into the c'tor, since for all other column writers they will not be initialized at construction time, but only when Reset() is called.


Line 628: // TODO: copy repetition data when we support nested types.
> Do we need scoped_ptr? Can we just store these inline?
Done


http://gerrit.cloudera.org:8080/#/c/5611/1/tests/query_test/test_insert_parquet.py
File tests/query_test/test_insert_parquet.py:

Line 214:
> Extra line
Done


Line 281: 'hive_skip_col_idx' are excluded from the verification of the expected values.
> I think we need to find a way to test timestamp.
Done


Line 293:
> We should also test the types that aren't in alltypes - char, varchar, deci
Done


Line 299: (qualified_table_name, source_table, num_lines)
> Does this exercise the string copying code? If the table is too small it se
I added a test using tpch_parquet.customer and manually verified that the resulting .parq file has multiple pages per column, thus exercising the string copy code. In hindsight it should already be sufficient to use a file with multiple row batches, which functional.alltypes (7300 rows) should provide. I still think it makes sense to have a test with multiple pages per column, so I'd opt to keep the new one, too.
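
For context on the Merge()/Reset() discussion above, here is a rough sketch of the min/max tracking pattern being described: Update() per value, Merge() to fold page-level stats into row-group-level stats, Reset() between pages. This is illustrative only; the class name, members, and comments are assumptions for the sketch, not the actual ColumnStats implementation in hdfs-parquet-table-writer.cc.

  // Illustrative sketch only -- not the actual Impala implementation.
  // Tracks min/max for one page and merges page stats into row group stats.
  #include <algorithm>

  template <typename T>
  class ColumnStatsSketch {
   public:
    // Record one value. Relies on operator< being defined for T, which is
    // the "overloaded operators" assumption mentioned in the comments above.
    void Update(const T& v) {
      if (!has_values_) {
        min_ = v;
        max_ = v;
        has_values_ = true;
      } else {
        min_ = std::min(min_, v);
        max_ = std::max(max_, v);
      }
    }

    // Fold another stats object into this one, e.g. page stats into the
    // row group stats (cf. the Merge() naming discussion above).
    void Merge(const ColumnStatsSketch& other) {
      if (!other.has_values_) return;
      Update(other.min_);
      Update(other.max_);
    }

    // Clear state before starting the next page.
    void Reset() { has_values_ = false; }

   private:
    bool has_values_ = false;
    T min_;
    T max_;
  };

  // Hypothetical usage: per-page stats merged into per-row-group stats.
  //   ColumnStatsSketch<int64_t> page_stats, row_group_stats;
  //   page_stats.Update(42); page_stats.Update(-7);
  //   row_group_stats.Merge(page_stats);
  //   page_stats.Reset();  // start the next page

For string columns, the min/max values would additionally need to be deep-copied into memory that outlives the current row batch (the per_file_mem_pool_ and CopyValues/CopyMemory discussion above), since the batch memory is recycled before the row group is flushed.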
--
To view, visit http://gerrit.cloudera.org:8080/5611
To unsubscribe, visit http://gerrit.cloudera.org:8080/settings

Gerrit-MessageType: comment
Gerrit-Change-Id: I8368ee58daa50c07a3b8ef65be70203eb941f619
Gerrit-PatchSet: 2
Gerrit-Project: Impala-ASF
Gerrit-Branch: master
Gerrit-Owner: Lars Volker
Gerrit-Reviewer: Lars Volker
Gerrit-Reviewer: Tim Armstrong
Gerrit-HasComments: Yes