Date: Tue, 19 Jul 2016 22:43:33 +0000
From: "Matthew Jacobs (Code Review)"
To: Thomas Tauber-Marshall, impala-cr@cloudera.com, dev@impala.incubator.apache.org
CC: Tim Armstrong, Lars Volker
Reply-To: mj@cloudera.com
Subject: [Impala-CR](cdh5-trunk) IMPALA-3376: Extra definition level when writing Parquet files
X-Gerrit-MessageType: comment
X-Gerrit-Change-Id: I2cafd7ef6b607ce6f815072b8af7395a892704d9
X-Gerrit-Commit: 2e1244434a063b4c6226e74d67939faacca939d5

Matthew Jacobs has posted comments on this change.

Change subject: IMPALA-3376: Extra definition level when writing Parquet files
......................................................................


Patch Set 5:

(8 comments)

http://gerrit.cloudera.org:8080/#/c/3556/5/be/src/exec/hdfs-parquet-table-writer.cc
File be/src/exec/hdfs-parquet-table-writer.cc:

PS5, Line 381: Encoding may fail for several reasons - because the current page is not big enough,
             : // because we've encoded the maximum number of unique dictionary values and need to
             : // switch to plain encoding, etc. so we may need to try again more than once.
I haven't spent a ton of time looking through all the table-writer code, so this could be a non-issue, but I'm a bit worried that a subtle bug in EncodeValue/FinalizeCurrentPage/NewPage could lead to infinite loops here, perhaps in corner cases with weird data. Is there a clear set of state transitions?
This relies on EncodeValue() behaving properly, and it is hard to read this code and understand why it is _obviously correct_. I don't think your code increases the risk of issues, but it's worth thinking about any DCHECKs that could help.


http://gerrit.cloudera.org:8080/#/c/3556/5/be/src/util/parquet-reader.cc
File be/src/util/parquet-reader.cc:

PS5, Line 133: We i
Remove "We".

PS5, Line 146: with our RLE scheme it is not possible to determine how many values
             : // were actually written if the final run is a literal run, only if the final run is
             : // a repeated run.
Why can't we determine how many values were written in a literal run?

PS5, Line 149: CheckDataPage
I think the decompression is getting confusing with the memory management. How about splitting the decompression out into a separate fn that takes both the compressed data buffer and a buffer already allocated by the caller (which should be of size header.uncompressed_page_size)? Then the fn that actually does the work of checking a data page can just take a const uint8_t* to uncompressed data.

PS5, Line 149: uint8_t* data
Please have the comment mention that data is decompressed if the header indicates it is compressed, and that this is an in/out parameter that will return the uncompressed data.

PS5, Line 150: std::vector decompressed_buffer;
Why is this stack allocated? Won't it go out of scope when this fn returns, even though you return the pointer?

PS5, Line 171: *reinterpret_cast(data);
Can you add one sentence about the data layout, or point to somewhere that does?
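
To make the Line 149 suggestion concrete, here is a rough sketch of the split I have in mind. Everything in it is made up for illustration and is not the actual parquet-reader.cc code: PageHeaderSketch stands in for the real page header, DecompressPage uses a toy (count, byte) codec in place of the real decompressor, and CheckDataPage just counts nonzero bytes as a placeholder for the real checks. The point is only the ownership pattern: the caller owns the output buffer, so no pointer to a local buffer ever escapes.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <cstring>
#include <vector>

// Hypothetical stand-in for the page header; only the fields used below.
struct PageHeaderSketch {
  int32_t compressed_page_size;
  int32_t uncompressed_page_size;
  bool is_compressed;
};

// Decompresses into a buffer that the caller has already allocated.
// Toy codec (an assumption, not a real one): input is (count, byte) pairs.
// Returns false if the input is malformed or does not fill 'out' exactly.
bool DecompressPage(const uint8_t* compressed, size_t compressed_len,
                    uint8_t* out, size_t out_len) {
  if (compressed_len % 2 != 0) return false;
  size_t pos = 0;
  for (size_t i = 0; i < compressed_len; i += 2) {
    size_t count = compressed[i];
    if (count > out_len - pos) return false;  // would overflow 'out'
    std::memset(out + pos, compressed[i + 1], count);
    pos += count;
  }
  return pos == out_len;
}

// Only ever sees uncompressed bytes and never allocates or owns memory.
// Placeholder check: returns the number of nonzero bytes in the page.
size_t CheckDataPage(const uint8_t* data, size_t len) {
  size_t nonzero = 0;
  for (size_t i = 0; i < len; ++i) {
    if (data[i] != 0) ++nonzero;
  }
  return nonzero;
}

// The caller owns the decompression buffer, so it stays alive for the
// whole time CheckDataPage reads from it.
size_t ProcessPage(const PageHeaderSketch& header, const uint8_t* page,
                   bool* ok) {
  std::vector<uint8_t> decompressed;
  const uint8_t* data = page;
  size_t len = static_cast<size_t>(header.compressed_page_size);
  *ok = true;
  if (header.is_compressed) {
    decompressed.resize(static_cast<size_t>(header.uncompressed_page_size));
    if (!DecompressPage(page, len, decompressed.data(), decompressed.size())) {
      *ok = false;
      return 0;
    }
    data = decompressed.data();
    len = decompressed.size();
  }
  return CheckDataPage(data, len);
}
```

With this shape, the Line 150 lifetime question goes away, because the vector lives in the same scope as every use of the pointer into it, and the Line 149 in/out parameter is no longer needed.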

PS5, Line 174:
nit: extra space


--
To view, visit http://gerrit.cloudera.org:8080/3556
To unsubscribe, visit http://gerrit.cloudera.org:8080/settings

Gerrit-MessageType: comment
Gerrit-Change-Id: I2cafd7ef6b607ce6f815072b8af7395a892704d9
Gerrit-PatchSet: 5
Gerrit-Project: Impala
Gerrit-Branch: cdh5-trunk
Gerrit-Owner: Thomas Tauber-Marshall
Gerrit-Reviewer: Lars Volker
Gerrit-Reviewer: Matthew Jacobs
Gerrit-Reviewer: Thomas Tauber-Marshall
Gerrit-Reviewer: Tim Armstrong
Gerrit-HasComments: Yes