impala-reviews mailing list archives

Site index · List index
Message view « Date » · « Thread »
Top « Date » · « Thread »
From "Lars Volker (Code Review)" <ger...@cloudera.org>
Subject [Impala-ASF-CR] IMPALA-5061: Populate null count in parquet::statistics in the parquet table writer
Date Fri, 02 Jun 2017 21:48:47 GMT
Lars Volker has posted comments on this change.

Change subject: IMPALA-5061: Populate null_count in parquet::statistics in the parquet table
writer
......................................................................


Patch Set 2:

(12 comments)

Thank you for working on this!

http://gerrit.cloudera.org:8080/#/c/7058/2//COMMIT_MSG
Commit Message:

Line 7: IMPALA-5061: Populate null_count in parquet::statistics in the parquet table writer
Nit: Make the first line of the commit message fit in 80 chars. You can end it after "statistics"


http://gerrit.cloudera.org:8080/#/c/7058/2/be/src/exec/hdfs-parquet-scanner.cc
File be/src/exec/hdfs-parquet-scanner.cc:

Line 553:       // We need to get min_value.
Let's only do the write path in this change and then address the read path in a subsequent
change.


http://gerrit.cloudera.org:8080/#/c/7058/2/be/src/exec/hdfs-parquet-table-writer.cc
File be/src/exec/hdfs-parquet-table-writer.cc:

PS2, Line 198: {}
Do you need the default implementation here? Without it, you could also remove the "Implemented
in.." comment, since then it'd be clear.


http://gerrit.cloudera.org:8080/#/c/7058/2/be/src/exec/parquet-column-stats.h
File be/src/exec/parquet-column-stats.h:

Line 79:   static bool ReadCountFromThrift(
I think it will be easier to return an int64 here. However we should do the read path in a
later change altogether.


Line 115:   bool has_null_count_;
Why do we need has_null_count_ instead of just always counting them? Are there types for which
we don't know the number of nulls we wrote?


Line 170:   void UpdateNullCount();
Can this go into the base class? It doesn't depend on T.


http://gerrit.cloudera.org:8080/#/c/7058/2/be/src/exec/parquet-column-stats.inline.h
File be/src/exec/parquet-column-stats.inline.h:

Line 50:   null_count_++;
nit: we use prefix operators, mostly for consistency, but sometimes for performance reasons.


http://gerrit.cloudera.org:8080/#/c/7058/2/tests/query_test/test_insert_parquet.py
File tests/query_test/test_insert_parquet.py:

Line 323:       if ((stats.min_value is None and stats.max_value is None) 
With the additional checks below, is this line still needed? What is its intent?

Please remove the trailing whitespace


Line 328:       min_value = max_value = null_count = None
You should have one assignment per line


Line 427:         ColumnStats('id', 0, 7299, None),
Why are these None and not 0? I think we should count 0s even if there are none.


PS2, Line 607: 	
Replace tabs with spaces, here and elsewhere


Line 620: 									"functional_parquet.zipcode_incomes", expected_min_max_values)
The formatting here seems off


-- 
To view, visit http://gerrit.cloudera.org:8080/7058
To unsubscribe, visit http://gerrit.cloudera.org:8080/settings

Gerrit-MessageType: comment
Gerrit-Change-Id: I4c49a63af84c2234f0633be63206cb52eb7e8ebb
Gerrit-PatchSet: 2
Gerrit-Project: Impala-ASF
Gerrit-Branch: master
Gerrit-Owner: Pooja Nilangekar <pooja.nilangekar@cloudera.com>
Gerrit-Reviewer: Lars Volker <lv@cloudera.com>
Gerrit-HasComments: Yes

Mime
View raw message