impala-reviews mailing list archives

Site index · List index
Message view « Date » · « Thread »
Top « Date » · « Thread »
From "Lars Volker (Code Review)" <ger...@cloudera.org>
Subject [Impala-ASF-CR] IMPALA-5210: Count rows and collection items in parquet scanner separately
Date Thu, 24 Aug 2017 21:51:59 GMT
Lars Volker has posted comments on this change.

Change subject: IMPALA-5210: Count rows and collection items in parquet scanner separately
......................................................................


Patch Set 2:

(8 comments)

http://gerrit.cloudera.org:8080/#/c/7776/2/be/src/exec/hdfs-parquet-scanner.cc
File be/src/exec/hdfs-parquet-scanner.cc:

Line 964:   auto update_read_count = MakeScopeExitTrigger([&] {
> Done. If RETURN_IF_ERROR or num_tuples_mismatch branch is taken the value o
Yeah, it changes the semantics to something like "Number of successfully committed items/rows"
but I think that's ok, given it keeps the code simpler.


PS2, Line 978: continue_execution
> Reverted to original code. But is it good practice to add a dead initializa
You're right, I think it's better to move it into the loop here. If someone tries to use it
without initializing it, the compiler would issue a warning, too. Apologies for the back and
forth.


http://gerrit.cloudera.org:8080/#/c/7776/3/be/src/exec/hdfs-parquet-scanner.cc
File be/src/exec/hdfs-parquet-scanner.cc:

Line 1008:     num_rows_read += scratch_batch_->num_tuples;
You could use COUNTER_SET here and get rid of num_rows_read altogether.


PS3, Line 1009:   int num_r
How about removing the initialization at the beginning of the method and using COUNTER_SET
here instead? This way it is also more in line with the variable comment in the .h file.


http://gerrit.cloudera.org:8080/#/c/7776/3/be/src/exec/hdfs-parquet-scanner.h
File be/src/exec/hdfs-parquet-scanner.h:

Line 470:   /// Number of scanners that end up doing no reads because their splits don't overlap
Can you describe the scope of this in the comment, e.g. "Total number of collection items
read by this scanner."

If you switch the scope to the scanner's lifetime you should also initialize it.


PS3, Line 551: d of the collection has 
coll_items_read_counter_.


http://gerrit.cloudera.org:8080/#/c/7776/2/be/src/exec/parquet-column-readers.cc
File be/src/exec/parquet-column-readers.cc:

PS2, Line 242: /*num_coll_item_read*/
> The parameter doesn't exist now. Can we add that to the style guide wiki pa
Sure, feel free to add it here: https://cwiki.apache.org/confluence/pages/viewpage.action?pageId=65868536

I think it could also pass as "consistency with the surrounding code", though that's harder
to spot.


http://gerrit.cloudera.org:8080/#/c/7776/3/be/src/exec/scan-node.h
File be/src/exec/scan-node.h:

Line 153:   /// # collection items read from the scanner
Can you explain here that this is across nested collections, so that A = {B, C} would actually
result in three items being read?


-- 
To view, visit http://gerrit.cloudera.org:8080/7776
To unsubscribe, visit http://gerrit.cloudera.org:8080/settings

Gerrit-MessageType: comment
Gerrit-Change-Id: I7f6efddaea18507482940f5bdab7326b6482b067
Gerrit-PatchSet: 2
Gerrit-Project: Impala-ASF
Gerrit-Branch: master
Gerrit-Owner: Tianyi Wang <twang@cloudera.com>
Gerrit-Reviewer: Lars Volker <lv@cloudera.com>
Gerrit-Reviewer: Tianyi Wang <twang@cloudera.com>
Gerrit-HasComments: Yes

Mime
View raw message