impala-dev mailing list archives

Site index · List index
Message view « Date » · « Thread »
Top « Date » · « Thread »
From "Lars Volker (Code Review)" <ger...@cloudera.org>
Subject [Impala-CR](cdh5-trunk) IMPALA-3376: Extra definition level when writing Parquet files
Date Mon, 04 Jul 2016 12:43:31 GMT
Lars Volker has posted comments on this change.

Change subject: IMPALA-3376: Extra definition level when writing Parquet files
......................................................................


Patch Set 1:

(19 comments)

http://gerrit.cloudera.org:8080/#/c/3556/1/be/src/exec/hdfs-parquet-table-writer.cc
File be/src/exec/hdfs-parquet-table-writer.cc:

PS1, Line 382: true
can this run more than twice? Is there a way to write it without a loop? It looks like only
the first 4 lines would be repeated.


Line 395:         page_size_ = bytes_needed;
shouldn't this line be below the check? Otherwise you change page_size_ (which looks like
a side effect given it's a class member) and then decide to abort.


Line 412:   DCHECK(def_levels_->Put(value != NULL));
DCHECK is a no-op in release builds, so this would not get executed. That's why it was bool
ret = Put(); DCHECK(ret);.


http://gerrit.cloudera.org:8080/#/c/3556/1/be/src/util/parquet-reader.cc
File be/src/util/parquet-reader.cc:

Line 132: // We inherit from RleDecoder to get access to repeat_count_, which
single line?


Line 139:   uint32_t repeat_count() { return repeat_count_; }
make const


Line 143: //   - Compressed pages can be uncompressed.
What does this mean? Would it be an error or is it allowed?


PS1, Line 145: Note that this will not catch every instance of Impala writing the wrong number
of
             : //     def levels - with our RLE scheme it is not possible to determine how
many values
             : //     were actually written if the final run is a literal run, only if the
final run is
             : //     a repeated run.
Is there a way to fix this?


PS1, Line 149: PageHeader header
const PageHeader& ? If you need the copy, making it explicitly in the code could be more
clear.


PS1, Line 154: malloc
I cannot see where this gets free'd. Does it leak?


PS1, Line 175: i++
nit: ++i


PS1, Line 181: ++count
Isn't count == i?


http://gerrit.cloudera.org:8080/#/c/3556/1/be/src/util/rle-encoding.h
File be/src/util/rle-encoding.h:

Line 165:   bool BufferFull() { return buffer_full_; }
Usually we seem to name these simple getters like the member, i.e. "buffer_full()", also it
can be const.


http://gerrit.cloudera.org:8080/#/c/3556/1/tests/query_test/test_writers.py
File tests/query_test/test_writers.py:

Line 1: # Copyright (c) 2016 Cloudera, Inc. All rights reserved.
Wouldn't this need to be the ASF header?


Line 22:   def test_hdfs_parquet_table_writer(self, vector):
this test might need an annotation to not be run in parallel (@pytest.mark.execute_serially),
or multiple parallel runs would collide on the same table name.


PS1, Line 24: """
I think these should be in an own line for multiline docstrings.


PS1, Line 30: copyToLocal
Where does this copy to? If it's the cwd you might want to create a temporary folder instead.


PS1, Line 34: impalad_basedir + '/util/parquet-reader'
use os.path.join()


PS1, Line 35: 'hdfs_parquet_table_writer/' + str(f)
os.path.join()


Line 39:       check_call(['rm', '-r', 'hdfs_parquet_table_writer'])
use shutils.rmtree


-- 
To view, visit http://gerrit.cloudera.org:8080/3556
To unsubscribe, visit http://gerrit.cloudera.org:8080/settings

Gerrit-MessageType: comment
Gerrit-Change-Id: I2cafd7ef6b607ce6f815072b8af7395a892704d9
Gerrit-PatchSet: 1
Gerrit-Project: Impala
Gerrit-Branch: cdh5-trunk
Gerrit-Owner: Thomas Tauber-Marshall <tmarshall@cloudera.com>
Gerrit-Reviewer: Lars Volker <lv@cloudera.com>
Gerrit-HasComments: Yes

Mime
View raw message