impala-dev mailing list archives

Site index · List index
Message view « Date » · « Thread »
Top « Date » · « Thread »
From "Lars Volker (Code Review)" <>
Subject [Impala-CR](cdh5-trunk) IMPALA-3376: Extra definition level when writing Parquet files
Date Mon, 04 Jul 2016 12:43:31 GMT
Lars Volker has posted comments on this change.

Change subject: IMPALA-3376: Extra definition level when writing Parquet files

Patch Set 1:

File be/src/exec/

PS1, Line 382: true
can this run more than twice? Is there a way to write it without a loop? It looks like only
the first 4 lines would be repeated.

Line 395:         page_size_ = bytes_needed;
shouldn't this line be below the check? Otherwise you change page_size_ (which looks like
a side effect given it's a class member) and then decide to abort.

Line 412:   DCHECK(def_levels_->Put(value != NULL));
DCHECK is a no-op in release builds, so this would not get executed. That's why it was bool
ret = Put(); DCHECK(ret);.
File be/src/util/

Line 132: // We inherit from RleDecoder to get access to repeat_count_, which
single line?

Line 139:   uint32_t repeat_count() { return repeat_count_; }
make const

Line 143: //   - Compressed pages can be uncompressed.
What does this mean? Would it be an error or is it allowed?

PS1, Line 145: Note that this will not catch every instance of Impala writing the wrong number
             : //     def levels - with our RLE scheme it is not possible to determine how
many values
             : //     were actually written if the final run is a literal run, only if the
final run is
             : //     a repeated run.
Is there a way to fix this?

PS1, Line 149: PageHeader header
const PageHeader& ? If you need the copy, making it explicitly in the code could be more

PS1, Line 154: malloc
I cannot see where this gets free'd. Does it leak?

PS1, Line 175: i++
nit: ++i

PS1, Line 181: ++count
Isn't count == i?
File be/src/util/rle-encoding.h:

Line 165:   bool BufferFull() { return buffer_full_; }
Usually we seem to name these simple getters like the member, i.e. "buffer_full()", also it
can be const.
File tests/query_test/

Line 1: # Copyright (c) 2016 Cloudera, Inc. All rights reserved.
Wouldn't this need to be the ASF header?

Line 22:   def test_hdfs_parquet_table_writer(self, vector):
this test might need an annotation to not be run in parallel (@pytest.mark.execute_serially),
or multiple parallel runs would collide on the same table name.

PS1, Line 24: """
I think these should be in an own line for multiline docstrings.

PS1, Line 30: copyToLocal
Where does this copy to? If it's the cwd you might want to create a temporary folder instead.

PS1, Line 34: impalad_basedir + '/util/parquet-reader'
use os.path.join()

PS1, Line 35: 'hdfs_parquet_table_writer/' + str(f)

Line 39:       check_call(['rm', '-r', 'hdfs_parquet_table_writer'])
use shutils.rmtree

To view, visit
To unsubscribe, visit

Gerrit-MessageType: comment
Gerrit-Change-Id: I2cafd7ef6b607ce6f815072b8af7395a892704d9
Gerrit-PatchSet: 1
Gerrit-Project: Impala
Gerrit-Branch: cdh5-trunk
Gerrit-Owner: Thomas Tauber-Marshall <>
Gerrit-Reviewer: Lars Volker <>
Gerrit-HasComments: Yes

View raw message