impala-dev mailing list archives

Site index · List index
Message view « Date » · « Thread »
Top « Date » · « Thread »
From "Thomas Tauber-Marshall (Code Review)" <>
Subject [Impala-CR](cdh5-trunk) IMPALA-3376: Extra definition level when writing Parquet files
Date Thu, 07 Jul 2016 22:54:42 GMT
Thomas Tauber-Marshall has posted comments on this change.

Change subject: IMPALA-3376: Extra definition level when writing Parquet files

Patch Set 1:

File be/src/exec/

PS1, Line 382: true
> can this run more than twice? Is there a way to write it without a loop? It
Yes, there are multiple independent reasons why EncodeValue may fail and (though highly unlikely)
its possible the loop will run more than once. I've updated the comment to reflect this.

Line 395:         page_size_ = bytes_needed;
> shouldn't this line be below the check? Otherwise you change page_size_ (wh
Good point. This isn't really new code, this whole block was just copy and pasted for this
review, so its interesting this wasn't a problem before.

I don't see any reason why it shouldn't be safe to make this change, though, so I'll do it.

Line 412:   DCHECK(def_levels_->Put(value != NULL));
> DCHECK is a no-op in release builds, so this would not get executed. That's
File be/src/util/

Line 132: // We inherit from RleDecoder to get access to repeat_count_, which
> single line?

Line 139:   uint32_t repeat_count() { return repeat_count_; }
> make const

Line 143: //   - Compressed pages can be uncompressed.
> What does this mean? Would it be an error or is it allowed?

PS1, Line 145: Note that this will not catch every instance of Impala writing the wrong number
             : //     def levels - with our RLE scheme it is not possible to determine how
many values
             : //     were actually written if the final run is a literal run, only if the
final run is
             : //     a repeated run.
> Is there a way to fix this?
Not without reducing the efficiency of our rle encoder (this behavior has to do with byte
alignment), which doesn't seem worth it just to allow for this testing.

PS1, Line 149: PageHeader header
> const PageHeader& ? If you need the copy, making it explicitly in the code 

PS1, Line 154: malloc
> I cannot see where this gets free'd. Does it leak?

PS1, Line 175: i++
> nit: ++i

PS1, Line 181: ++count
> Isn't count == i?
File be/src/util/rle-encoding.h:

Line 165:   bool BufferFull() { return buffer_full_; }
> Usually we seem to name these simple getters like the member, i.e. "buffer_
File tests/query_test/

Line 1: # Copyright (c) 2016 Cloudera, Inc. All rights reserved.
> Wouldn't this need to be the ASF header?

Line 22:   def test_hdfs_parquet_table_writer(self, vector):
> this test might need an annotation to not be run in parallel (@pytest.mark.
We want to be able to run in parallel, so I'll update it to use unique_database

PS1, Line 24: """
> I think these should be in an own line for multiline docstrings.
We're inconsistent about this - a quick grep through the python tests looks like we put it
on its own line less than half of the time.

I don't have an opinion either way, but it would be nice if we had a python style guide to

PS1, Line 30: copyToLocal
> Where does this copy to? If it's the cwd you might want to create a tempora

PS1, Line 34: impalad_basedir + '/util/parquet-reader'
> use os.path.join()

PS1, Line 35: 'hdfs_parquet_table_writer/' + str(f)
> os.path.join()

Line 39:       check_call(['rm', '-r', 'hdfs_parquet_table_writer'])
> use shutils.rmtree

To view, visit
To unsubscribe, visit

Gerrit-MessageType: comment
Gerrit-Change-Id: I2cafd7ef6b607ce6f815072b8af7395a892704d9
Gerrit-PatchSet: 1
Gerrit-Project: Impala
Gerrit-Branch: cdh5-trunk
Gerrit-Owner: Thomas Tauber-Marshall <>
Gerrit-Reviewer: Lars Volker <>
Gerrit-Reviewer: Thomas Tauber-Marshall <>
Gerrit-HasComments: Yes

View raw message