impala-dev mailing list archives

Site index · List index
Message view « Date » · « Thread »
Top « Date » · « Thread »
From "Thomas Tauber-Marshall (Code Review)" <ger...@cloudera.org>
Subject [Impala-CR](cdh5-trunk) IMPALA-3376: Extra definition level when writing Parquet files
Date Thu, 07 Jul 2016 22:54:42 GMT
Thomas Tauber-Marshall has posted comments on this change.

Change subject: IMPALA-3376: Extra definition level when writing Parquet files
......................................................................


Patch Set 1:

(19 comments)

http://gerrit.cloudera.org:8080/#/c/3556/1/be/src/exec/hdfs-parquet-table-writer.cc
File be/src/exec/hdfs-parquet-table-writer.cc:

PS1, Line 382: true
> can this run more than twice? Is there a way to write it without a loop? It
Yes, there are multiple independent reasons why EncodeValue may fail and (though highly unlikely)
its possible the loop will run more than once. I've updated the comment to reflect this.


Line 395:         page_size_ = bytes_needed;
> shouldn't this line be below the check? Otherwise you change page_size_ (wh
Good point. This isn't really new code, this whole block was just copy and pasted for this
review, so its interesting this wasn't a problem before.

I don't see any reason why it shouldn't be safe to make this change, though, so I'll do it.


Line 412:   DCHECK(def_levels_->Put(value != NULL));
> DCHECK is a no-op in release builds, so this would not get executed. That's
Done


http://gerrit.cloudera.org:8080/#/c/3556/1/be/src/util/parquet-reader.cc
File be/src/util/parquet-reader.cc:

Line 132: // We inherit from RleDecoder to get access to repeat_count_, which
> single line?
Done


Line 139:   uint32_t repeat_count() { return repeat_count_; }
> make const
Done


Line 143: //   - Compressed pages can be uncompressed.
> What does this mean? Would it be an error or is it allowed?
Done


PS1, Line 145: Note that this will not catch every instance of Impala writing the wrong number
of
             : //     def levels - with our RLE scheme it is not possible to determine how
many values
             : //     were actually written if the final run is a literal run, only if the
final run is
             : //     a repeated run.
> Is there a way to fix this?
Not without reducing the efficiency of our rle encoder (this behavior has to do with byte
alignment), which doesn't seem worth it just to allow for this testing.


PS1, Line 149: PageHeader header
> const PageHeader& ? If you need the copy, making it explicitly in the code 
Done


PS1, Line 154: malloc
> I cannot see where this gets free'd. Does it leak?
Done


PS1, Line 175: i++
> nit: ++i
Done


PS1, Line 181: ++count
> Isn't count == i?
Done


http://gerrit.cloudera.org:8080/#/c/3556/1/be/src/util/rle-encoding.h
File be/src/util/rle-encoding.h:

Line 165:   bool BufferFull() { return buffer_full_; }
> Usually we seem to name these simple getters like the member, i.e. "buffer_
Done


http://gerrit.cloudera.org:8080/#/c/3556/1/tests/query_test/test_writers.py
File tests/query_test/test_writers.py:

Line 1: # Copyright (c) 2016 Cloudera, Inc. All rights reserved.
> Wouldn't this need to be the ASF header?
Done


Line 22:   def test_hdfs_parquet_table_writer(self, vector):
> this test might need an annotation to not be run in parallel (@pytest.mark.
We want to be able to run in parallel, so I'll update it to use unique_database


PS1, Line 24: """
> I think these should be in an own line for multiline docstrings.
We're inconsistent about this - a quick grep through the python tests looks like we put it
on its own line less than half of the time.

I don't have an opinion either way, but it would be nice if we had a python style guide to
follow.


PS1, Line 30: copyToLocal
> Where does this copy to? If it's the cwd you might want to create a tempora
Done


PS1, Line 34: impalad_basedir + '/util/parquet-reader'
> use os.path.join()
Done


PS1, Line 35: 'hdfs_parquet_table_writer/' + str(f)
> os.path.join()
Done


Line 39:       check_call(['rm', '-r', 'hdfs_parquet_table_writer'])
> use shutils.rmtree
Done


-- 
To view, visit http://gerrit.cloudera.org:8080/3556
To unsubscribe, visit http://gerrit.cloudera.org:8080/settings

Gerrit-MessageType: comment
Gerrit-Change-Id: I2cafd7ef6b607ce6f815072b8af7395a892704d9
Gerrit-PatchSet: 1
Gerrit-Project: Impala
Gerrit-Branch: cdh5-trunk
Gerrit-Owner: Thomas Tauber-Marshall <tmarshall@cloudera.com>
Gerrit-Reviewer: Lars Volker <lv@cloudera.com>
Gerrit-Reviewer: Thomas Tauber-Marshall <tmarshall@cloudera.com>
Gerrit-HasComments: Yes

Mime
View raw message