impala-dev mailing list archives

Site index · List index
Message view « Date » · « Thread »
Top « Date » · « Thread »
From "Skye Wanderman-Milne (Code Review)" <ger...@cloudera.org>
Subject [Impala-CR](cdh5-trunk) IMPALA-3441: check for malformed Avro data
Date Thu, 02 Jun 2016 00:55:35 GMT
Skye Wanderman-Milne has posted comments on this change.

Change subject: IMPALA-3441: check for malformed Avro data
......................................................................


Patch Set 5:

(24 comments)

Dan, we could probably use more codegen testing, but by inspection of the code we are testing
the relevant paths. I.e. we don't test failure of every type with codegen'd code, but most
of the codegen'd paths for different types are the same and we test all of them. 

Tim, I've addressed most of your comments, but haven't tried two of your suggestions in hdfs-avro-scanner.cc
(passing data_end instead of data_len and branch weight metadata to the codegen'd code). I'll
work on those next, but I'll post what I have so you can review the other changes in parallel.


Similarly, I'm not sure where we spend the extra time yet, but I'll investigate. I'll have
to do it by removing code since all the changes are in codegen'd code, so I can't profile
normally. ReadZLong would be my guess too.

http://gerrit.cloudera.org:8080/#/c/3072/5/be/src/exec/hdfs-avro-scanner-ir.cc
File be/src/exec/hdfs-avro-scanner-ir.cc:

Line 46:  uint8_t** data
> It may be worth adding a __restrict__ to some of these pointer args. Otherw
It doesn't seem to have an effect, probably because everything's inlined with codegen (or
I did something wrong, I just added __restrict__ to every data and data_len pointer in this
file, lemme know if I should do something else).


Line 60:   if (union_position == 2) union_position = 1;
> Could implement this as a bitshift 1 bit to the right. It's possible the co
Also didn't have an effect on my benchmark, hopefully the compiler is doing this.


Line 69:   if (*data_len == 0) {
> Should probably add UNLIKELY() annotations to these error conditions. Not s
Done


Line 133: 4
> sizeof(float)? or a named constant
Done


Line 146:   }
> sizeof(float)?
Done


Line 154:   if (*data_len < 8) {
> sizeof(double)? or a named constant
Done


Line 161:   }
> sizeof(double)?
Done


Line 170:   if (!ReadWriteUtil::ReadZLong(data, data_len, &len)) {
> Can we factor out the string checks into a separate function? ReadAvroVarCh
Done


http://gerrit.cloudera.org:8080/#/c/3072/5/be/src/exec/hdfs-avro-scanner-test.cc
File be/src/exec/hdfs-avro-scanner-test.cc:

Line 24: // TODO: CHAR, VARCHAR
> Maybe create a JIRA and elaborate on what coverage is missing if it's not g
Done


Line 131: TEST_F(HdfsAvroScannerTest, UnionTest) {
> This only exercises the special case when the union has two members, one of
Done, also renamed to NullUnionTest


Line 187:   TestReadAvroInt32(data, 0, -1, -1, TErrorCode::AVRO_TRUNCATED_BLOCK);
> Add a test for the upper bounds maybe, e.g. max int and min it?
Done


Line 190:   // TODO: we don't handle invalid values (e.g. overflow)
> Could we add a test to exercise the code path at least? Even if we don't ha
It DCHECKs right now, so it would take some changes to even run this case.


Line 208:   TestReadAvroInt64(data, 10, 64, 2);
> Add a test for the upper bounds maybe, e.g. max int and min it?
Done


Line 215: TEST_F(HdfsAvroScannerTest, FloatTest) {
> There's the funky case where in avro it's a float, but we're expecting a do
Done (it's done in TestReadAvroFloat()). I also added similar coverage for type promotion
of ints and bigints.


Line 218:   *reinterpret_cast<float*>(data) = f;
> memcpy(data, &f, sizeof(float)) seems a little clearer to me, but this is o
Done


Line 230:   *reinterpret_cast<double*>(data) = d;
> memcpy?
Done


http://gerrit.cloudera.org:8080/#/c/3072/5/be/src/exec/hdfs-avro-scanner.cc
File be/src/exec/hdfs-avro-scanner.cc:

Line 73:     codegend_decode_avro_data_(NULL) {
> DCHECK that it's a test?
Good idea, done (here and in parent ctors)


Line 603:         // TODO: handle return value
> Are you planning to do this in this patchset?
Oops, this is actually super easy to handle, done.

Unfortunately there is no testing for Avro records... I think I meant to do this as follow
on work for nested types, but then that didn't happen. I filed IMPALA-3660 for now, and I'm
gonna try to resolve this ASAP after I finish this patch.


http://gerrit.cloudera.org:8080/#/c/3072/5/be/src/exec/read-write-util.cc
File be/src/exec/read-write-util.cc:

Line 24: int64_t* buf_len, int64_t* result
> Should add a __restrict__ for these args so they can stay in registers. Or 
Adding __restrict__ doesn't seem to have an impact on performance, but I added it anyway.

There probably is some scheme where we can use a fast path and only break out of it after
strings, but I think that will be enough plumbing that we should do it in a later patch (if
we think it's worth it).


Line 29:     if (*buf_len <= 0) return false;
> UNLIKELY()
Done


Line 33:     more = (**buf & 0x80) != 0;
> What is the expected behaviour if the encoded int doesn't fit in a long?
There is none. We don't handle invalid data right now (there's a TODO about this in the test,
I'll also file a JIRA).


http://gerrit.cloudera.org:8080/#/c/3072/5/be/src/exec/zigzag-test.cc
File be/src/exec/zigzag-test.cc:

Line 61: // TODO: test error cases
> Let's add some basic tests at least for overflow, etc.
It DCHECKs currently :( I filed https://issues.cloudera.org/browse/IMPALA-3659


http://gerrit.cloudera.org:8080/#/c/3072/5/tests/data_errors/test_data_errors.py
File tests/data_errors/test_data_errors.py:

Line 86: test_hdfs_rcfile_scan_node_errors
> Missed updating the test name
Done


Line 88:     self.run_test_case('DataErrorsTest/avro-errors', vector)
> I don't see this file
Oops forgot to commit it


-- 
To view, visit http://gerrit.cloudera.org:8080/3072
To unsubscribe, visit http://gerrit.cloudera.org:8080/settings

Gerrit-MessageType: comment
Gerrit-Change-Id: I801a11c496a128e02c564c2a9c44baa5a97be132
Gerrit-PatchSet: 5
Gerrit-Project: Impala
Gerrit-Branch: cdh5-trunk
Gerrit-Owner: Skye Wanderman-Milne <skye@cloudera.com>
Gerrit-Reviewer: Alex Behm <alex.behm@cloudera.com>
Gerrit-Reviewer: Dan Hecht <dhecht@cloudera.com>
Gerrit-Reviewer: Skye Wanderman-Milne <skye@cloudera.com>
Gerrit-Reviewer: Tim Armstrong <tarmstrong@cloudera.com>
Gerrit-HasComments: Yes

Mime
View raw message