impala-dev mailing list archives

Site index · List index
Message view « Date » · « Thread »
Top « Date » · « Thread »
From "Tim Armstrong (Code Review)" <ger...@cloudera.org>
Subject [Impala-CR](cdh5-trunk) IMPALA-3441: check for malformed Avro data
Date Tue, 31 May 2016 22:10:10 GMT
Tim Armstrong has posted comments on this change.

Change subject: IMPALA-3441: check for malformed Avro data
......................................................................


Patch Set 5:

(26 comments)

It's looking pretty good, I think mainly we could add (even more) test coverage, then I'm
concerned about getting back some or all of the perf regression. I had some ideas that I mentioned
inline.

http://gerrit.cloudera.org:8080/#/c/3072/5/be/src/exec/hdfs-avro-scanner-ir.cc
File be/src/exec/hdfs-avro-scanner-ir.cc:

Line 46:  uint8_t** data
It may be worth adding a __restrict__ to some of these pointer args. Otherwise it won't be
able to cache *data_len in a register, since the write to *data may overwrite *data_len.


Line 60:   if (union_position == 2) union_position = 1;
Could implement this as a bitshift 1 bit to the right. It's possible the compiler already
does this.


Line 69:   if (*data_len == 0) {
Should probably add UNLIKELY() annotations to these error conditions. Not sure how much of
a difference it will make.


Line 133: 4
sizeof(float)? or a named constant


Line 146:   }
sizeof(float)?


Line 154:   if (*data_len < 8) {
sizeof(double)? or a named constant


Line 161:   }
sizeof(double)?


Line 170:   if (!ReadWriteUtil::ReadZLong(data, data_len, &len)) {
Can we factor out the string checks into a separate function? ReadAvroVarChar(), ReadAvroChar()
and ReadAvroString() all share the same logic.


http://gerrit.cloudera.org:8080/#/c/3072/5/be/src/exec/hdfs-avro-scanner-test.cc
File be/src/exec/hdfs-avro-scanner-test.cc:

Line 24: // TODO: CHAR, VARCHAR
Maybe create a JIRA and elaborate on what coverage is missing if it's not going to be included
in the patch.


Line 131: TEST_F(HdfsAvroScannerTest, UnionTest) {
This only exercises the special case when the union has two members, one of which is null,
right? Maybe add a short comment to clarify.


Line 187:   TestReadAvroInt32(data, 0, -1, -1, TErrorCode::AVRO_TRUNCATED_BLOCK);
Add a test for the upper bounds maybe, e.g. max int and min it?


Line 190:   // TODO: we don't handle invalid values (e.g. overflow)
Could we add a test to exercise the code path at least? Even if we don't have a true "expected"
value.


Line 208:   TestReadAvroInt64(data, 10, 64, 2);
Add a test for the upper bounds maybe, e.g. max int and min it?


Line 215: TEST_F(HdfsAvroScannerTest, FloatTest) {
There's the funky case where in avro it's a float, but we're expecting a double: can you add
coverage for that?


Line 218:   *reinterpret_cast<float*>(data) = f;
memcpy(data, &f, sizeof(float)) seems a little clearer to me, but this is ok.


Line 230:   *reinterpret_cast<double*>(data) = d;
memcpy?


http://gerrit.cloudera.org:8080/#/c/3072/5/be/src/exec/hdfs-avro-scanner.cc
File be/src/exec/hdfs-avro-scanner.cc:

Line 73:     codegend_decode_avro_data_(NULL) {
DCHECK that it's a test?


Line 536:     MemPool* pool, uint8_t** data, int64_t* data_len, Tuple* tuple) {
I think it might be slightly better to provide a data_end pointer instead of a data_len, since
it means one fewer mutable argument and I think will work out to slightly fewer instructions.
I don't feel strongly about it though, I'm not sure it's worth going through the effort of
changing it everywhere.


Line 603:         // TODO: handle return value
Are you planning to do this in this patchset?


Line 666: //   br i1 %success, label %end_field, label %bail_out
It may be worth adding branch-weight metadata to these branches (the equivalent of UNLIKELY
annotations). That may help get back some of the perf regression. It looks like there's support
for it: http://llvm.org/docs/BranchWeightMetadata.html . In LowerExpectIntrinsic.cpp there's
some usage of it that looks relatively straightforward:


  MDBuilder MDB(CI->getContext());
  MDNode *Node;

  // If expect value is equal to 1 it means that we are more likely to take
  // branch 0, in other case more likely is branch 1.
  if (ExpectedValue->isOne())
    Node = MDB.createBranchWeights(LikelyBranchWeight, UnlikelyBranchWeight);
  else
    Node = MDB.createBranchWeights(UnlikelyBranchWeight, LikelyBranchWeight);

  BI.setMetadata(LLVMContext::MD_prof, Node);


http://gerrit.cloudera.org:8080/#/c/3072/5/be/src/exec/read-write-util.cc
File be/src/exec/read-write-util.cc:

Line 24: int64_t* buf_len, int64_t* result
Should add a __restrict__ for these args so they can stay in registers. Or alternatively explicitly
cache them in locals.

We could also consider adding a fast path where we check whether we have enough buffer remaining
to always read the full integer. I'm not sure if there's much benefit to doing this per-field.
Maybe if we could do the check once-per-tuple, but I guess that doesn't work if you have var-len
data like strings.


Line 29:     if (*buf_len <= 0) return false;
UNLIKELY()


Line 33:     more = (**buf & 0x80) != 0;
What is the expected behaviour if the encoded int doesn't fit in a long?


http://gerrit.cloudera.org:8080/#/c/3072/5/be/src/exec/zigzag-test.cc
File be/src/exec/zigzag-test.cc:

Line 61: // TODO: test error cases
Let's add some basic tests at least for overflow, etc.


http://gerrit.cloudera.org:8080/#/c/3072/5/tests/data_errors/test_data_errors.py
File tests/data_errors/test_data_errors.py:

Line 86: test_hdfs_rcfile_scan_node_errors
Missed updating the test name


Line 88:     self.run_test_case('DataErrorsTest/avro-errors', vector)
I don't see this file


-- 
To view, visit http://gerrit.cloudera.org:8080/3072
To unsubscribe, visit http://gerrit.cloudera.org:8080/settings

Gerrit-MessageType: comment
Gerrit-Change-Id: I801a11c496a128e02c564c2a9c44baa5a97be132
Gerrit-PatchSet: 5
Gerrit-Project: Impala
Gerrit-Branch: cdh5-trunk
Gerrit-Owner: Skye Wanderman-Milne <skye@cloudera.com>
Gerrit-Reviewer: Alex Behm <alex.behm@cloudera.com>
Gerrit-Reviewer: Dan Hecht <dhecht@cloudera.com>
Gerrit-Reviewer: Skye Wanderman-Milne <skye@cloudera.com>
Gerrit-Reviewer: Tim Armstrong <tarmstrong@cloudera.com>
Gerrit-HasComments: Yes

Mime
View raw message