impala-dev mailing list archives

Site index · List index
Message view « Date » · « Thread »
Top « Date » · « Thread »
From "Tim Armstrong (Code Review)" <>
Subject [Impala-CR](cdh5-trunk) IMPALA-3441: check for malformed Avro data
Date Tue, 31 May 2016 22:10:10 GMT
Tim Armstrong has posted comments on this change.

Change subject: IMPALA-3441: check for malformed Avro data

Patch Set 5:


It's looking pretty good, I think mainly we could add (even more) test coverage, then I'm
concerned about getting back some or all of the perf regression. I had some ideas that I mentioned
File be/src/exec/

Line 46:  uint8_t** data
It may be worth adding a __restrict__ to some of these pointer args. Otherwise it won't be
able to cache *data_len in a register, since the write to *data may overwrite *data_len.

Line 60:   if (union_position == 2) union_position = 1;
Could implement this as a bitshift 1 bit to the right. It's possible the compiler already
does this.

Line 69:   if (*data_len == 0) {
Should probably add UNLIKELY() annotations to these error conditions. Not sure how much of
a difference it will make.

Line 133: 4
sizeof(float)? or a named constant

Line 146:   }

Line 154:   if (*data_len < 8) {
sizeof(double)? or a named constant

Line 161:   }

Line 170:   if (!ReadWriteUtil::ReadZLong(data, data_len, &len)) {
Can we factor out the string checks into a separate function? ReadAvroVarChar(), ReadAvroChar()
and ReadAvroString() all share the same logic.
File be/src/exec/

Maybe create a JIRA and elaborate on what coverage is missing if it's not going to be included
in the patch.

Line 131: TEST_F(HdfsAvroScannerTest, UnionTest) {
This only exercises the special case when the union has two members, one of which is null,
right? Maybe add a short comment to clarify.

Line 187:   TestReadAvroInt32(data, 0, -1, -1, TErrorCode::AVRO_TRUNCATED_BLOCK);
Add a test for the upper bounds maybe, e.g. max int and min it?

Line 190:   // TODO: we don't handle invalid values (e.g. overflow)
Could we add a test to exercise the code path at least? Even if we don't have a true "expected"

Line 208:   TestReadAvroInt64(data, 10, 64, 2);
Add a test for the upper bounds maybe, e.g. max int and min it?

Line 215: TEST_F(HdfsAvroScannerTest, FloatTest) {
There's the funky case where in avro it's a float, but we're expecting a double: can you add
coverage for that?

Line 218:   *reinterpret_cast<float*>(data) = f;
memcpy(data, &f, sizeof(float)) seems a little clearer to me, but this is ok.

Line 230:   *reinterpret_cast<double*>(data) = d;
File be/src/exec/

Line 73:     codegend_decode_avro_data_(NULL) {
DCHECK that it's a test?

Line 536:     MemPool* pool, uint8_t** data, int64_t* data_len, Tuple* tuple) {
I think it might be slightly better to provide a data_end pointer instead of a data_len, since
it means one fewer mutable argument and I think will work out to slightly fewer instructions.
I don't feel strongly about it though, I'm not sure it's worth going through the effort of
changing it everywhere.

Line 603:         // TODO: handle return value
Are you planning to do this in this patchset?

Line 666: //   br i1 %success, label %end_field, label %bail_out
It may be worth adding branch-weight metadata to these branches (the equivalent of UNLIKELY
annotations). That may help get back some of the perf regression. It looks like there's support
for it: . In LowerExpectIntrinsic.cpp there's
some usage of it that looks relatively straightforward:

  MDBuilder MDB(CI->getContext());
  MDNode *Node;

  // If expect value is equal to 1 it means that we are more likely to take
  // branch 0, in other case more likely is branch 1.
  if (ExpectedValue->isOne())
    Node = MDB.createBranchWeights(LikelyBranchWeight, UnlikelyBranchWeight);
    Node = MDB.createBranchWeights(UnlikelyBranchWeight, LikelyBranchWeight);

  BI.setMetadata(LLVMContext::MD_prof, Node);
File be/src/exec/

Line 24: int64_t* buf_len, int64_t* result
Should add a __restrict__ for these args so they can stay in registers. Or alternatively explicitly
cache them in locals.

We could also consider adding a fast path where we check whether we have enough buffer remaining
to always read the full integer. I'm not sure if there's much benefit to doing this per-field.
Maybe if we could do the check once-per-tuple, but I guess that doesn't work if you have var-len
data like strings.

Line 29:     if (*buf_len <= 0) return false;

Line 33:     more = (**buf & 0x80) != 0;
What is the expected behaviour if the encoded int doesn't fit in a long?
File be/src/exec/

Line 61: // TODO: test error cases
Let's add some basic tests at least for overflow, etc.
File tests/data_errors/

Line 86: test_hdfs_rcfile_scan_node_errors
Missed updating the test name

Line 88:     self.run_test_case('DataErrorsTest/avro-errors', vector)
I don't see this file

To view, visit
To unsubscribe, visit

Gerrit-MessageType: comment
Gerrit-Change-Id: I801a11c496a128e02c564c2a9c44baa5a97be132
Gerrit-PatchSet: 5
Gerrit-Project: Impala
Gerrit-Branch: cdh5-trunk
Gerrit-Owner: Skye Wanderman-Milne <>
Gerrit-Reviewer: Alex Behm <>
Gerrit-Reviewer: Dan Hecht <>
Gerrit-Reviewer: Skye Wanderman-Milne <>
Gerrit-Reviewer: Tim Armstrong <>
Gerrit-HasComments: Yes

View raw message