impala-dev mailing list archives

From "Skye Wanderman-Milne (Code Review)" <ger...@cloudera.org>
Subject [Impala-CR](cdh5-trunk) PREVIEW IMPALA-3441: check for malformed Avro data
Date Tue, 17 May 2016 00:49:27 GMT
Skye Wanderman-Milne has uploaded a new patch set (#3).

Change subject: PREVIEW IMPALA-3441: check for malformed Avro data
......................................................................

PREVIEW IMPALA-3441: check for malformed Avro data

This patch adds error checking to the Avro scanner (both the codegen'd
and interpreted paths), including out-of-bounds checks and data
validity checks.
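
For readers unfamiliar with the two kinds of checks, here is a minimal
sketch (not the code in this patch) of a bounds-checked read of an
Avro zigzag varint. ReadResult and ReadZLongChecked are hypothetical
names chosen for illustration; the actual helpers in
read-write-util.h may look different.

```cpp
#include <cstdint>

// Hypothetical result type for illustration; not Impala's actual API.
struct ReadResult {
  bool ok;        // false if the input was truncated or malformed
  int64_t value;  // decoded value, meaningful only when ok is true
};

// Decodes one zigzag-encoded variable-length long from [*buf, buf_end).
// On success, advances *buf past the consumed bytes. On failure, *buf is
// left unchanged so the caller can report the offset of the bad value.
inline ReadResult ReadZLongChecked(const uint8_t** buf, const uint8_t* buf_end) {
  constexpr int kMaxBytes = 10;  // a 64-bit varint occupies at most 10 bytes
  uint64_t zz = 0;
  const uint8_t* p = *buf;
  for (int i = 0; i < kMaxBytes; ++i) {
    // Out-of-bounds check: never read past the end of the buffer.
    if (p >= buf_end) return {false, 0};
    const uint8_t byte = *p++;
    zz |= static_cast<uint64_t>(byte & 0x7F) << (7 * i);
    if ((byte & 0x80) == 0) {
      *buf = p;
      // Undo the zigzag mapping (0, 1, 2, 3 -> 0, -1, 1, -2).
      const int64_t v =
          static_cast<int64_t>(zz >> 1) ^ -static_cast<int64_t>(zz & 1);
      return {true, v};
    }
  }
  // Validity check: the continuation bit was still set on the 10th byte.
  return {false, 0};
}
```

Leaving *buf untouched on failure is one possible design choice; it
lets the caller compute the file offset for the error message.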

I ran a local benchmark using the following query:
  set num_scanner_threads=1;
  select max(i) from default.avro_ints_big;

where avro_ints_big is an Avro table with a single int column
containing ~90MM values. With this patch, the total query time goes
from 1.6s to X.Xs (XX% increase), with the MaterializeTupleTime going
from 975ms to XXXXms (XX% increase).

TODO:
- I plan to write unit tests for most of these cases, and one or
  two malformed files for end-to-end tests. It's too hard to exercise
  all these cases with end-to-end tests.
- Perf numbers / improvements

Tests run:
- ./run-tests.py query_test/test_scanners.py --table_formats avro/snap
- Ad-hoc query on a malformed file I generated. Gives the error:
    File '...' is corrupt: invalid union value 99 at offset 234
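
As an illustration of the kind of data validity check that could
produce the error above, here is a hedged sketch. CheckUnionBranch is
a hypothetical helper, not a function from this patch, and the real
scanner presumably reports errors through Impala's generated error
codes rather than a plain string. An Avro union value is prefixed
with a branch index, so an index outside the schema's branch list
means the data is malformed.

```cpp
#include <cstdint>
#include <sstream>
#include <string>

// Hypothetical helper for illustration only. Returns an empty string when
// the branch index is valid for a union with num_branches branches, and an
// error message in the style shown above otherwise.
inline std::string CheckUnionBranch(const std::string& filename,
                                    int64_t branch, int64_t num_branches,
                                    int64_t offset) {
  if (branch >= 0 && branch < num_branches) return "";
  std::ostringstream ss;
  ss << "File '" << filename << "' is corrupt: invalid union value "
     << branch << " at offset " << offset;
  return ss.str();
}
```

For a nullable column (a two-branch union), a call such as
CheckUnionBranch(filename, 99, 2, 234) would yield a message of the
form shown in the ad-hoc test.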

Change-Id: I801a11c496a128e02c564c2a9c44baa5a97be132
---
M be/src/exec/hdfs-avro-scanner-ir.cc
M be/src/exec/hdfs-avro-scanner.cc
M be/src/exec/hdfs-avro-scanner.h
M be/src/exec/read-write-util.cc
M be/src/exec/read-write-util.h
M common/thrift/generate_error_codes.py
6 files changed, 271 insertions(+), 99 deletions(-)


  git pull ssh://gerrit.cloudera.org:29418/Impala refs/changes/72/3072/3
-- 
To view, visit http://gerrit.cloudera.org:8080/3072
To unsubscribe, visit http://gerrit.cloudera.org:8080/settings

Gerrit-MessageType: newpatchset
Gerrit-Change-Id: I801a11c496a128e02c564c2a9c44baa5a97be132
Gerrit-PatchSet: 3
Gerrit-Project: Impala
Gerrit-Branch: cdh5-trunk
Gerrit-Owner: Skye Wanderman-Milne <skye@cloudera.com>
Gerrit-Reviewer: Skye Wanderman-Milne <skye@cloudera.com>
