impala-dev mailing list archives

From "Tim Armstrong (Code Review)" <>
Subject [Impala-CR](cdh5-trunk) IMPALA-3764: fuzz test HDFS scanners
Date Wed, 20 Jul 2016 20:21:10 GMT
Tim Armstrong has posted comments on this change.

Change subject: IMPALA-3764: fuzz test HDFS scanners

Patch Set 5:

File be/src/exec/

Line 196:     return Status(Substitute("Hit end of stream after reading $0 bytes of $1-byte
> Does returning a status here and below honor the abort_on_error flag? I'm f
I confirmed it gets bubbled up to BaseSequenceScanner::ProcessSplit(), which logs the error.

We actually test this with the fuzz testing, since we run with abort_on_error=0 and expect
no exceptions (aside from Parquet, which has a couple of bugs of this nature).
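To make the control flow being discussed concrete, here is a minimal sketch (hypothetical names and simplified types, not Impala's actual Status or scanner classes) of how a parse error can be surfaced as a Status and then either logged or propagated depending on an abort_on_error flag:

```cpp
#include <cassert>
#include <string>

// Hypothetical, simplified stand-in for Impala's Status class.
struct Status {
  std::string msg;
  bool ok() const { return msg.empty(); }
  static Status OK() { return Status{}; }
};

// A scan step that hits a truncated stream returns an error Status
// instead of crashing; the caller decides what to do with it.
Status ReadBlock(int bytes_read, int expected_bytes) {
  if (bytes_read < expected_bytes) {
    return Status{"Hit end of stream after reading " + std::to_string(bytes_read) +
                  " bytes of " + std::to_string(expected_bytes) + "-byte block"};
  }
  return Status::OK();
}

// Mirrors the behaviour described above: with abort_on_error=0 the error
// is logged and the scan continues; with abort_on_error=1 it is propagated
// and the query fails.
Status ProcessSplit(bool abort_on_error, std::string* log) {
  Status s = ReadBlock(/*bytes_read=*/10, /*expected_bytes=*/64);
  if (!s.ok()) {
    if (abort_on_error) return s;  // query fails
    *log += s.msg;                 // error is logged, scan keeps going
  }
  return Status::OK();
}
```

The fuzz test exercising both settings is what gives confidence that errors bubble up rather than crash the process.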
File be/src/exec/

Line 79:     DCHECK_GE(footer_start, 0);
> Ok to leave in this patch, but this should not be DCHECK right?
I believe if the above calculation is correct, it should never be < 0 because of the min().
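For context, the clamping argument can be sketched like this (illustrative arithmetic with a hypothetical FOOTER_SIZE constant, not the actual Impala code):

```cpp
#include <algorithm>
#include <cassert>
#include <cstdint>

// Hypothetical constant standing in for the real footer size.
constexpr int64_t FOOTER_SIZE = 1024;

// The footer scan starts FOOTER_SIZE bytes before the end of the file,
// but the min() clamps the step-back to the file length, so the result
// can never go below zero -- which is why a DCHECK (rather than a
// runtime error) suffices here.
int64_t FooterStart(int64_t file_len) {
  return file_len - std::min(file_len, FOOTER_SIZE);
}
```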

Line 528:         parse_status_.MergeStatus(Substitute("Corrupted parquet file: column had $0 "
> Parquet (capital)

Line 714:   ScopedBuffer metadata_buffer(scan_node_->mem_tracker());
> move closer to where it's used
I can't move it into the if statement since it needs to live longer than that (it's referenced
via metadata_ptr).
File be/src/exec/

Line 97:     return Status(Substitute("Corrupt parquet file: $0 bytes of encoded levels but only "
> Parquet

Line 929:   if (UNLIKELY(def_level_ < 0 || def_level_ > max_def_level())) {
Updated the message for this one.
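The check being quoted boils down to a simple range validation (a hypothetical sketch, not the scanner's actual code): a decoded definition level outside [0, max_def_level] means the levels data is corrupt.

```cpp
#include <cassert>

// Illustrative: a definition level decoded from a (possibly corrupt)
// Parquet file must lie within the range declared by the schema.
bool DefLevelValid(int def_level, int max_def_level) {
  return def_level >= 0 && def_level <= max_def_level;
}
```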
File be/src/exec/

Line 64:         return Status(Substitute("File $0: metadata is corrupt. Dictionary page "
> Parquet file '0'

Line 84:     return Status(Substitute("File '$0': metadata is corrupt. Column $1 has invalid
> Parquet file '0':

Line 341:     // Sanity-check the schema to avoid allocating any absurdly large buffers below.
> remove 'any'

Line 347:         "File is likely corrupt", node->element->num_children));
> remove "File is likely corrupt" (the err msg already mentioned that)
File be/src/exec/parquet-metadata-utils.h:

Line 41:   static Status ValidateOffsetInFile(const std::string& filename, int column,
> column -> col_idx

Line 149:   /// An arbitrary limit on the number of children per schema node we support used
> start a new sentence after "support"
File be/src/runtime/scoped-buffer.h:

Line 30:     Release();
> single line

Line 57:   inline uint8_t* buffer() { return buffer_; }
> const method

Line 62:   /// The current size of the allocated buffer, if not NULL;
> trailing semicolon
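Taken together, the three comments above suggest a shape roughly like the following (a hypothetical RAII sketch, not the actual scoped-buffer.h, which allocates via a MemTracker):

```cpp
#include <cassert>
#include <cstdint>
#include <cstdlib>

// Sketch illustrating the review points: single-line destructor,
// const accessor, and a correctly punctuated size() comment.
class ScopedBuffer {
 public:
  ScopedBuffer() : buffer_(nullptr), bytes_(0) {}
  ~ScopedBuffer() { Release(); }  // single line

  bool TryAllocate(int64_t bytes) {
    Release();
    buffer_ = static_cast<uint8_t*>(malloc(bytes));
    if (buffer_ == nullptr) return false;
    bytes_ = bytes;
    return true;
  }

  void Release() {
    free(buffer_);  // free(nullptr) is a no-op
    buffer_ = nullptr;
    bytes_ = 0;
  }

  // Const method: reading the pointer does not mutate the object.
  uint8_t* buffer() const { return buffer_; }

  /// The current size of the allocated buffer, if not NULL.
  int64_t size() const { return bytes_; }

 private:
  uint8_t* buffer_;
  int64_t bytes_;
};
```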
File be/src/util/bit-stream-utils.h:

Line 78:   static const int MAX_SUPPORTED_BITWIDTH = 32;
> MAX_BITWIDTH is shorter and equally clear (and below)
File be/src/util/dict-encoding.h:

Line 171:     if (UNLIKELY(buffer_len < 1)) return Status("Dictionary cannot be 0 bytes");
> buffer_len == 0 is more obvious
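A sketch of the suggested form (hypothetical function, not the actual dict-encoding.h code): testing for equality with zero states the intent -- "an empty dictionary is invalid" -- more directly than a general `< 1` range check.

```cpp
#include <cassert>
#include <string>

// Illustrative: validate the dictionary buffer length before decoding.
// Returns an error message on failure, or the empty string on success.
std::string ValidateDictBuffer(int buffer_len) {
  if (buffer_len == 0) return "Dictionary cannot be 0 bytes";
  return "";
}
```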
File tests/query_test/

Line 600: class TestScannersFuzzing(ImpalaTestSuite):
> This is big enough to go into its own .py file.

Line 602:   BATCH_SIZES = [0, 1, 16, 10000]
> include our default batch size 1024
0 means the default batch size.

Line 649:     random_seed = os.environ.get("SCANNER_FUZZ_SEED") or time.time()
> If we are going to check in this test and run it in our evergreen jobs I th
We can discuss this a bit in person but my feeling is that it's better to use a random seed
to get the extra coverage, and that if we see breakages with any frequency it's a cultural
problem with inadequate precommit testing of scanner changes.

This is easy to run locally and it tests a well-defined part of the codebase (the scanners),
so the onus should be on patch authors to run it as part of pre-commit testing enough times
to be confident that they haven't introduced any bugs.

I see the stress testing and query generator as slightly different since they're harder to
run pre-commit and it's not always obvious which changes may introduce relevant bugs.

If something slips through, I'd prefer the extra coverage we get from running this as part
of all our tests with a random seed. Otherwise the coverage is pretty limited to a small subset
of potential errors. My experience running this has been that you'd catch maybe 50% of bugs
in the first run, 90% from running it for a few iterations, and the remainder running it for
a few hours.

We also want to run this under all the different configs (S3, release, ASAN, etc), so there's
a lot of overhead to set up new jobs for all of those. It might be useful to have a job that
just runs this in a loop, but I think that's a separate thing.
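The seeding scheme being defended here (an env-var override for reproducing a failure, otherwise a fresh clock-derived seed for extra coverage) can be sketched like this -- an illustrative C++ analogue, since the real test is the Python line quoted above:

```cpp
#include <cassert>
#include <cstdlib>
#include <ctime>

// Illustrative: honour an env-var override so a failing run can be
// reproduced exactly, otherwise derive the seed from the clock so each
// run exercises a different corruption pattern.
unsigned ChooseSeed(const char* env_value, unsigned fallback) {
  if (env_value != nullptr && *env_value != '\0') {
    return static_cast<unsigned>(std::strtoul(env_value, nullptr, 10));
  }
  return fallback;
}

unsigned ChooseFuzzSeed() {
  return ChooseSeed(std::getenv("SCANNER_FUZZ_SEED"),
                    static_cast<unsigned>(std::time(nullptr)));
}
```

Logging the chosen seed at the start of the run is what makes the random-seed approach workable: any breakage can be replayed by exporting the logged value.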

Line 686:     # Execute a query that try to read all the columns and rows in the file.
> tries to read

Line 688:     query = 'select count(*) from (select distinct * from {0}.{1}) q'.format(
> can we also just run a count(*) query to see what happens when we materiali


Gerrit-MessageType: comment
Gerrit-Change-Id: I50cf43195a7c582caa02c85ae400ea2256fa3a3b
Gerrit-PatchSet: 5
Gerrit-Project: Impala
Gerrit-Branch: cdh5-trunk
Gerrit-Owner: Tim Armstrong <>
Gerrit-Reviewer: Alex Behm <>
Gerrit-Reviewer: David Knupp <>
Gerrit-Reviewer: Taras Bobrovytsky <>
Gerrit-Reviewer: Tim Armstrong <>
Gerrit-HasComments: Yes
