impala-dev mailing list archives

Site index · List index
Message view « Date » · « Thread »
Top « Date » · « Thread »
From "Tim Armstrong (Code Review)" <ger...@cloudera.org>
Subject [Impala-CR](cdh5-trunk) IMPALA-3764: fuzz test HDFS scanners
Date Wed, 20 Jul 2016 20:21:10 GMT
Tim Armstrong has posted comments on this change.

Change subject: IMPALA-3764: fuzz test HDFS scanners
......................................................................


Patch Set 5:

(22 comments)

http://gerrit.cloudera.org:8080/#/c/3448/5/be/src/exec/base-sequence-scanner.cc
File be/src/exec/base-sequence-scanner.cc:

Line 196:     return Status(Substitute("Hit end of stream after reading $0 bytes of $1-byte
"
> Does returning a status here and below honor the abort_on_error flag? I'm f
I confirmed it gets bubbled up to BaseSequenceScanner::ProcessSplit(), which logs the error.

We actually test this with the fuzz testing, since we run with abort_on_error=0 and expect
no exceptions (aside from Parquet, which has a couple of bugs of this nature).


http://gerrit.cloudera.org:8080/#/c/3448/5/be/src/exec/hdfs-parquet-scanner.cc
File be/src/exec/hdfs-parquet-scanner.cc:

Line 79:     DCHECK_GE(footer_start, 0);
> Ok to leave in this patch, but this should not be DCHECK right?
I believe if the above calculation is correct, it should never be < 0 because of the min().


Line 528:         parse_status_.MergeStatus(Substitute("Corrupted parquet file: column had
$0 "
> Parquet (capital)
Done


Line 714:   ScopedBuffer metadata_buffer(scan_node_->mem_tracker());
> move closer to where it's used
I can't move it into the if statement since it needs to live longer than that (it's referenced
via metadata_ptr)


http://gerrit.cloudera.org:8080/#/c/3448/5/be/src/exec/parquet-column-readers.cc
File be/src/exec/parquet-column-readers.cc:

Line 97:     return Status(Substitute("Corrupt parquet file: $0 bytes of encoded levels but
only "
> Parquet
Done


Line 929:   if (UNLIKELY(def_level_ < 0 || def_level_ > max_def_level())) {
Updated the message for this one..


http://gerrit.cloudera.org:8080/#/c/3448/5/be/src/exec/parquet-metadata-utils.cc
File be/src/exec/parquet-metadata-utils.cc:

Line 64:         return Status(Substitute("File $0: metadata is corrupt. Dictionary page "
> Parquet file '0'
Done


Line 84:     return Status(Substitute("File '$0': metadata is corrupt. Column $1 has invalid
"
> Parquet file '0':
Done


Line 341:     // Sanity-check the schema to avoid allocating any absurdly large buffers below.
> remove 'any'
Done


Line 347:         "File is likely corrupt", node->element->num_children));
> remove "File is likely corrupt" (the err msg already mentioned that)
Done


http://gerrit.cloudera.org:8080/#/c/3448/5/be/src/exec/parquet-metadata-utils.h
File be/src/exec/parquet-metadata-utils.h:

Line 41:   static Status ValidateOffsetInFile(const std::string& filename, int column,
> column -> col_idx
Done


Line 149:   /// An arbitrary limit on the number of children per schema node we support used
to
> start a new sentence after "support"
Done


http://gerrit.cloudera.org:8080/#/c/3448/5/be/src/runtime/scoped-buffer.h
File be/src/runtime/scoped-buffer.h:

Line 30:     Release();
> single line
Done


Line 57:   inline uint8_t* buffer() { return buffer_; }
> const method
Done


Line 62:   /// The current size of the allocated buffer, if not NULL;
> trailing semicolon
Done


http://gerrit.cloudera.org:8080/#/c/3448/5/be/src/util/bit-stream-utils.h
File be/src/util/bit-stream-utils.h:

Line 78:   static const int MAX_SUPPORTED_BITWIDTH = 32;
> MAX_BITWIDTH is shorter and equally clear (and below)
Done


http://gerrit.cloudera.org:8080/#/c/3448/5/be/src/util/dict-encoding.h
File be/src/util/dict-encoding.h:

Line 171:     if (UNLIKELY(buffer_len < 1)) return Status("Dictionary cannot be 0 bytes");
> buffer_len == 0 is more obvious
Done


http://gerrit.cloudera.org:8080/#/c/3448/5/tests/query_test/test_scanners.py
File tests/query_test/test_scanners.py:

Line 600: class TestScannersFuzzing(ImpalaTestSuite):
> This is big enough to go into it's own .py file.
Done


Line 602:   BATCH_SIZES = [0, 1, 16, 10000]
> include our default batch size 1024
0 means the default batch size.


Line 649:     random_seed = os.environ.get("SCANNER_FUZZ_SEED") or time.time()
> If we are going to check in this test and run it in our evergreen jobs I th
We can discuss this a bit in person but my feeling is that it's better to use a random seed
to get the extra coverage, and that if we see breakages with any frequency it's a cultural
problem with inadequate precommit testing of scanner changes.

This is easy to run locally and that it is testing a well-defined part of the codebase (the
scanners), so the onus should be on patch authors to run this as part of pre-commit testing
enough times to be confident that they haven't introduced any bugs.

I see the stress testing and query generator as slightly different since they're harder to
run pre-commit and it's not always obvious which changes may introduce relevant bugs.

If something slips through, I'd prefer the extra coverage we get from running this as part
of all our tests with a random seed. Otherwise the coverage is pretty limited to a small subset
of potential errors. My experience running this has been that you'd catch maybe 50% of bugs
in the first run, 90% from running it for a few iterations, and the remainder running it for
a few hours.

We also want to run this under all the different configs (S3, release, ASAN, etc), so there's
a lot of overhead to set up new jobs for all of those. It might be useful to have a job that
just runs this in a loop, but I think that's a separate thing.


Line 686:     # Execute a query that try to read all the columns and rows in the file.
> tries to read
Done


Line 688:     query = 'select count(*) from (select distinct * from {0}.{1}) q'.format(
> can we also just run a count(*) query to see what happens when we materiali
Done


-- 
To view, visit http://gerrit.cloudera.org:8080/3448
To unsubscribe, visit http://gerrit.cloudera.org:8080/settings

Gerrit-MessageType: comment
Gerrit-Change-Id: I50cf43195a7c582caa02c85ae400ea2256fa3a3b
Gerrit-PatchSet: 5
Gerrit-Project: Impala
Gerrit-Branch: cdh5-trunk
Gerrit-Owner: Tim Armstrong <tarmstrong@cloudera.com>
Gerrit-Reviewer: Alex Behm <alex.behm@cloudera.com>
Gerrit-Reviewer: David Knupp <dknupp@cloudera.com>
Gerrit-Reviewer: Taras Bobrovytsky <tbobrovytsky@cloudera.com>
Gerrit-Reviewer: Tim Armstrong <tarmstrong@cloudera.com>
Gerrit-HasComments: Yes

Mime
View raw message