impala-reviews mailing list archives

Site index · List index
Message view « Date » · « Thread »
Top « Date » · « Thread »
From "Alex Behm (Code Review)" <ger...@cloudera.org>
Subject [Impala-ASF-CR] IMPALA-4363: Add Parquet timestamp validation
Date Fri, 11 Nov 2016 20:23:34 GMT
Alex Behm has posted comments on this change.

Change subject: IMPALA-4363: Add Parquet timestamp validation
......................................................................


Patch Set 4:

(14 comments)

http://gerrit.cloudera.org:8080/#/c/4968/4/be/src/exec/parquet-column-readers.cc
File be/src/exec/parquet-column-readers.cc:

Line 233:     // TODO: consider adding validation for other types, such as DECIMAL.
Let's file a JIRA instead. You don't need to identify whether there is a bug or not, the JIRA
can say that we should investigate the DECIMAL case (and link to this JIRA).


Line 388:               ReadSlot<IS_DICT_ENCODED>(tuple, pool);
single line if it fits


Line 475:     if (NeedsValidation() && !ValidateSlot(reinterpret_cast<T*>(slot)))
{
The API seems a little clunky because we call a generic ValidateSlot() but then set a type-specific
error message. How would you generalize this for DECIMAL?

Imo it'll make the code easier to follow without baking in the TIMESTAMP assumption here.

Maybe ValidateSlot() could take an ErrorMsg* as output param which would be set when returning
false. Or perhaps you have another idea.


Line 503:     DCHECK(!needs_conversion_);
DCHECK(false)?


Line 513:   /// Sets the slot to NULL if the slot value is invalid, e.g., due to being
Update comment, doesn't set anything


Line 515:   bool ValidateSlot(T* src) {
const function


Line 520:   /// Pull out slow-path Status construction code from ReadRepetitionLevel()/
remove the last part "from ReadRepetitionLevel..." I don't think that's accurate anymore


Line 1084:   void* slot = tuple->GetSlot(tuple_offset_);
just inline into L1087


http://gerrit.cloudera.org:8080/#/c/4968/4/be/src/exec/parquet-column-readers.h
File be/src/exec/parquet-column-readers.h:

Line 511:   /// Recursively reads from children_ to assemble a single CollectionValue into
Update comment to reflect API change


http://gerrit.cloudera.org:8080/#/c/4968/4/be/src/runtime/timestamp-value.h
File be/src/runtime/timestamp-value.h:

Line 22: 
remove blank lines


Line 150:   inline bool IsValidDate() const {
nice


http://gerrit.cloudera.org:8080/#/c/4968/4/common/thrift/generate_error_codes.py
File common/thrift/generate_error_codes.py:

Line 312:    "The valid date range is 1400..9999."),
mention the full range with year-month-day to make it clearer


http://gerrit.cloudera.org:8080/#/c/4968/4/testdata/bin/create-load-data.sh
File testdata/bin/create-load-data.sh:

Line 382:   hadoop fs -put -f ${IMPALA_HOME}/testdata/bad_parquet_data/out-of-range-timestamp.parq
\
Let's avoid doing this and instead create a test table on-the-fly in the test. There's no
need to store this in the snapshot and having the loading separate from the actual test can
cause confusion.


http://gerrit.cloudera.org:8080/#/c/4968/4/tests/query_test/test_scanners.py
File tests/query_test/test_scanners.py:

Line 248:   def test_zero_rows(self, vector, unique_database):
it would be nice to have a new test test_timestamp_out_of_range that looks like this one

the test_corrput_files doesn't really follow best practices imo, and let's not fix that in
this patch, so adding a new test seems easiest/cleanest

also it's arguable whether these bad timestamp files are really corrupt, the values are certainly
legal INT96, but the interpretation as a timestamp is the problem


-- 
To view, visit http://gerrit.cloudera.org:8080/4968
To unsubscribe, visit http://gerrit.cloudera.org:8080/settings

Gerrit-MessageType: comment
Gerrit-Change-Id: I9988449aa0dc0f39fabb91ce6cce0dd8a06e8bcf
Gerrit-PatchSet: 4
Gerrit-Project: Impala-ASF
Gerrit-Branch: master
Gerrit-Owner: Taras Bobrovytsky <tbobrovytsky@cloudera.com>
Gerrit-Reviewer: Alex Behm <alex.behm@cloudera.com>
Gerrit-Reviewer: Lars Volker <lv@cloudera.com>
Gerrit-Reviewer: Taras Bobrovytsky <tbobrovytsky@cloudera.com>
Gerrit-HasComments: Yes

Mime
View raw message