impala-reviews mailing list archives

Site index · List index
Message view « Date » · « Thread »
Top « Date » · « Thread »
From "Tim Armstrong (Code Review)" <ger...@cloudera.org>
Subject [Impala-ASF-CR] IMPALA-4252: Min-max runtime filters for Kudu
Date Thu, 12 Oct 2017 00:13:34 GMT
Tim Armstrong has posted comments on this change. ( http://gerrit.cloudera.org:8080/7793 )

Change subject: IMPALA-4252: Min-max runtime filters for Kudu
......................................................................


Patch Set 6:

(17 comments)

http://gerrit.cloudera.org:8080/#/c/7793/6//COMMIT_MSG
Commit Message:

http://gerrit.cloudera.org:8080/#/c/7793/6//COMMIT_MSG@15
PS6, Line 15: them into the Kudu client. Because the Kudu client doesn't provide a
Is there any documentation for Kudu about what ordering Kudu uses for comparisons for different
types? I guess we already push filters so we're already depending on the orderings being the
same as Impala, but I'd be interested to know. 

We had a similar discussion about Parquet and the spec got clarified a lot as a result (https://github.com/apache/parquet-format/blob/master/LogicalTypes.md
now specifies it). Parquet-MR also had various odd behaviour that got shaken out as a result
of this.


http://gerrit.cloudera.org:8080/#/c/7793/6/be/src/exec/filter-context.h
File be/src/exec/filter-context.h:

http://gerrit.cloudera.org:8080/#/c/7793/6/be/src/exec/filter-context.h@106
PS6, Line 106: bloom_filter
has_bloom_filter() ? At callsites it reads like this should return a filter rather than a
bool.


http://gerrit.cloudera.org:8080/#/c/7793/6/be/src/exec/filter-context.h@109
PS6, Line 109: min_max_filter
has_min_max_filter?


http://gerrit.cloudera.org:8080/#/c/7793/6/be/src/runtime/coordinator.cc
File be/src/runtime/coordinator.cc:

http://gerrit.cloudera.org:8080/#/c/7793/6/be/src/runtime/coordinator.cc@1220
PS6, Line 1220:       MinMaxFilter::Or(src_type(), params.min_max_filter, min_max_filter_.get());
I mentioned this in a later comment, but we should consider what happens if the min/max values
are very large - maybe thre's some memory management similar to above.


http://gerrit.cloudera.org:8080/#/c/7793/6/be/src/runtime/runtime-filter-bank.cc
File be/src/runtime/runtime-filter-bank.cc:

http://gerrit.cloudera.org:8080/#/c/7793/6/be/src/runtime/runtime-filter-bank.cc@235
PS6, Line 235:   MinMaxFilter* min_max_filter = MinMaxFilter::Create(type, &obj_pool_);
(nit) - could combine into one line.


http://gerrit.cloudera.org:8080/#/c/7793/6/be/src/runtime/runtime-filter-ir.cc
File be/src/runtime/runtime-filter-ir.cc:

http://gerrit.cloudera.org:8080/#/c/7793/6/be/src/runtime/runtime-filter-ir.cc@27
PS6, Line 27:   // to a) the atomicity of / pointer assignments and b) the x86 TSO memory
model.
Comment not relevant any more since we're using actual atomics?


http://gerrit.cloudera.org:8080/#/c/7793/6/be/src/util/min-max-filter-ir.cc
File be/src/util/min-max-filter-ir.cc:

http://gerrit.cloudera.org:8080/#/c/7793/6/be/src/util/min-max-filter-ir.cc@53
PS6, Line 53:       min_str = string(value->ptr, value->len);
We should think through the behaviour with very large strings. It may make sense to disable
the filters if the strings are too large instead of allocating arbitrarily large amounts of
memory.

Maybe we could also do some trickery with truncating the strings to avoid handling the filters
not existing. E.g. if we have a 4 byte limit on min/max, replace min="aaaaaaa" with min="aaaa"
and max="zzzzzz" with max="zzz(". Interested to hear what you think - might be too clever
but seems like it could work.

It would be good to also allocate tracked memory for the string. The Parquet ColumnStats class
solves a similar problem using StringBuffers that allocate from a MemPool.


http://gerrit.cloudera.org:8080/#/c/7793/6/be/src/util/min-max-filter.h
File be/src/util/min-max-filter.h:

http://gerrit.cloudera.org:8080/#/c/7793/6/be/src/util/min-max-filter.h@83
PS6, Line 83: #define NUMERIC_MIN_MAX_FILTER(NAME, TYPE)                         \
Yeah, the macro seems like it may be the least of the evils.


http://gerrit.cloudera.org:8080/#/c/7793/6/be/src/util/min-max-filter.cc
File be/src/util/min-max-filter.cc:

http://gerrit.cloudera.org:8080/#/c/7793/6/be/src/util/min-max-filter.cc@77
PS6, Line 77:   std::string NAME##MinMaxFilter::DebugString() const {                    
   \
nit: don't need std:: prefix in .cc files.


http://gerrit.cloudera.org:8080/#/c/7793/6/be/src/util/min-max-filter.cc@139
PS6, Line 139:       out->min.__set_string_val(std::min(in.min.string_val, out->min.string_val));
I feel like we should use our StringValue::Compare() function instead of relying on std::string's
comparison function. I'm not sure if they are exactly the same but if we compare these the
same way as we compare StringValue's it's easier to reason about.


http://gerrit.cloudera.org:8080/#/c/7793/6/be/src/util/min-max-filter.cc@291
PS6, Line 291:       BigIntMinMaxFilter::Or(in, out);
Shouldn't this be calling the timestamp method?


http://gerrit.cloudera.org:8080/#/c/7793/6/be/src/util/min-max-filter.cc@327
PS6, Line 327:   }
Timestamp?


http://gerrit.cloudera.org:8080/#/c/7793/6/fe/src/main/java/org/apache/impala/planner/RuntimeFilterGenerator.java
File fe/src/main/java/org/apache/impala/planner/RuntimeFilterGenerator.java:

http://gerrit.cloudera.org:8080/#/c/7793/6/fe/src/main/java/org/apache/impala/planner/RuntimeFilterGenerator.java@379
PS6, Line 379:         if (slotRef == null || slotRef.getDesc().getColumn() == null
Are there planner tests for all these cases?


http://gerrit.cloudera.org:8080/#/c/7793/6/testdata/workloads/functional-query/queries/QueryTest/runtime_filters.test
File testdata/workloads/functional-query/queries/QueryTest/runtime_filters.test:

http://gerrit.cloudera.org:8080/#/c/7793/6/testdata/workloads/functional-query/queries/QueryTest/runtime_filters.test@30
PS6, Line 30: kudu
Can we make this something like table_format=kudu so that it's a bit more self-describing
and future-proof?


http://gerrit.cloudera.org:8080/#/c/7793/6/tests/query_test/test_runtime_filters.py
File tests/query_test/test_runtime_filters.py:

http://gerrit.cloudera.org:8080/#/c/7793/6/tests/query_test/test_runtime_filters.py@61
PS6, Line 61:       lambda v: v.get_value('table_format').file_format not in ['hbase', 'kudu'])
nit: 4 character indents on line continuations (I guess this is preserved from the existing
code).


http://gerrit.cloudera.org:8080/#/c/7793/6/tests/util/test_file_parser.py
File tests/util/test_file_parser.py:

http://gerrit.cloudera.org:8080/#/c/7793/6/tests/util/test_file_parser.py@252
PS6, Line 252:           allowed_formats = ['kudu']
Thanks for adding this validation, hopefully will lead people in the right direction if they
need to generalise this.


http://gerrit.cloudera.org:8080/#/c/7793/6/tests/util/test_file_parser.py@254
PS6, Line 254: hint
hint? Isn't it a table format?



-- 
To view, visit http://gerrit.cloudera.org:8080/7793
To unsubscribe, visit http://gerrit.cloudera.org:8080/settings

Gerrit-Project: Impala-ASF
Gerrit-Branch: master
Gerrit-MessageType: comment
Gerrit-Change-Id: I02bad890f5b5f78388a3041bf38f89369b5e2f1c
Gerrit-Change-Number: 7793
Gerrit-PatchSet: 6
Gerrit-Owner: Thomas Tauber-Marshall <tmarshall@cloudera.com>
Gerrit-Reviewer: Lars Volker <lv@cloudera.com>
Gerrit-Reviewer: Matthew Jacobs <mjacobs@apache.org>
Gerrit-Reviewer: Michael Ho <kwho@cloudera.com>
Gerrit-Reviewer: Mostafa Mokhtar <mmokhtar@cloudera.com>
Gerrit-Reviewer: Thomas Tauber-Marshall <tmarshall@cloudera.com>
Gerrit-Reviewer: Tim Armstrong <tarmstrong@cloudera.com>
Gerrit-Comment-Date: Thu, 12 Oct 2017 00:13:34 +0000
Gerrit-HasComments: Yes

Mime
  • Unnamed multipart/alternative (inline, 8-Bit, 0 bytes)
View raw message