Return-Path: X-Original-To: archive-asf-public-internal@cust-asf2.ponee.io Delivered-To: archive-asf-public-internal@cust-asf2.ponee.io Received: from cust-asf.ponee.io (cust-asf.ponee.io [163.172.22.183]) by cust-asf2.ponee.io (Postfix) with ESMTP id 9C188200D29 for ; Thu, 12 Oct 2017 02:13:40 +0200 (CEST) Received: by cust-asf.ponee.io (Postfix) id 9A987160BE3; Thu, 12 Oct 2017 00:13:40 +0000 (UTC) Delivered-To: archive-asf-public@cust-asf.ponee.io Received: from mail.apache.org (hermes.apache.org [140.211.11.3]) by cust-asf.ponee.io (Postfix) with SMTP id B98B11609E5 for ; Thu, 12 Oct 2017 02:13:39 +0200 (CEST) Received: (qmail 15554 invoked by uid 500); 12 Oct 2017 00:13:39 -0000 Mailing-List: contact reviews-help@impala.incubator.apache.org; run by ezmlm Precedence: bulk List-Help: List-Unsubscribe: List-Post: List-Id: Delivered-To: mailing list reviews@impala.incubator.apache.org Received: (qmail 15543 invoked by uid 99); 12 Oct 2017 00:13:38 -0000 Received: from pnap-us-west-generic-nat.apache.org (HELO spamd1-us-west.apache.org) (209.188.14.142) by apache.org (qpsmtpd/0.29) with ESMTP; Thu, 12 Oct 2017 00:13:38 +0000 Received: from localhost (localhost [127.0.0.1]) by spamd1-us-west.apache.org (ASF Mail Server at spamd1-us-west.apache.org) with ESMTP id E46F7C0DA4 for ; Thu, 12 Oct 2017 00:13:37 +0000 (UTC) X-Virus-Scanned: Debian amavisd-new at spamd1-us-west.apache.org X-Spam-Flag: NO X-Spam-Score: 2.362 X-Spam-Level: ** X-Spam-Status: No, score=2.362 tagged_above=-999 required=6.31 tests=[HTML_MESSAGE=2, RDNS_DYNAMIC=0.363, SPF_PASS=-0.001] autolearn=disabled Received: from mx1-lw-us.apache.org ([10.40.0.8]) by localhost (spamd1-us-west.apache.org [10.40.0.7]) (amavisd-new, port 10024) with ESMTP id bhMFHtcZrEHZ for ; Thu, 12 Oct 2017 00:13:36 +0000 (UTC) Received: from ip-10-146-233-104.ec2.internal (ec2-75-101-130-251.compute-1.amazonaws.com [75.101.130.251]) by mx1-lw-us.apache.org (ASF Mail Server at mx1-lw-us.apache.org) with ESMTPS id 43F335FBE9 for ; Thu, 12 Oct 2017 00:13:36 +0000 (UTC) Received: from localhost (localhost [127.0.0.1]) by ip-10-146-233-104.ec2.internal (8.14.4/8.14.4) with ESMTP id v9C0DYhL015805; Thu, 12 Oct 2017 00:13:34 GMT Message-Id: <201710120013.v9C0DYhL015805@ip-10-146-233-104.ec2.internal> X-Gerrit-PatchSet: 6 Date: Thu, 12 Oct 2017 00:13:34 +0000 From: "Tim Armstrong (Code Review)" To: Thomas Tauber-Marshall , impala-cr@cloudera.com, reviews@impala.incubator.apache.org CC: Matthew Jacobs , Michael Ho , Mostafa Mokhtar , Lars Volker X-Gerrit-MessageType: comment Subject: =?UTF-8?Q?=5BImpala-ASF-CR=5D_IMPALA-4252=3A_Min-max_runtime_filters_for_Kudu=0A?= X-Gerrit-Change-Id: I02bad890f5b5f78388a3041bf38f89369b5e2f1c X-Gerrit-Change-Number: 7793 X-Gerrit-ChangeURL: X-Gerrit-Commit: f4c762e2e47c89ae74aa6282c0dca401eb0b1e66 In-Reply-To: References: X-Gerrit-Comment-Date: Thu, 12 Oct 2017 00:13:34 +0000 Reply-To: tarmstrong@cloudera.com, impala-cr@cloudera.com, lv@cloudera.com, marcelk@gmail.com, kwho@cloudera.com, mmokhtar@cloudera.com, tmarshall@cloudera.com, reviews@impala.incubator.apache.org, mjacobs@apache.org MIME-Version: 1.0 Content-Transfer-Encoding: 8bit Content-Disposition: inline User-Agent: Gerrit/2.14.2 Content-Type: multipart/alternative; boundary="ErzotV+v8rM="; charset=UTF-8 archived-at: Thu, 12 Oct 2017 00:13:40 -0000 --ErzotV+v8rM= Content-Type: text/plain; charset=UTF-8 Content-Transfer-Encoding: quoted-printable Tim Armstrong has posted comments on this change=2E ( http://gerrit=2Ecloud= era=2Eorg:8080/7793 ) Change subject: IMPALA-4252: Min-max runtime filters= for Kudu =2E=2E=2E=2E=2E=2E=2E=2E=2E=2E=2E=2E=2E=2E=2E=2E=2E=2E=2E=2E=2E= =2E=2E=2E=2E=2E=2E=2E=2E=2E=2E=2E=2E=2E=2E=2E=2E=2E=2E=2E=2E=2E=2E=2E=2E=2E= =2E=2E=2E=2E=2E=2E=2E=2E=2E=2E=2E=2E=2E=2E=2E=2E=2E=2E=2E=2E=2E=2E=2E=2E = Patch Set 6: (17 comments) http://gerrit=2Ecloudera=2Eorg:8080/#/c/7793/6= //COMMIT_MSG Commit Message: http://gerrit=2Ecloudera=2Eorg:8080/#/c/7793/= 6//COMMIT_MSG@15 PS6, Line 15: them into the Kudu client=2E Because the Kud= u client doesn't provide a Is there any documentation for Kudu about what o= rdering Kudu uses for comparisons for different types? I guess we already p= ush filters so we're already depending on the orderings being the same as I= mpala, but I'd be interested to know=2E We had a similar discussion about= Parquet and the spec got clarified a lot as a result (https://github=2Ecom= /apache/parquet-format/blob/master/LogicalTypes=2Emd now specifies it)=2E P= arquet-MR also had various odd behaviour that got shaken out as a result of= this=2E http://gerrit=2Ecloudera=2Eorg:8080/#/c/7793/6/be/src/exec/filte= r-context=2Eh File be/src/exec/filter-context=2Eh: http://gerrit=2Eclouder= a=2Eorg:8080/#/c/7793/6/be/src/exec/filter-context=2Eh@106 PS6, Line 106: b= loom_filter has_bloom_filter() ? At callsites it reads like this should ret= urn a filter rather than a bool=2E http://gerrit=2Ecloudera=2Eorg:8080/#/= c/7793/6/be/src/exec/filter-context=2Eh@109 PS6, Line 109: min_max_filter h= as_min_max_filter? http://gerrit=2Ecloudera=2Eorg:8080/#/c/7793/6/be/src/= runtime/coordinator=2Ecc File be/src/runtime/coordinator=2Ecc: http://gerr= it=2Ecloudera=2Eorg:8080/#/c/7793/6/be/src/runtime/coordinator=2Ecc@1220 PS= 6, Line 1220: MinMaxFilter::Or(src_type(), params=2Emin_max_filter, m= in_max_filter_=2Eget()); I mentioned this in a later comment, but we should= consider what happens if the min/max values are very large - maybe thre's = some memory management similar to above=2E http://gerrit=2Ecloudera=2Eorg= :8080/#/c/7793/6/be/src/runtime/runtime-filter-bank=2Ecc File be/src/runtim= e/runtime-filter-bank=2Ecc: http://gerrit=2Ecloudera=2Eorg:8080/#/c/7793/6= /be/src/runtime/runtime-filter-bank=2Ecc@235 PS6, Line 235: MinMaxFilter*= min_max_filter =3D MinMaxFilter::Create(type, &obj_pool_); (nit) - could c= ombine into one line=2E http://gerrit=2Ecloudera=2Eorg:8080/#/c/7793/6/be= /src/runtime/runtime-filter-ir=2Ecc File be/src/runtime/runtime-filter-ir= =2Ecc: http://gerrit=2Ecloudera=2Eorg:8080/#/c/7793/6/be/src/runtime/runti= me-filter-ir=2Ecc@27 PS6, Line 27: // to a) the atomicity of / pointer as= signments and b) the x86 TSO memory model=2E Comment not relevant any more = since we're using actual atomics? http://gerrit=2Ecloudera=2Eorg:8080/#/c= /7793/6/be/src/util/min-max-filter-ir=2Ecc File be/src/util/min-max-filter-= ir=2Ecc: http://gerrit=2Ecloudera=2Eorg:8080/#/c/7793/6/be/src/util/min-ma= x-filter-ir=2Ecc@53 PS6, Line 53: min_str =3D string(value->ptr, valu= e->len); We should think through the behaviour with very large strings=2E I= t may make sense to disable the filters if the strings are too large instea= d of allocating arbitrarily large amounts of memory=2E Maybe we could also= do some trickery with truncating the strings to avoid handling the filters= not existing=2E E=2Eg=2E if we have a 4 byte limit on min/max, replace min= =3D"aaaaaaa" with min=3D"aaaa" and max=3D"zzzzzz" with max=3D"zzz("=2E Inte= rested to hear what you think - might be too clever but seems like it could= work=2E It would be good to also allocate tracked memory for the string= =2E The Parquet ColumnStats class solves a similar problem using StringBuff= ers that allocate from a MemPool=2E http://gerrit=2Ecloudera=2Eorg:8080/#= /c/7793/6/be/src/util/min-max-filter=2Eh File be/src/util/min-max-filter=2E= h: http://gerrit=2Ecloudera=2Eorg:8080/#/c/7793/6/be/src/util/min-max-filt= er=2Eh@83 PS6, Line 83: #define NUMERIC_MIN_MAX_FILTER(NAME, TYPE) = \ Yeah, the macro seems like it may be the least of the evi= ls=2E http://gerrit=2Ecloudera=2Eorg:8080/#/c/7793/6/be/src/util/min-max-= filter=2Ecc File be/src/util/min-max-filter=2Ecc: http://gerrit=2Ecloudera= =2Eorg:8080/#/c/7793/6/be/src/util/min-max-filter=2Ecc@77 PS6, Line 77: s= td::string NAME##MinMaxFilter::DebugString() const { = \ nit: don't need std:: prefix in =2Ecc files=2E http://gerrit=2Eclouder= a=2Eorg:8080/#/c/7793/6/be/src/util/min-max-filter=2Ecc@139 PS6, Line 139: = out->min=2E__set_string_val(std::min(in=2Emin=2Estring_val, out->min= =2Estring_val)); I feel like we should use our StringValue::Compare() funct= ion instead of relying on std::string's comparison function=2E I'm not sure= if they are exactly the same but if we compare these the same way as we co= mpare StringValue's it's easier to reason about=2E http://gerrit=2Ecloude= ra=2Eorg:8080/#/c/7793/6/be/src/util/min-max-filter=2Ecc@291 PS6, Line 291:= BigIntMinMaxFilter::Or(in, out); Shouldn't this be calling the times= tamp method? http://gerrit=2Ecloudera=2Eorg:8080/#/c/7793/6/be/src/util/m= in-max-filter=2Ecc@327 PS6, Line 327: } Timestamp? http://gerrit=2Eclou= dera=2Eorg:8080/#/c/7793/6/fe/src/main/java/org/apache/impala/planner/Runti= meFilterGenerator=2Ejava File fe/src/main/java/org/apache/impala/planner/Ru= ntimeFilterGenerator=2Ejava: http://gerrit=2Ecloudera=2Eorg:8080/#/c/7793/= 6/fe/src/main/java/org/apache/impala/planner/RuntimeFilterGenerator=2Ejava@= 379 PS6, Line 379: if (slotRef =3D=3D null || slotRef=2EgetDesc()= =2EgetColumn() =3D=3D null Are there planner tests for all these cases? h= ttp://gerrit=2Ecloudera=2Eorg:8080/#/c/7793/6/testdata/workloads/functional= -query/queries/QueryTest/runtime_filters=2Etest File testdata/workloads/fun= ctional-query/queries/QueryTest/runtime_filters=2Etest: http://gerrit=2Ecl= oudera=2Eorg:8080/#/c/7793/6/testdata/workloads/functional-query/queries/Qu= eryTest/runtime_filters=2Etest@30 PS6, Line 30: kudu Can we make this somet= hing like table_format=3Dkudu so that it's a bit more self-describing and f= uture-proof? http://gerrit=2Ecloudera=2Eorg:8080/#/c/7793/6/tests/query_t= est/test_runtime_filters=2Epy File tests/query_test/test_runtime_filters=2E= py: http://gerrit=2Ecloudera=2Eorg:8080/#/c/7793/6/tests/query_test/test_r= untime_filters=2Epy@61 PS6, Line 61: lambda v: v=2Eget_value('table_f= ormat')=2Efile_format not in ['hbase', 'kudu']) nit: 4 character indents on= line continuations (I guess this is preserved from the existing code)=2E = http://gerrit=2Ecloudera=2Eorg:8080/#/c/7793/6/tests/util/test_file_parser= =2Epy File tests/util/test_file_parser=2Epy: http://gerrit=2Ecloudera=2Eor= g:8080/#/c/7793/6/tests/util/test_file_parser=2Epy@252 PS6, Line 252: = allowed_formats =3D ['kudu'] Thanks for adding this validation, hopefu= lly will lead people in the right direction if they need to generalise this= =2E http://gerrit=2Ecloudera=2Eorg:8080/#/c/7793/6/tests/util/test_file_p= arser=2Epy@254 PS6, Line 254: hint hint? Isn't it a table format? -- To= view, visit http://gerrit=2Ecloudera=2Eorg:8080/7793 To unsubscribe, visit= http://gerrit=2Ecloudera=2Eorg:8080/settings Gerrit-Project: Impala-ASF G= errit-Branch: master Gerrit-MessageType: comment Gerrit-Change-Id: I02bad89= 0f5b5f78388a3041bf38f89369b5e2f1c Gerrit-Change-Number: 7793 Gerrit-PatchSe= t: 6 Gerrit-Owner: Thomas Tauber-Marshall Gerrit= -Reviewer: Lars Volker Gerrit-Reviewer: Matthew Jacobs = Gerrit-Reviewer: Michael Ho Ge= rrit-Reviewer: Mostafa Mokhtar Gerrit-Reviewer: T= homas Tauber-Marshall Gerrit-Reviewer: Tim Armst= rong Gerrit-Comment-Date: Thu, 12 Oct 2017 00:1= 3:34 +0000 Gerrit-HasComments: Yes --ErzotV+v8rM=--