Message-Id: <201604280340.u3S3ewR7007034@ip-10-146-233-104.ec2.internal>
Date: Thu, 28 Apr 2016 03:40:58 +0000
From: "Henry Robinson (Code Review)"
To: impala-cr@cloudera.com, dev@impala.incubator.apache.org
CC: Marcel Kornacker, Mostafa Mokhtar
Reply-To: henry@cloudera.com
Subject: [Impala-CR](cdh5-trunk) IMPALA-3007: Adjust Bloom Filter size according to NDV estimate
X-Gerrit-MessageType: comment
X-Gerrit-Change-Id: I1fe37b8d4cfb3c52bb8e8cf0ca55e92665b87803
X-Gerrit-Commit: d1deb61b24a50d1c96d21c24fc85f30ebf2958de

Henry Robinson has posted comments on this change.

Change subject: IMPALA-3007: Adjust Bloom Filter size according to NDV estimate
......................................................................

Patch Set 1:

(12 comments)

http://gerrit.cloudera.org:8080/#/c/2812/1/be/src/exec/hash-join-node.cc
File be/src/exec/hash-join-node.cc:

Line 230: hash_tbl_->AddBloomFilters();
> huh?
Could you be a bit more descriptive? :) All the logic for checking FP rates has gone into AddBloomFilters().

http://gerrit.cloudera.org:8080/#/c/2812/1/be/src/exec/hdfs-scan-node.cc
File be/src/exec/hdfs-scan-node.cc:

Line 156: uint32_t log_space = state->filter_bank()->GetLogSpaceForNdv(filter.ndv_estimate);
> why not have a GetFilterByteSize() or something like that. the scan node sh
Done.

http://gerrit.cloudera.org:8080/#/c/2812/1/be/src/exec/old-hash-table.cc
File be/src/exec/old-hash-table.cc:

Line 148: filters_[i]->filter_desc().filter_id);
> this takes the capacity, not an id.
Done

http://gerrit.cloudera.org:8080/#/c/2812/1/be/src/exec/partitioned-hash-join-node.cc
File be/src/exec/partitioned-hash-join-node.cc:

Line 500: state->filter_bank()->FpRateTooHigh(ndv_estimate, total_build_rows);
> just as an aside: instead of looking at build rows, which is indirect, why
Done

http://gerrit.cloudera.org:8080/#/c/2812/1/be/src/runtime/runtime-filter.cc
File be/src/runtime/runtime-filter.cc:

Line 171: uint64_t required_space =
> let's not use unsigned ints
Done

http://gerrit.cloudera.org:8080/#/c/2812/1/be/src/runtime/runtime-filter.h
File be/src/runtime/runtime-filter.h:

Line 77: /// expected false-positive rate would be larger than allowed by
> "a filter's expected false-positive rate would exceed flags_max_filter_erro
Done

Line 79: bool FpRateTooHigh(uint64_t expected_ndv, uint64_t observed_ndv);
> role of expected_ndv unclear
Done

Line 94: BloomFilter* AllocateScratchBloomFilter(int64_t ndv_estimate);
> instead of continuing to talk about ndv and estimates, which are fe concept
Why do you feel the NDV is an FE-only concept? In my opinion it's the key parameter in determining the size of the BF.

http://gerrit.cloudera.org:8080/#/c/2812/1/fe/src/main/java/com/cloudera/impala/planner/DistributedPlanner.java
File fe/src/main/java/com/cloudera/impala/planner/DistributedPlanner.java:

Line 415: filter.computeNdvEstimate();
> this also needs to happen for repartitioning joins
Done (as a result of the previous patch, which refactored this method).

http://gerrit.cloudera.org:8080/#/c/2812/1/fe/src/main/java/com/cloudera/impala/planner/RuntimeFilterGenerator.java
File fe/src/main/java/com/cloudera/impala/planner/RuntimeFilterGenerator.java:

Line 113: // Estimate of the number of distinct values that will be inserted into this filter.
> explain meaning of -1
Done. Even for repartitioning joins we want the total NDV across all instances, since the filters will be merged.
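
To illustrate why the NDV is the natural input for sizing a Bloom filter, here is a minimal sketch of the textbook relationship between NDV, filter size, and expected false-positive rate. It assumes a classic Bloom filter with k independent hash functions, for which the expected FP rate is (1 - e^(-k*n/m))^k with n distinct values and m bits. The function names, signatures, and the value of kNumHashes below are hypothetical and chosen for illustration; they are not the code in this patch, and the actual filter in the patch may use a different formula.

```cpp
#include <cmath>
#include <cstdint>

// Hypothetical helpers for illustration only; NOT the patch's implementation.
// Assumption: a classic Bloom filter with kNumHashes independent hash functions.
constexpr int kNumHashes = 8;  // assumed value, not taken from the patch

// Expected false-positive rate after inserting `ndv` distinct values into a
// filter of 2^log_space_bytes bytes: (1 - e^(-k*n/m))^k, where m is in bits.
double ExpectedFpRate(int64_t ndv, int log_space_bytes) {
  const double bits = 8.0 * std::ldexp(1.0, log_space_bytes);
  const double fill = 1.0 - std::exp(-kNumHashes * static_cast<double>(ndv) / bits);
  return std::pow(fill, kNumHashes);
}

// Smallest log2(size in bytes) whose expected FP rate is at most max_fp_rate.
// The search is capped at 2^40 bytes so the loop always terminates.
int MinLogSpaceForNdv(int64_t ndv, double max_fp_rate) {
  int log_space = 0;
  while (log_space < 40 && ExpectedFpRate(ndv, log_space) > max_fp_rate) {
    ++log_space;
  }
  return log_space;
}

// True if a filter of 2^log_space_bytes bytes, sized from an earlier estimate,
// would exceed max_fp_rate once `observed_ndv` distinct values are inserted.
bool FpRateTooHigh(int64_t observed_ndv, int log_space_bytes, double max_fp_rate) {
  return ExpectedFpRate(observed_ndv, log_space_bytes) > max_fp_rate;
}
```

Under these assumptions, a filter sized with MinLogSpaceForNdv() from the planner's estimate can later be checked with FpRateTooHigh() against the NDV actually observed at build time, and disabled if the estimate was too low.
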
http://gerrit.cloudera.org:8080/#/c/2812/1/testdata/workloads/functional-query/queries/QueryTest/runtime_filters.test
File testdata/workloads/functional-query/queries/QueryTest/runtime_filters.test:

Line 253: # Test case 11: filters with high expected FP rate get disabled.
> what does "expected" mean here?
That the rate of false positives when probing the filter is expected to be high. The actual FP rate is irrelevant.

http://gerrit.cloudera.org:8080/#/c/2812/1/testdata/workloads/functional-query/queries/QueryTest/runtime_filters_wait.test
File testdata/workloads/functional-query/queries/QueryTest/runtime_filters_wait.test:

Line 37: row_regex: .*0 of 1 Runtime Filters Produced.*
> i'm not sure about this error message, it makes it sound like something wen
What would you prefer? In a certain sense, something did go wrong (we guessed the wrong size for a filter, and it was enough of an underestimate that the filter was disabled).

--
To view, visit http://gerrit.cloudera.org:8080/2812
To unsubscribe, visit http://gerrit.cloudera.org:8080/settings

Gerrit-MessageType: comment
Gerrit-Change-Id: I1fe37b8d4cfb3c52bb8e8cf0ca55e92665b87803
Gerrit-PatchSet: 1
Gerrit-Project: Impala
Gerrit-Branch: cdh5-trunk
Gerrit-Owner: Henry Robinson
Gerrit-Reviewer: Henry Robinson
Gerrit-Reviewer: Marcel Kornacker
Gerrit-Reviewer: Mostafa Mokhtar
Gerrit-HasComments: Yes