Return-Path: X-Original-To: archive-asf-public-internal@cust-asf2.ponee.io Delivered-To: archive-asf-public-internal@cust-asf2.ponee.io Received: from cust-asf.ponee.io (cust-asf.ponee.io [163.172.22.183]) by cust-asf2.ponee.io (Postfix) with ESMTP id 2ECE2200C1E for ; Fri, 17 Feb 2017 23:54:06 +0100 (CET) Received: by cust-asf.ponee.io (Postfix) id 2D953160B46; Fri, 17 Feb 2017 22:54:06 +0000 (UTC) Delivered-To: archive-asf-public@cust-asf.ponee.io Received: from mail.apache.org (hermes.apache.org [140.211.11.3]) by cust-asf.ponee.io (Postfix) with SMTP id 75BF3160B57 for ; Fri, 17 Feb 2017 23:54:05 +0100 (CET) Received: (qmail 88557 invoked by uid 500); 17 Feb 2017 22:54:04 -0000 Mailing-List: contact reviews-help@impala.incubator.apache.org; run by ezmlm Precedence: bulk List-Help: List-Unsubscribe: List-Post: List-Id: Delivered-To: mailing list reviews@impala.incubator.apache.org Received: (qmail 88526 invoked by uid 99); 17 Feb 2017 22:54:04 -0000 Received: from pnap-us-west-generic-nat.apache.org (HELO spamd2-us-west.apache.org) (209.188.14.142) by apache.org (qpsmtpd/0.29) with ESMTP; Fri, 17 Feb 2017 22:54:04 +0000 Received: from localhost (localhost [127.0.0.1]) by spamd2-us-west.apache.org (ASF Mail Server at spamd2-us-west.apache.org) with ESMTP id D8DDB1A0C86 for ; Fri, 17 Feb 2017 22:54:03 +0000 (UTC) X-Virus-Scanned: Debian amavisd-new at spamd2-us-west.apache.org X-Spam-Flag: NO X-Spam-Score: 0.362 X-Spam-Level: X-Spam-Status: No, score=0.362 tagged_above=-999 required=6.31 tests=[RDNS_DYNAMIC=0.363, SPF_PASS=-0.001] autolearn=disabled Received: from mx1-lw-us.apache.org ([10.40.0.8]) by localhost (spamd2-us-west.apache.org [10.40.0.9]) (amavisd-new, port 10024) with ESMTP id mm5vnqtEo4UX for ; Fri, 17 Feb 2017 22:54:01 +0000 (UTC) Received: from ip-10-146-233-104.ec2.internal (ec2-75-101-130-251.compute-1.amazonaws.com [75.101.130.251]) by mx1-lw-us.apache.org (ASF Mail Server at mx1-lw-us.apache.org) with ESMTPS id AC3135FD3B for ; Fri, 17 Feb 2017 22:54:01 +0000 (UTC) Received: from localhost (localhost [127.0.0.1]) by ip-10-146-233-104.ec2.internal (8.14.4/8.14.4) with ESMTP id v1HMs0U0010582; Fri, 17 Feb 2017 22:54:00 GMT Message-Id: <201702172254.v1HMs0U0010582@ip-10-146-233-104.ec2.internal> Date: Fri, 17 Feb 2017 22:54:00 +0000 From: "Taras Bobrovytsky (Code Review)" To: impala-cr@cloudera.com, reviews@impala.incubator.apache.org CC: Matthew Jacobs , Alex Behm , Marcel Kornacker , Mostafa Mokhtar Reply-To: tbobrovytsky@cloudera.com X-Gerrit-MessageType: comment Subject: =?UTF-8?Q?=5BImpala-ASF-CR=5D_IMPALA-4787=3A_Optimize_APPX_MEDIAN=28=29_memory_usage=0A?= X-Gerrit-Change-Id: I99adaad574d4fb0a3cf38c6cbad8b2a23df12968 X-Gerrit-ChangeURL: X-Gerrit-Commit: fb73d366d418b37d7309f5ecaaaf5b0ecf8c1077 In-Reply-To: References: MIME-Version: 1.0 Content-Type: text/plain; charset=UTF-8 Content-Transfer-Encoding: 8bit Content-Disposition: inline User-Agent: Gerrit/2.12.7 archived-at: Fri, 17 Feb 2017 22:54:06 -0000 Taras Bobrovytsky has posted comments on this change. Change subject: IMPALA-4787: Optimize APPX_MEDIAN() memory usage ...................................................................... Patch Set 2: (17 comments) Marcel, I will run the targeted perf query later today. http://gerrit.cloudera.org:8080/#/c/6025/2/be/src/exprs/aggregate-functions-ir.cc File be/src/exprs/aggregate-functions-ir.cc: PS2, Line 963: max_num_samples > confusing to have MAX_NUM_SAMPLES as well. maybe this should be capacity Done PS2, Line 973: { return begin() + idx; } > it'd be good to add bounds checks w/ dchecks Done PS2, Line 984: max_num_samples > capacity Done Line 996: bool IsFull() const { > dcheck that num_samples <= capacity, i.e. that it is NOT over Done PS2, Line 990: size_t byte_size() const { : return byte_size(max_num_samples); : } : : // Returns true if the array of ReservoirSamples that follows this struct in memory is : // full, and no more elements can be pushed back without resizing. : bool IsFull() const { : return num_samples == max_num_samples; : } > i think both of these can be one line One lined one of them, added a DCHECK to the other. PS2, Line 1001: // make this const > ? Done PS2, Line 1018: resizeReservoirSampleState > nit: inconsistent casing in fn names Make this one start with a capital letter. But other ones like begin() should probably start with a lowercase. PS2, Line 1021: N_ > NUM Done PS2, Line 1024: new_size > new_capacity to avoid conflating with memory sizes Done PS2, Line 1024: new_size > Queries with a large number buckets will run out of memory much quicker tha Done PS2, Line 1024: int new_size = state->max_num_samples * 10; : DCHECK_EQ(state->max_num_samples, state->num_samples); : DCHECK_LE(new_size, MAX_NUM_SAMPLES); > Looks like it'll dcheck when capacity reaches MAX_NUM_SAMPLES. Shouldn't th In this implementation it couldn't DCHECK, because we would start with capacity 20, then go to 200, 2000, and finally 20000. This is now changed because Mostafa suggested doubling instead. PS2, Line 1036: state->max_num_samples = new_size; > why don't any other fields need to be initialized? e.g. the rng will be gar The previous state gets copied in ReallocBuffer. The only thing we need to change is is the capacity. Added comment. PS2, Line 1048: size > again I'd vote for capacity I got rid of this variable. PS2, Line 1058: *state = ReservoirSampleState(); > this looks like it'll initialize the memory properly. I think we'd probably We don't need to do that on line 1035 because everything gets copied over in ReallocBuffer() PS2, Line 1084: ++state->source_size; > I think this gets lost when you allocate a new State. As mentioned above, everything is copied in realloc PS2, Line 1152: ReservoirSampleMerge > I think this algorithm will require some changes to handle varying length S The capacity shouldn't matter in this algorithm. This algorithm only cares about num samples in src and dst states. PS2, Line 1167: (dst->num_samples < MAX_NUM_SAMPLES > I don't think you can rely on this anymore, dst may not have capacity MAX_N If DST does not have enough capacity for MAX_NUM_SAMPLES, it will be resized on line 1170 -- To view, visit http://gerrit.cloudera.org:8080/6025 To unsubscribe, visit http://gerrit.cloudera.org:8080/settings Gerrit-MessageType: comment Gerrit-Change-Id: I99adaad574d4fb0a3cf38c6cbad8b2a23df12968 Gerrit-PatchSet: 2 Gerrit-Project: Impala-ASF Gerrit-Branch: master Gerrit-Owner: Taras Bobrovytsky Gerrit-Reviewer: Alex Behm Gerrit-Reviewer: Marcel Kornacker Gerrit-Reviewer: Matthew Jacobs Gerrit-Reviewer: Mostafa Mokhtar Gerrit-Reviewer: Taras Bobrovytsky Gerrit-HasComments: Yes