impala-reviews mailing list archives

Site index · List index
Message view « Date » · « Thread »
Top « Date » · « Thread »
From "Matthew Jacobs (Code Review)" <ger...@cloudera.org>
Subject [Impala-ASF-CR] IMPALA-4787: Optimize APPX MEDIAN() memory usage
Date Tue, 21 Feb 2017 22:12:11 GMT
Matthew Jacobs has posted comments on this change.

Change subject: IMPALA-4787: Optimize APPX_MEDIAN() memory usage
......................................................................


Patch Set 5:

(5 comments)

http://gerrit.cloudera.org:8080/#/c/6025/2/be/src/exprs/aggregate-functions-ir.cc
File be/src/exprs/aggregate-functions-ir.cc:

PS2, Line 1018: Returns a pointer to one p
> Make this one start with a capital letter. But other ones like begin() shou
I see that you're matching some casing with c++ style iterators, but we do follow the C++
style guide [1], which states you'd use camel case for functions unless it's a simple accessor/mutator,
which can have the lower-cased names.

1: https://google.github.io/styleguide/cppguide.html#Function_Names


http://gerrit.cloudera.org:8080/#/c/6025/3/be/src/exprs/aggregate-functions-ir.cc
File be/src/exprs/aggregate-functions-ir.cc:

PS3, Line 1025: 
2 now


PS3, Line 1076: 
2


http://gerrit.cloudera.org:8080/#/c/6025/5/be/src/exprs/aggregate-functions-ir.cc
File be/src/exprs/aggregate-functions-ir.cc:

Line 1121:     int r = rand() % state->num_samples;
> I know you didn't author this line, but could you switch to a better PRNG t
There is a comment in the code about upgrading to mersenne twister, I think we can if it's
now baked into c++11. However, what is the state/struct? Right now we keep the 'ranlux64_3
rng' state on the ResSampState, and we'd need to do the same for mt19937_64. We'd need to
check that is reasonably sized. It might be worth doing that in a separate patch anyway.


PS5, Line 1122: ((double) state->source_size - r) / state->source_size
> I'm not yet convinced this is correct. I'm not convinced it is correct in H
The point of the big comment above is that this is not exact, and we're approximating the
proper sampling in order to avoid maintaining a heap. The point is that this is picking some
number proportional to the size of the source input. I'm sure you can find mathematical flaws
in this approximation, and It's not clear to me how bad the error is, but the original use
of this function was to generate coarse histograms so we really wanted to optimize for that
use case, especially knowing that planner heuristics are only heuristics to begin with. Over
time we've come to want more from this code, so it's probably worth reevaluating whether or
not this approach still makes sense. There's a reason we even call the median function "APPX_MEDIAN".
I don't think we give any guarantees beyond 'it is approximate': https://www.cloudera.com/documentation/enterprise/latest/topics/impala_appx_median.html


-- 
To view, visit http://gerrit.cloudera.org:8080/6025
To unsubscribe, visit http://gerrit.cloudera.org:8080/settings

Gerrit-MessageType: comment
Gerrit-Change-Id: I99adaad574d4fb0a3cf38c6cbad8b2a23df12968
Gerrit-PatchSet: 5
Gerrit-Project: Impala-ASF
Gerrit-Branch: master
Gerrit-Owner: Taras Bobrovytsky <tbobrovytsky@cloudera.com>
Gerrit-Reviewer: Alex Behm <alex.behm@cloudera.com>
Gerrit-Reviewer: Jim Apple <jbapple-impala@apache.org>
Gerrit-Reviewer: Marcel Kornacker <marcel@cloudera.com>
Gerrit-Reviewer: Matthew Jacobs <mj@cloudera.com>
Gerrit-Reviewer: Mostafa Mokhtar <mmokhtar@cloudera.com>
Gerrit-Reviewer: Taras Bobrovytsky <tbobrovytsky@cloudera.com>
Gerrit-HasComments: Yes

Mime
View raw message