impala-reviews mailing list archives

Site index · List index
Message view « Date » · « Thread »
Top « Date » · « Thread »
From "Dan Hecht (Code Review)" <ger...@cloudera.org>
Subject [Impala-ASF-CR] IMPALA-5310: Part 2: Add SAMPLED NDV() function.
Date Sat, 02 Dec 2017 00:20:45 GMT
Dan Hecht has posted comments on this change. ( http://gerrit.cloudera.org:8080/8569 )

Change subject: IMPALA-5310: Part 2: Add SAMPLED_NDV() function.
......................................................................


Patch Set 3:

(14 comments)

http://gerrit.cloudera.org:8080/#/c/8569/3/be/src/exprs/aggregate-functions-ir.cc
File be/src/exprs/aggregate-functions-ir.cc:

http://gerrit.cloudera.org:8080/#/c/8569/3/be/src/exprs/aggregate-functions-ir.cc@1443
PS3, Line 1443: bucket_
I don't understand why we have "bucket" as a prefix here.  Both the row_count and hll_state
are per bucket, right?


http://gerrit.cloudera.org:8080/#/c/8569/3/be/src/exprs/aggregate-functions-ir.cc@1456
PS3, Line 1456: existing issues with passing constant arguments to all
              :   /// aggregation phases
what does that mean? is there a JIRA we can just reference?


http://gerrit.cloudera.org:8080/#/c/8569/3/be/src/exprs/aggregate-functions-ir.cc@1469
PS3, Line 1469:   int64_t* GetCountPtr(int bucket_idx) {
              :     return reinterpret_cast<int64_t*>(GetBucketPtr(bucket_idx));
              :   }
              : 
              :   uint8_t* GetHllPtr(int bucket_idx) {
              :     return GetBucketPtr(bucket_idx) + sizeof(int64_t);
              :   }
              : 
              :   uint8_t* GetBucketPtr(int bucket_idx) {
              :     return reinterpret_cast<uint8_t*>(this) +
              :         sizeof(SampledNdvState) + bucket_idx * BUCKET_SIZE;
              :   }
given that NUM_HLL_BUCKETS is predetermined, why not just use an array:

struct { int64_t row_count, uint8_t hll[HLL_LEN] } buckets[NUM_HLL_BUCKETS];


http://gerrit.cloudera.org:8080/#/c/8569/3/be/src/exprs/aggregate-functions-ir.cc@1500
PS3, Line 1500: void AggregateFunctions::SampledNdvUpdate(FunctionContext* ctx, const T&
src,
I think a quick comment explaining what we're doing here would be helpful, something like

Incorporate the row into one of the intermediate HLLs, which will be used by Finalize to generate
a set of the (x,y) points.


http://gerrit.cloudera.org:8080/#/c/8569/3/be/src/exprs/aggregate-functions-ir.cc@1503
PS3, Line 1503: state->total_row_count % SampledNdvState::NUM_HLL_BUCKETS;
I wonder if picking a random bucket is better? We talked a bit about this. Up to you whether
you think it's worth investigating.


http://gerrit.cloudera.org:8080/#/c/8569/3/be/src/exprs/aggregate-functions-ir.cc@1533
PS3, Line 1533:   int64_t counts[num_points];
              :   memset(counts, 0, num_points * sizeof(int64_t));
how about:

int64_t counts[num_points] = { 0 };


http://gerrit.cloudera.org:8080/#/c/8569/3/be/src/exprs/aggregate-functions-ir.cc@1542
PS3, Line 1542: diminishing returns from creating additional data points
not sure what that means. do you mean additional to 'num_points'? i.e. are you saying that
empirically, 'num_points' is a sufficient number of points?


http://gerrit.cloudera.org:8080/#/c/8569/3/be/src/exprs/aggregate-functions.h
File be/src/exprs/aggregate-functions.h:

http://gerrit.cloudera.org:8080/#/c/8569/3/be/src/exprs/aggregate-functions.h@209
PS3, Line 209: x/y
I think using a notation like "(x, y)" maybe clearer since this sounds like divide.


http://gerrit.cloudera.org:8080/#/c/8569/3/be/src/util/mpfit-util.h
File be/src/util/mpfit-util.h:

http://gerrit.cloudera.org:8080/#/c/8569/3/be/src/util/mpfit-util.h@74
PS3, Line 74:   /// Human-readable name of this function. Used for debugging.
should any of these members be private?


http://gerrit.cloudera.org:8080/#/c/8569/3/be/src/util/mpfit-util.cc
File be/src/util/mpfit-util.cc:

http://gerrit.cloudera.org:8080/#/c/8569/3/be/src/util/mpfit-util.cc@33
PS3, Line 33: function
but what is this function suppose to do?  Does it make sense to call it a "fitting function"?


http://gerrit.cloudera.org:8080/#/c/8569/3/fe/src/main/java/org/apache/impala/analysis/FunctionCallExpr.java
File fe/src/main/java/org/apache/impala/analysis/FunctionCallExpr.java:

http://gerrit.cloudera.org:8080/#/c/8569/3/fe/src/main/java/org/apache/impala/analysis/FunctionCallExpr.java@519
PS3, Line 519: children_.get(1)
samplePerc?


http://gerrit.cloudera.org:8080/#/c/8569/3/fe/src/main/java/org/apache/impala/analysis/FunctionCallExpr.java@524
PS3, Line 524: children_.get(1)
samplePerc?

why don't we have to do anything with the return value of uncheckedCastTo() (like set it as
the children_ at index 1)?


http://gerrit.cloudera.org:8080/#/c/8569/3/tests/query_test/test_aggregation.py
File tests/query_test/test_aggregation.py:

http://gerrit.cloudera.org:8080/#/c/8569/3/tests/query_test/test_aggregation.py@337
PS3, Line 337: * sample_perc
even with the comment, I'm not sure what's going on here.


http://gerrit.cloudera.org:8080/#/c/8569/3/tests/query_test/test_aggregation.py@339
PS3, Line 339:   def __appx_equals(self, a, b, diff_perc):
a quick comment for this would be helpful.



-- 
To view, visit http://gerrit.cloudera.org:8080/8569
To unsubscribe, visit http://gerrit.cloudera.org:8080/settings

Gerrit-Project: Impala-ASF
Gerrit-Branch: master
Gerrit-MessageType: comment
Gerrit-Change-Id: Ia51d56ee67ec6073e92f90bebb4005484138b820
Gerrit-Change-Number: 8569
Gerrit-PatchSet: 3
Gerrit-Owner: Alex Behm <alex.behm@cloudera.com>
Gerrit-Reviewer: Alex Behm <alex.behm@cloudera.com>
Gerrit-Reviewer: Bharath Vissapragada <bharathv@cloudera.com>
Gerrit-Reviewer: Dan Hecht <dhecht@cloudera.com>
Gerrit-Reviewer: Jim Apple <jbapple-impala@apache.org>
Gerrit-Comment-Date: Sat, 02 Dec 2017 00:20:45 +0000
Gerrit-HasComments: Yes

Mime
  • Unnamed multipart/alternative (inline, 8-Bit, 0 bytes)
View raw message