impala-reviews mailing list archives

Site index · List index
Message view « Date » · « Thread »
Top « Date » · « Thread »
From "Dan Hecht (Code Review)" <>
Subject [Impala-ASF-CR] IMPALA-5310: Part 2: Add SAMPLED NDV() function.
Date Sat, 02 Dec 2017 00:20:45 GMT
Dan Hecht has posted comments on this change. ( )

Change subject: IMPALA-5310: Part 2: Add SAMPLED_NDV() function.

Patch Set 3:

File be/src/exprs/
PS3, Line 1443: bucket_
I don't understand why we have "bucket" as a prefix here.  Both the row_count and hll_state
are per bucket, right?
PS3, Line 1456: existing issues with passing constant arguments to all
              :   /// aggregation phases
what does that mean? is there a JIRA we can just reference?
PS3, Line 1469:   int64_t* GetCountPtr(int bucket_idx) {
              :     return reinterpret_cast<int64_t*>(GetBucketPtr(bucket_idx));
              :   }
              :   uint8_t* GetHllPtr(int bucket_idx) {
              :     return GetBucketPtr(bucket_idx) + sizeof(int64_t);
              :   }
              :   uint8_t* GetBucketPtr(int bucket_idx) {
              :     return reinterpret_cast<uint8_t*>(this) +
              :         sizeof(SampledNdvState) + bucket_idx * BUCKET_SIZE;
              :   }
given that NUM_HLL_BUCKETS is predetermined, why not just use an array:

struct { int64_t row_count, uint8_t hll[HLL_LEN] } buckets[NUM_HLL_BUCKETS];
PS3, Line 1500: void AggregateFunctions::SampledNdvUpdate(FunctionContext* ctx, const T&
I think a quick comment explaining what we're doing here would be helpful, something like

Incorporate the row into one of the intermediate HLLs, which will be used by Finalize to generate
a set of the (x,y) points.
PS3, Line 1503: state->total_row_count % SampledNdvState::NUM_HLL_BUCKETS;
I wonder if picking a random bucket is better? We talked a bit about this. Up to you whether
you think it's worth investigating.
PS3, Line 1533:   int64_t counts[num_points];
              :   memset(counts, 0, num_points * sizeof(int64_t));
how about:

int64_t counts[num_points] = { 0 };
PS3, Line 1542: diminishing returns from creating additional data points
not sure what that means. do you mean additional to 'num_points'? i.e. are you saying that
empirically, 'num_points' is a sufficient number of points?
File be/src/exprs/aggregate-functions.h:
PS3, Line 209: x/y
I think using a notation like "(x, y)" maybe clearer since this sounds like divide.
File be/src/util/mpfit-util.h:
PS3, Line 74:   /// Human-readable name of this function. Used for debugging.
should any of these members be private?
File be/src/util/
PS3, Line 33: function
but what is this function suppose to do?  Does it make sense to call it a "fitting function"?
File fe/src/main/java/org/apache/impala/analysis/
PS3, Line 519: children_.get(1)
PS3, Line 524: children_.get(1)

why don't we have to do anything with the return value of uncheckedCastTo() (like set it as
the children_ at index 1)?
File tests/query_test/
PS3, Line 337: * sample_perc
even with the comment, I'm not sure what's going on here.
PS3, Line 339:   def __appx_equals(self, a, b, diff_perc):
a quick comment for this would be helpful.

To view, visit
To unsubscribe, visit

Gerrit-Project: Impala-ASF
Gerrit-Branch: master
Gerrit-MessageType: comment
Gerrit-Change-Id: Ia51d56ee67ec6073e92f90bebb4005484138b820
Gerrit-Change-Number: 8569
Gerrit-PatchSet: 3
Gerrit-Owner: Alex Behm <>
Gerrit-Reviewer: Alex Behm <>
Gerrit-Reviewer: Bharath Vissapragada <>
Gerrit-Reviewer: Dan Hecht <>
Gerrit-Reviewer: Jim Apple <>
Gerrit-Comment-Date: Sat, 02 Dec 2017 00:20:45 +0000
Gerrit-HasComments: Yes

  • Unnamed multipart/alternative (inline, 8-Bit, 0 bytes)
View raw message