impala-reviews mailing list archives

Site index · List index
Message view « Date » · « Thread »
Top « Date » · « Thread »
From "Matthew Jacobs (Code Review)" <>
Subject [Impala-ASF-CR] IMPALA-4787: Optimize APPX MEDIAN() memory usage
Date Fri, 17 Feb 2017 01:57:07 GMT
Matthew Jacobs has posted comments on this change.

Change subject: IMPALA-4787: Optimize APPX_MEDIAN() memory usage

Patch Set 2:


this probably needs some new test cases, especially cases where there are multiple pre-aggs
where, e.g. one has very few rows and another has 20k or more rows. I think this may break
the merge. This might be doable by having a table with files of significantly varying sizes.
File be/src/exprs/

Line 974: 
> Done
it'd be good to add some bounds dchecks though
File be/src/exprs/

PS2, Line 963: max_num_samples
confusing to have MAX_NUM_SAMPLES as well. maybe this should be capacity

PS2, Line 973: { return begin() + idx; }
it'd be good to add bounds checks w/ dchecks

PS2, Line 984: max_num_samples

Line 996:   bool IsFull() const {
dcheck that num_samples <= capacity, i.e. that it is NOT over

PS2, Line 990:   size_t byte_size() const {
             :     return byte_size(max_num_samples);
             :   }
             :   // Returns true if the array of ReservoirSamples that follows this struct
in memory is
             :   // full, and no more elements can be pushed back without resizing.
             :   bool IsFull() const {
             :     return num_samples == max_num_samples;
             :   }
i think both of these can be one line

PS2, Line 1001: // make this const

PS2, Line 1018: resizeReservoirSampleState
nit: inconsistent casing in fn names

PS2, Line 1021: N_

PS2, Line 1024: new_size 
new_capacity to avoid conflating with memory sizes

PS2, Line 1024:   int new_size = state->max_num_samples * 10;
              :   DCHECK_EQ(state->max_num_samples, state->num_samples);
              :   DCHECK_LE(new_size, MAX_NUM_SAMPLES);
Looks like it'll dcheck when capacity reaches MAX_NUM_SAMPLES. Shouldn't this be taking the
min() or something like that?

PS2, Line 1036:   state->max_num_samples = new_size;
why don't any other fields need to be initialized? e.g. the rng will be garbage.

I also think you need to copy over the prev 'source_size'.

PS2, Line 1048: size
again I'd vote for capacity

PS2, Line 1058:   *state = ReservoirSampleState<T>();
this looks like it'll initialize the memory properly. I think we'd probably need to do this
at l1035

PS2, Line 1084:   ++state->source_size;
I think this gets lost when you allocate a new State.

PS2, Line 1152: ReservoirSampleMerge
I think this algorithm will require some changes to handle varying length States.

PS2, Line 1167: (dst->num_samples < MAX_NUM_SAMPLES
I don't think you can rely on this anymore, dst may not have capacity MAX_NUM_SAMPLES

To view, visit
To unsubscribe, visit

Gerrit-MessageType: comment
Gerrit-Change-Id: I99adaad574d4fb0a3cf38c6cbad8b2a23df12968
Gerrit-PatchSet: 2
Gerrit-Project: Impala-ASF
Gerrit-Branch: master
Gerrit-Owner: Taras Bobrovytsky <>
Gerrit-Reviewer: Alex Behm <>
Gerrit-Reviewer: Matthew Jacobs <>
Gerrit-Reviewer: Taras Bobrovytsky <>
Gerrit-HasComments: Yes

View raw message