impala-reviews mailing list archives

Site index · List index
Message view « Date » · « Thread »
Top « Date » · « Thread »
From "Taras Bobrovytsky (Code Review)" <>
Subject [Impala-ASF-CR] IMPALA-4787: Optimize APPX MEDIAN() memory usage
Date Fri, 17 Feb 2017 22:54:00 GMT
Taras Bobrovytsky has posted comments on this change.

Change subject: IMPALA-4787: Optimize APPX_MEDIAN() memory usage

Patch Set 2:


Marcel, I will run the targeted perf query later today.
File be/src/exprs/

PS2, Line 963: max_num_samples
> confusing to have MAX_NUM_SAMPLES as well. maybe this should be capacity

PS2, Line 973: { return begin() + idx; }
> it'd be good to add bounds checks w/ dchecks

PS2, Line 984: max_num_samples
> capacity

Line 996:   bool IsFull() const {
> dcheck that num_samples <= capacity, i.e. that it is NOT over

PS2, Line 990:   size_t byte_size() const {
             :     return byte_size(max_num_samples);
             :   }
             :   // Returns true if the array of ReservoirSamples that follows this struct
in memory is
             :   // full, and no more elements can be pushed back without resizing.
             :   bool IsFull() const {
             :     return num_samples == max_num_samples;
             :   }
> i think both of these can be one line
One lined one of them, added a DCHECK to the other.

PS2, Line 1001: // make this const
> ?

PS2, Line 1018: resizeReservoirSampleState
> nit: inconsistent casing in fn names
Make this one start with a capital letter. But other ones like begin() should probably start
with a lowercase.

PS2, Line 1021: N_

PS2, Line 1024: new_size 
> new_capacity to avoid conflating with memory sizes

PS2, Line 1024: new_size 
> Queries with a large number buckets will run out of memory much quicker tha

PS2, Line 1024:   int new_size = state->max_num_samples * 10;
              :   DCHECK_EQ(state->max_num_samples, state->num_samples);
              :   DCHECK_LE(new_size, MAX_NUM_SAMPLES);
> Looks like it'll dcheck when capacity reaches MAX_NUM_SAMPLES. Shouldn't th
In this implementation it couldn't DCHECK, because we would start with capacity 20, then go
to 200, 2000, and finally 20000. This is now changed because Mostafa suggested doubling instead.

PS2, Line 1036:   state->max_num_samples = new_size;
> why don't any other fields need to be initialized? e.g. the rng will be gar
The previous state gets copied in ReallocBuffer. The only thing we need to change is is the
capacity. Added comment.

PS2, Line 1048: size
> again I'd vote for capacity
I got rid of this variable.

PS2, Line 1058:   *state = ReservoirSampleState<T>();
> this looks like it'll initialize the memory properly. I think we'd probably
We don't need to do that on line 1035 because everything gets copied over in ReallocBuffer()

PS2, Line 1084:   ++state->source_size;
> I think this gets lost when you allocate a new State.
As mentioned above, everything is copied in realloc

PS2, Line 1152: ReservoirSampleMerge
> I think this algorithm will require some changes to handle varying length S
The capacity shouldn't matter in this algorithm. This algorithm only cares about num samples
in src and dst states.

PS2, Line 1167: (dst->num_samples < MAX_NUM_SAMPLES
> I don't think you can rely on this anymore, dst may not have capacity MAX_N
If DST does not have enough capacity for MAX_NUM_SAMPLES, it will be resized on line 1170

To view, visit
To unsubscribe, visit

Gerrit-MessageType: comment
Gerrit-Change-Id: I99adaad574d4fb0a3cf38c6cbad8b2a23df12968
Gerrit-PatchSet: 2
Gerrit-Project: Impala-ASF
Gerrit-Branch: master
Gerrit-Owner: Taras Bobrovytsky <>
Gerrit-Reviewer: Alex Behm <>
Gerrit-Reviewer: Marcel Kornacker <>
Gerrit-Reviewer: Matthew Jacobs <>
Gerrit-Reviewer: Mostafa Mokhtar <>
Gerrit-Reviewer: Taras Bobrovytsky <>
Gerrit-HasComments: Yes

View raw message