drill-dev mailing list archives

From Boaz Ben-Zvi <bben-...@mapr.com>
Subject Re: Drill: Memory Spilling for the Hash Aggregate Operator
Date Thu, 19 Jan 2017 22:24:39 GMT
If the number of (sub?) partitions equals the (fixed!) available memory size divided by the
block size, then spilling would have to occur one block at a time, whenever any block becomes full.
This has a drawback when spilling to a hard drive - each block would likely be placed on
disk far from the previously written block (for that partition), hence when the whole partition
is read back into memory, the read will incur lots of disk SEEKs !!
Spilling multiple blocks at a time significantly mitigates this problem. (Or future SSD storage).
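To make the seek argument concrete, here is a back-of-envelope sketch; all the numbers (seek time, block size, partition size) are assumptions for illustration, not Drill constants:

```python
# Spilling one block at a time interleaves the partitions' blocks on
# disk, so reading one partition back costs roughly one seek per
# block; spilling K blocks at a time cuts the seek count by K.

SEEK_MS = 8.0          # assumed average HDD seek time
BLOCK_MB = 1           # assumed spill block size
PARTITION_MB = 512     # assumed size of one spilled partition

def read_back_seek_cost_ms(blocks_per_spill: int) -> float:
    """Seek cost to read one partition whose blocks were written
    in contiguous runs of `blocks_per_spill` blocks."""
    runs = PARTITION_MB // (BLOCK_MB * blocks_per_spill)
    return runs * SEEK_MS

one_at_a_time = read_back_seek_cost_ms(1)    # 512 runs of 1 block
batched = read_back_seek_cost_ms(32)         # 16 runs of 32 blocks
print(one_at_a_time, batched)
```

With these assumed numbers, batching 32 blocks per spill turns ~4 seconds of seeking into ~a tenth of a second per partition read-back.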

If all the sub-partitions are written into the same file, then how can a single (or a few) sub-partitions
be read without going through the whole file ?  Do the stats hold the offsets of each block
per partition, so that the reader “hops” to those offsets ?
This assumes that the size of the memory block matches the disk’s block (e.g., if the latter
is much bigger, then the disk “over-reads” and does not get high utilization).
Another partial solution could be to initially mix the sub-partitions’ writes into a single file,
and once some size limit is reached for one of the sub-partitions, continue spilling
that particular sub-partition into a separate file.
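The “offsets per sub-partition” idea above can be sketched as follows; the structures and names are hypothetical, not Drill’s actual spill-file format:

```python
import io

# Many sub-partitions share one spill file; a side index records the
# (offset, length) of every block per sub-partition, so the reader can
# seek straight to one sub-partition without scanning the whole file.

spill_file = io.BytesIO()   # stands in for a spill file on disk
index: dict[int, list[tuple[int, int]]] = {}   # sub-partition -> [(offset, len)]

def spill_block(sub_partition: int, block: bytes) -> None:
    offset = spill_file.tell()
    spill_file.write(block)
    index.setdefault(sub_partition, []).append((offset, len(block)))

def read_sub_partition(sub_partition: int) -> bytes:
    out = bytearray()
    for offset, length in index.get(sub_partition, []):
        spill_file.seek(offset)          # one "hop" per recorded block
        out += spill_file.read(length)
    return bytes(out)

# Blocks from different sub-partitions get interleaved in the file:
spill_block(0, b"aaaa")
spill_block(1, b"bbbb")
spill_block(0, b"cccc")
print(read_sub_partition(0))
```

Note each recorded block still costs one seek on read-back, which is exactly the “over-read”/utilization concern when memory blocks are smaller than the disk’s efficient transfer unit.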

About "some partition blocks never completed and written to disk”: if those belong to partitions
that were spilled, then they would likely have to be written to disk once all the original
input has been read.
That’s because once a partition is read back from disk, all of the available memory is likely
needed; having blocks of the following partitions just “hanging” there is a waste of memory.
Maybe with stats, we could pick the “smaller” partitions first, which would “clean up”
some of those partially filled blocks, thus freeing memory for the bigger ones. 
This seems like a big code overhead. In the first Drill implementation we plan to just spill
all the partial blocks (those belonging to spilled partitions) once all the input has been read.
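The planned end-of-input behavior can be sketched in a few lines; the data structures here are illustrative stand-ins, not Drill’s:

```python
# Once all input is consumed, flush every in-memory partial block that
# belongs to an already-spilled partition; partial blocks of
# non-spilled partitions stay in memory and are aggregated in place.

spilled = {2, 5}                       # partitions that have spilled
partial_blocks = {2: ["blk"], 3: ["blk"], 5: ["blk", "blk"]}
disk: list[tuple[int, str]] = []       # (partition, block) pairs "on disk"

def finish_input() -> None:
    for p in sorted(spilled):
        for blk in partial_blocks.pop(p, []):
            disk.append((p, blk))      # spill every partial block

finish_input()
print(disk)            # partitions 2 and 5 flushed; 3 stays in memory
```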

Another problem I just realized (with the Drill architecture in general) is that the execution
is only driven by the next() calls. That makes it harder to benefit from a fast disk scan
of many blocks.
The current plan for Hash Aggregate spill is to read a spilled partition the same way “incoming”
is read - read one batch, do the aggregations, then the next batch, etc.
Any improvement would add complexities (e.g., memory management), so it may be done at
some later time.
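The batch-at-a-time plan above can be sketched as follows; the names mimic the next()-driven protocol but are illustrative, not Drill’s actual operator API:

```python
# A spilled partition is replayed through the same next()-style loop
# as the original incoming stream: read one batch, aggregate it, then
# fetch the next batch.

from collections import defaultdict
from typing import Iterator

Batch = list[tuple[str, int]]          # (group key, value) rows

def next_batch(spilled: list[Batch]) -> Iterator[Batch]:
    """Yields one batch at a time, mimicking incoming.next()."""
    yield from spilled

def aggregate_spilled(spilled: list[Batch]) -> dict[str, int]:
    sums: dict[str, int] = defaultdict(int)
    for batch in next_batch(spilled):  # read one batch ...
        for key, value in batch:       # ... aggregate it ...
            sums[key] += value         # ... then pull the next one
    return dict(sums)

batches = [[("a", 1), ("b", 2)], [("a", 3)]]
print(aggregate_spilled(batches))   # {'a': 4, 'b': 2}
```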

  — Boaz 

> On Jan 17, 2017, at 10:20 PM, Julian Hyde <jhyde@apache.org> wrote:
> I agree that an aggregate tends to compress data more than a join. Joins do compress
data somewhat — when they have a filtering effect — so for both hash-aggregation and hash-join
the size estimate is just an upper bound.
> I also agree that the hash aggregate will fit data into memory in either the first or
second phase. But the “much larger than memory” case is still very important. Think of
what you would do to make an almost-unique list of customer ids into a fully unique list.
> HHJ does not “mix up” sub-partitions, not in a bad way, anyway. By design HHJ uses
as many output partitions as possible (available memory divided by block size, because each
partition needs one block of buffer space to prepare the data to be written), so it would
not be practical to have one output file per sub-partition. There are far more sub-partitions
than could be efficiently written to disk, so HHJ merely collects stats for them. All of the
data in a partition will be read back at once, so the “mixing up” is not harmful.
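The sizing rule Julian describes (one buffer block per output partition, but stats for far more sub-partitions) can be sketched as follows; the numbers and fan-out are assumptions for illustration:

```python
# Each output partition needs one in-memory block as a write buffer,
# so the partition count is bounded by memory // block_size. Per
# sub-partition *stats* cost only a counter, so they can be kept at a
# much finer grain without extra buffers.

MEMORY_MB = 256
BLOCK_MB = 1
FANOUT = 16                     # assumed sub-partitions per partition

num_partitions = MEMORY_MB // BLOCK_MB          # 256 buffered partitions
num_sub_partitions = num_partitions * FANOUT    # 4096 counted, not buffered

# One counter per sub-partition is a few bytes, not a block:
sub_partition_bytes = [0] * num_sub_partitions

def record_row(row_hash: int, row_size: int) -> None:
    sub_partition_bytes[row_hash % num_sub_partitions] += row_size

record_row(12345, 64)
print(num_partitions, num_sub_partitions)   # 256 4096
```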
> HHJ’s I/O is extremely efficient; I don’t believe that it ever does random access
or reads a byte more than once. And due to its “hybrid” nature, some partition blocks
are never completed and written to disk.
> Anyway. I’ve said my piece. Histograms of sub-partition stats are not an essential
part of this algorithm but I wanted to make sure that you were aware of them, because they
are as elegant as bloom filters and b-trees.
> Julian
>> On Jan 16, 2017, at 6:55 PM, Boaz Ben-Zvi <bben-zvi@mapr.com> wrote:
>> The design in the document has an undocumented assumption — allowing the memory available
for an operator to fluctuate during the operator’s lifetime.
>> For example, the Hash Aggregate operator (HAG) may hit the memory limit at 500MB,
then later hit it again at 600MB, etc., and when reading a spilled partition back from disk
the limit may be 400MB, etc.
>> (A gentler scheme may allow memory to increase but never go down; e.g., other operators
may "give back” some of their allocations during execution;
>> e.g., HHJ after finishing the build phase — it can yield the extra memory to another operator.)
>> This is a more sophisticated memory management design than what we have now — a
simple fixed pre-allocation. 
>> A second point is Hash Aggregation — which is a little different from HHJ. For
example, spilling a partition X (in multiple spill iterations) ends up with 500MB on disk,
but only (a fixed) 400MB of memory is available.
>> Does this mean that partition X would not fit into memory ?  No, as X may contain
many duplicate groups (groups were re-created after every spill), so as we read and aggregate
X, its size may “shrink” enough to fit in the 400MB.
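The shrinkage argument can be made concrete with a small sketch; the data is illustrative:

```python
# Each spill run re-creates groups, so the same group key can appear
# once per run on disk; re-aggregating the runs on read-back collapses
# the duplicates, which is why 500MB on disk may fit in 400MB of memory.

from collections import defaultdict

# Three spill runs of partition X; group "a" recurs in every run.
runs = [{"a": 10, "b": 1}, {"a": 5, "c": 2}, {"a": 7, "b": 3}]

merged: dict[str, int] = defaultdict(int)
for run in runs:
    for key, value in run.items():
        merged[key] += value          # duplicates collapse here

rows_on_disk = sum(len(r) for r in runs)   # 6 group entries spilled
rows_in_memory = len(merged)               # 3 unique groups remain
print(rows_on_disk, rows_in_memory)
```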
>> Indeed the design was not trying to avoid “re-writing to disk”; however the assumption
was that this would be a rare situation, and may only apply to a part of the data.
>> If the initial number of partitions N is chosen large enough, then spilling would
happen only once for (some of) the data.
>> For HAG, the partitions should all be of similar sizes, so if N was chosen too small,
they all would need a “secondary spilling”, which would include re-writing all the data
(that’s analogous to “phase 3” ….)
>> HHJ is a little different (due to possible duplicate key rows) — some partitions
may be bigger than others. So maybe only a few HHJ partitions would be written twice.
>> The downside of the “sub-partition” model described in Goetz’ paper is that
a second write of all the data is indeed saved, but all the data (mixed together) needs to
be read (up to) n times (plus some CPU overhead to filter out the needed parts).
>> Reads may be costly, due to seeks (e.g., when a partition was spilled in many iterations
and ended up scattered across the disk.)
>> For the (future) HHJ design — we could handle a “too big” partition using a
“hash loop”; that is, read only a part of the inner partition, then scan the whole outer
matching partition, then read the next inner part, and scan the whole outer partition again.
>> This is indeed costly, but not much different from the “sub-partitioning” scheme
— we just read part of the inner based on size (instead of several parts, like 0.0, 0.1,
and 0.2), and then the whole outer without filtering - the hash table probing would do that.
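The “hash loop” just described can be sketched as follows; the function and its sizes are illustrative, not an actual Drill or HHJ implementation:

```python
# For an oversized inner partition: build a hash table on one
# memory-sized slice of the inner at a time, and re-scan the entire
# matching outer partition for each slice. Probing the hash table does
# the filtering, so the outer is read without any sub-partition filter.

def hash_loop_join(inner: list[tuple[int, str]],
                   outer: list[tuple[int, str]],
                   slice_rows: int) -> list[tuple[int, str, str]]:
    results = []
    for start in range(0, len(inner), slice_rows):
        # Build a hash table on just this slice of the inner side.
        table: dict[int, list[str]] = {}
        for key, payload in inner[start:start + slice_rows]:
            table.setdefault(key, []).append(payload)
        # Re-scan the whole outer partition for this slice.
        for key, payload in outer:
            for match in table.get(key, []):
                results.append((key, match, payload))
    return results

inner = [(1, "i1"), (2, "i2"), (3, "i3"), (4, "i4")]
outer = [(2, "o2"), (4, "o4")]
print(hash_loop_join(inner, outer, slice_rows=2))
```

The cost model is visible in the loop structure: the outer partition is scanned once per inner slice, which is the price paid for bounding the hash table’s memory.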
>>    Thanks for the suggestions and the link; I’ll go over Goetz’ paper again and
look for more ideas.
>>         — Boaz
>>> On Jan 16, 2017, at 4:09 PM, Julian Hyde <jhyde@apache.org> wrote:
>>> Does the data need to be written into a disk-friendly format when a partition
is selected to be written to disk? If you are careful in your choice of format then it doesn’t
need to be re-written. And in fact you can start with the assumption that everything is going
to disk.
>>> One of the most beautiful innovations of the HHJ paper [1] is the histogram-driven
partitioning. Basically, during phase 1 you apply the phase 2 hash function to assign rows
to “sub-partitions”. Partition 0 would contain sub-partitions 0.0, 0.1, … 0.n; partition
1 would contain sub-partitions 1.0, 1.1, …, 1.n. The rows are all mixed together in each partition,
but you know how many rows (and bytes) are in each sub-partition. If partition 0 (or any partition)
ends up larger than memory then you are going to need a phase 3. But you can enter phase 2
armed with some very useful knowledge. You know the sizes of the sub-partitions and you can
choose a hash function in phase 2 such that many of the partitions end up *just* smaller than memory.
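One way to picture the histogram-driven step: phase 1 yields a size per sub-partition, and phase 2 then groups sub-partitions so each resulting partition lands just under the memory budget. In this sketch a first-fit-decreasing bin packing stands in for “choosing a hash function”, and all sizes are illustrative:

```python
# Phase-1 histogram in, phase-2 partition assignment out: pack
# sub-partitions greedily so that each partition fits in memory.

def pack_sub_partitions(sizes_mb: list[int], memory_mb: int) -> list[list[int]]:
    """Group sub-partition indices into partitions, each <= memory_mb."""
    bins: list[tuple[int, list[int]]] = []   # (used_mb, sub-partition ids)
    for sp in sorted(range(len(sizes_mb)), key=lambda i: -sizes_mb[i]):
        for i, (used, members) in enumerate(bins):
            if used + sizes_mb[sp] <= memory_mb:
                bins[i] = (used + sizes_mb[sp], members + [sp])
                break
        else:
            bins.append((sizes_mb[sp], [sp]))
    return [members for _, members in bins]

# Phase-1 histogram: bytes observed in each of 6 sub-partitions.
histogram = [70, 60, 50, 40, 30, 20]
partitions = pack_sub_partitions(histogram, memory_mb=100)
print(partitions)   # each partition sums to <= 100MB
```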
>>> The big problem with external sort and hash algorithms is the huge performance
hit when you require an extra phase. If you need 2 phases, HHJ can convert that to 1.5 phases
(by pulling smaller partitions back into memory) and by optimizing the assignment of rows
to partitions it can turn a 3.1 phase query into a 2.9 phase query - a big win.
>>> Julian
>>> [1] https://pdfs.semanticscholar.org/fc1c/78cbef5062cf49fdb309b1935af08b759d2d.pdf
>>>> On Jan 14, 2017, at 7:34 AM, Boaz Ben-Zvi <bben-zvi@mapr.com> wrote:
>>>> Sorry for no attachment (Apache mail rules) -- Here is a link to the document:
>>>> DrillSpillmemoryforHashAggregation.pdf - https://drive.google.com/file/d/0ByUg32jfEW16ajNiQlVRczhPTjA/view?usp=sharing
>>>> -- Boaz
>>>> ________________________________
>>>> From: Julian Hyde <jhyde@apache.org>
>>>> Sent: Friday, January 13, 2017 11:00 PM
>>>> To: dev@drill.apache.org
>>>> Subject: Re: Drill: Memory Spilling for the Hash Aggregate Operator
>>>> The attachment didn't come through. I'm hoping that you settled on a "hybrid"
hash algorithm that can write to disk, or write to memory, and the cost of discovering that
is wrong is not too great. With Goetz Graefe's hybrid hash join (which can be easily adapted
to hybrid hash aggregate) if the input ALMOST fits in memory you could process most of it
in memory, then revisit the stuff you spilled to disk.
>>>>> On Jan 13, 2017, at 7:46 PM, Boaz Ben-Zvi <bben-zvi@mapr.com> wrote:
>>>>> Hi Drill developers,
>>>>>  Attached is a document describing the design for memory spilling implementation
for the Hash Aggregate operator.
>>>>>  Please send me any comments or questions,
>>>>>     -- Boaz
