flink-user mailing list archives

Site index · List index
Message view « Date » · « Thread »
Top « Date » · « Thread »
From Fabian Hueske <fhue...@gmail.com>
Subject Re: Hash tables - joins, cogroup, deltaIteration
Date Tue, 19 Apr 2016 10:01:30 GMT
Hi Ovidiu,

Hash tables are currently used for joins (inner & outer) and the solution
set of delta iterations.
There is a pending PR that implements a hash table for partial aggregations
(combiner) [1] which should be added soon.

Joins (inner & outer) are already implemented as Hybrid Hash joins that go
to disk if memory is insufficient.
The hash table used for solution sets does not support spilling. In case of
not enough memory, writing and reading data to / from disk in *each*
iteration would cause a significant performance hit, that possibly
outweighs the benefits of delta iterations. The recommended solution for
this is to add more machines / memory to the cluster.
I am not aware of anybody working on addressing the issue of non-spillable
hash tables of solution sets.

Best,
Fabian

[1] https://github.com/apache/flink/pull/1517

2016-04-18 21:47 GMT+02:00 Ovidiu-Cristian MARCU <
ovidiu-cristian.marcu@inria.fr>:

> Hi,
>
> Can you please confirm if there is any update regarding the hash tables
> use cases, as in [1] it is specified that *Hash tables are used in Joins
> and for the Solution set in iterations (pending work to use them for
> grouping/aggregations)?*
>
> I am interested in the pending work progress and also if you consider to
> add an implementation where Joins and Solution Set in delta iterations (and
> CoGroup) can rely on a hybrid implementation where the engine can use also
> disk if not enough memory available when working with these hash tables.
>
> [1]
> https://cwiki.apache.org/confluence/pages/viewpage.action?pageId=53741525 Memory
> Management (Batch API)
>
> Thanks
>
> Best,
> Ovidiu
>

Mime
View raw message