flink-user mailing list archives

From Gábor Gévay <gga...@gmail.com>
Subject Re: DataSet: CombineHint heuristics
Date Tue, 05 Sep 2017 12:32:19 GMT
Hi Urs,

Yes, the 1/10th ratio is just a very loose rule of thumb. I would
suggest trying both the SORT and HASH strategies with a workload that
is as similar as possible to your production workload (similar data,
similar parallelism, etc.) and seeing which one is faster for your
specific use case.
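For reference, the DataSet API lets you select the strategy per reduce
via setCombineHint, so the same job can be run once with each strategy
and timed (a sketch of the job wiring, assuming Flink 1.1+ and an
illustrative Tuple2<String, Long> input):

```java
import org.apache.flink.api.common.operators.base.ReduceOperatorBase.CombineHint;
import org.apache.flink.api.java.DataSet;
import org.apache.flink.api.java.tuple.Tuple2;

// Illustrative: "input" is some DataSet<Tuple2<String, Long>>.
DataSet<Tuple2<String, Long>> combined = input
    .groupBy(0)
    .reduce((a, b) -> new Tuple2<>(a.f0, a.f1 + b.f1))
    // Swap between CombineHint.HASH and CombineHint.SORT to compare;
    // the default is OPTIMIZER_CHOOSES (currently SORT).
    .setCombineHint(CombineHint.HASH);
```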

An important difference between the HASH and SORT strategies is that
the sorting combiner stores the original input records, while the hash
combiner stores only combined records. That is, when an input record
arrives whose key is already in the hash table, it won't consume
additional memory, because it is combined right away. Therefore, for
example, if you would like your combiner not to emit any records
prematurely (i.e., combine everything possible without running out of
memory), then with the SORT strategy you need combiner memory
proportional to your input size, while with the HASH strategy you need
combiner memory proportional only to the number of keys.
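This memory difference can be illustrated with a small self-contained
model (this is not Flink code, just a sketch of the bookkeeping each
strategy does before its first flush):

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

// Toy model: how many in-memory entries does each combiner strategy
// need to combine everything without spilling?
public class CombinerMemoryModel {

    // SORT strategy: the combiner buffers every input record before
    // sorting, so memory grows with the input size.
    static int sortCombinerEntries(int[] keys) {
        List<Integer> buffer = new ArrayList<>();
        for (int k : keys) buffer.add(k); // one entry per input record
        return buffer.size();
    }

    // HASH strategy: a record whose key is already present is combined
    // immediately, so memory grows only with the number of distinct keys.
    static int hashCombinerEntries(int[] keys) {
        Map<Integer, Integer> table = new HashMap<>();
        for (int k : keys) table.merge(k, 1, Integer::sum);
        return table.size(); // one entry per distinct key
    }

    public static void main(String[] args) {
        // 12 records over 3 distinct keys
        int[] keys = {1, 2, 3, 1, 2, 3, 1, 2, 3, 1, 2, 3};
        System.out.println(sortCombinerEntries(keys)); // 12
        System.out.println(hashCombinerEntries(keys)); // 3
    }
}
```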

You are correct in that the performance depends very much on how many
records fit into a single Sorter/Hashtable. However, I put
#keys/#total records into the documentation because it is easier for a
user to estimate, and a small ratio correlates with the HASH strategy
being faster, as explained above.
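To see why the table capacity matters as well, here is a toy
simulation of a capacity-bounded hash combiner, loosely modeled on
ReduceCombineDriver's flush-when-full behavior (the flush policy and
numbers are illustrative, not Flink's exact logic):

```java
import java.util.LinkedHashMap;
import java.util.Map;

// Toy model of a capacity-bounded hash combiner: when the table is
// full and an unseen key arrives, the table is flushed downstream and
// cleared. Returns how many records the combiner emits in total;
// fewer emitted records means more combining happened.
public class BoundedHashCombiner {

    static int emittedRecords(int[] keys, int capacity) {
        Map<Integer, Integer> table = new LinkedHashMap<>();
        int emitted = 0;
        for (int k : keys) {
            if (!table.containsKey(k) && table.size() == capacity) {
                emitted += table.size(); // flush: emit one record per key
                table.clear();
            }
            table.merge(k, 1, Integer::sum); // combine in place
        }
        return emitted + table.size(); // final flush
    }

    public static void main(String[] args) {
        int n = 30, numKeys = 5;
        int[] keys = new int[n];
        for (int i = 0; i < n; i++) keys[i] = i % numKeys; // cycling keys

        // Table large enough for all 5 keys: everything combines.
        System.out.println(emittedRecords(keys, 5)); // 5
        // Table smaller than the key set: with cycling keys, a record
        // never meets its key in the table, so nothing combines.
        System.out.println(emittedRecords(keys, 3)); // 30
    }
}
```

With a capacity of 5 the combiner emits only 5 records for 30 inputs,
while with a capacity of 3 it emits all 30 uncombined, which is the
effect Urs describes below.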


On Thu, Aug 31, 2017 at 4:02 PM, Aljoscha Krettek <aljoscha@apache.org> wrote:
> Hi,
> I would say that your assumption is correct and that the COMBINE strategy does in fact
> also depend on the ratio "#total records/#records that fit into a single Sorter/Hashtable".
> I'm CC'ing Fabian, just to be sure. He knows that stuff better than I do.
> Best,
> Aljoscha
>> On 31. Aug 2017, at 13:41, Urs Schoenenberger <urs.schoenenberger@tngtech.com> wrote:
>> Hi all,
>> I was wondering about the heuristics for CombineHint:
>> Flink uses SORT by default, but the doc for HASH says that we should
>> expect it to be faster if the number of keys is less than 1/10th of the
>> number of records.
>> HASH should be faster if it is able to combine a lot of records, which
>> happens if multiple events for the same key are present in a data chunk
>> *that fits into a combine-hashtable* (cf. the handling in
>> ReduceCombineDriver.java).
>> Now, if I have 10 billion events and 100 million keys, but only about 1
>> million records fit into a hashtable, the number of matches may be
>> extremely low, so very few events are getting combined (of course, this
>> is similar for SORT as the sorter's memory is bounded, too).
>> Am I correct in assuming that the actual tradeoff is not only based on
>> the ratio of #total records/#keys, but also on #total records/#records
>> that fit into a single Sorter/Hashtable?
>> Thanks,
>> Urs
>> --
>> Urs Schönenberger - urs.schoenenberger@tngtech.com
>> TNG Technology Consulting GmbH, Betastr. 13a, 85774 Unterföhring
>> Geschäftsführer: Henrik Klagges, Dr. Robert Dahlke, Gerhard Müller
>> Sitz: Unterföhring * Amtsgericht München * HRB 135082
