From: Urs Schoenenberger
To: user@flink.apache.org
Subject: DataSet: CombineHint heuristics
Message-ID: <4607fb76-8825-0be0-45cb-1aba547c9e31@tngtech.com>
Date: Thu, 31 Aug 2017 13:41:52 +0200

Hi all,

I was wondering about the heuristics for CombineHint:

Flink uses SORT by default, but the doc for HASH says that we should expect it to be faster if the number of keys is less than 1/10th of the number of records.

HASH should be faster if it is able to combine a lot of records, which happens if multiple events for the same key are present in a data chunk *that fits into a combine-hashtable* (cf. handling in ReduceCombineDriver.java).

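To make the chunk-size effect concrete, here is a rough back-of-the-envelope sketch. It is not taken from Flink; the class and method names are made up for illustration, and it assumes keys are uniformly distributed and that each chunk is combined independently (real data is often skewed, which would help the combiner). Under those assumptions, a chunk of m records drawn from K keys is expected to contain K * (1 - (1 - 1/K)^m) distinct keys, and only the remainder can be combined away.

```java
// Hypothetical helper, not part of Flink: estimates how much a bounded
// combiner can shrink one chunk, assuming uniformly distributed keys.
class CombineEstimate {

    /**
     * Expected number of distinct keys among chunkSize records drawn
     * uniformly at random from numKeys keys: K * (1 - (1 - 1/K)^m).
     */
    static double expectedDistinctKeys(double numKeys, double chunkSize) {
        // log1p/expm1 keep precision when 1/numKeys is tiny.
        return -numKeys * Math.expm1(chunkSize * Math.log1p(-1.0 / numKeys));
    }

    /** Expected fraction of one chunk's records that combining eliminates. */
    static double combinedFraction(double numKeys, double chunkSize) {
        return 1.0 - expectedDistinctKeys(numKeys, chunkSize) / chunkSize;
    }

    public static void main(String[] args) {
        double numKeys = 1e8; // illustrative figure: 100 million keys
        for (double chunkSize : new double[] {1e6, 1e7, 1e8}) {
            System.out.printf("chunk=%.0e  combined=%.1f%%%n",
                    chunkSize, 100 * combinedFraction(numKeys, chunkSize));
        }
    }
}
```

With 100 million keys and chunks of 1 million records, this estimate says only about 0.5% of the records in a chunk get combined, regardless of how many records there are in total.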
Now, if I have 10 billion events and 100 million keys, but only about 1 million records fit into a hashtable, the number of matches may be extremely low, so very few events are getting combined (of course, this is similar for SORT, as the sorter's memory is bounded, too).

Am I correct in assuming that the actual tradeoff is not only based on the ratio of #total records/#keys, but also on #total records/#records that fit into a single Sorter/Hashtable?

Thanks,
Urs

--
Urs Schönenberger - urs.schoenenberger@tngtech.com

TNG Technology Consulting GmbH, Betastr. 13a, 85774 Unterföhring
Managing directors: Henrik Klagges, Dr. Robert Dahlke, Gerhard Müller
Registered office: Unterföhring * Amtsgericht München * HRB 135082