From: Stephan Ewen
To: dev@flink.apache.org
Subject: Re: broadcast set size
Date: Thu, 9 Apr 2015 19:47:01 +0200

Hi Martin!

You can use a broadcast join for that as well. You use it exactly like the
usual join, but you write "joinWithTiny" or "joinWithHuge", depending on
whether the data set passed as the argument is the small one or the large
one.

Internally, the broadcast join also broadcasts the small data set (like
broadcast sets do), but it keeps the data in managed memory, which has the
benefit that it can spill to disk if needed.

BTW: In many cases (for example, if you join two files, a small one and a
large one), the Flink optimizer will pick the broadcast join automatically.
That currently depends on whether the client can access HDFS and gather
statistics, and on what kind of functions you use after the data sources
and before the join (whether the size estimates are still good or already
fuzzy).

If you want to figure out which strategy is used, have a look at the
execution plan (either dump the JSON and load it into the HTML file, or use
the web client).

Greetings,
Stephan

On Thu, Apr 9, 2015 at 5:36 PM, Martin Neumann wrote:

> Hej,
>
> Up to what sizes are broadcast sets a good idea?
>
> I have a large dataset (~5 GB) and I'm only interested in lines with a
> certain ID that I have in a file. The file has ~10 k entries.
> I could either join the dataset with the ID list or I could broadcast the
> ID list and do the filtering in a Mapper.
>
> What would be the better solution given the data sizes described above?
> Is there a good rule of thumb for when to switch from one solution to the
> other?
>
> cheers Martin
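
For readers comparing the two options discussed in this thread, here is a minimal plain-Python sketch of what each one does conceptually. It is not Flink code and does not use the Flink API: the function names, the `(id, payload)` record layout, and the sample data are all hypothetical and chosen only for illustration. The first function mirrors "broadcast the ID list and filter in a Mapper"; the second mirrors a broadcast hash join, where the small side is built into a hash table and the large side is streamed past it.

```python
from typing import Dict, Iterable, Iterator, List, Tuple

# Hypothetical record layout: (id, payload). Not taken from the thread.
Record = Tuple[str, str]
IdRow = Tuple[str]


def filter_with_broadcast_set(records: Iterable[Record],
                              ids: Iterable[str]) -> Iterator[Record]:
    """Option 1: every worker receives the full ID list and filters locally."""
    wanted = set(ids)  # must fit in each worker's memory
    return (rec for rec in records if rec[0] in wanted)


def broadcast_hash_join(records: Iterable[Record],
                        id_rows: Iterable[IdRow]) -> Iterator[Tuple[Record, IdRow]]:
    """Option 2: build a hash table from the small side, probe with the large side."""
    build: Dict[str, List[IdRow]] = {}
    for row in id_rows:
        build.setdefault(row[0], []).append(row)
    for rec in records:  # the large side is only streamed, never held in full
        for match in build.get(rec[0], ()):
            yield rec, match


# Tiny stand-ins for the ~5 GB dataset and the ~10 k ID file.
large = [("a", "line 1"), ("b", "line 2"), ("a", "line 3"), ("c", "line 4")]
ids = ["a", "c"]

filtered = list(filter_with_broadcast_set(large, ids))
joined = [rec for rec, _ in broadcast_hash_join(large, [(i,) for i in ids])]
```

With unique IDs on the small side, both variants select the same records. The practical difference described in the reply is where the small side lives: this sketch keeps it in an ordinary in-memory structure in both cases, whereas the broadcast join is said to keep it in managed memory that can spill to disk.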