Return-Path: X-Original-To: archive-asf-public-internal@cust-asf2.ponee.io Delivered-To: archive-asf-public-internal@cust-asf2.ponee.io Received: from cust-asf.ponee.io (cust-asf.ponee.io [163.172.22.183]) by cust-asf2.ponee.io (Postfix) with ESMTP id CAC70200C25 for ; Fri, 24 Feb 2017 15:08:33 +0100 (CET) Received: by cust-asf.ponee.io (Postfix) id C967D160B69; Fri, 24 Feb 2017 14:08:33 +0000 (UTC) Delivered-To: archive-asf-public@cust-asf.ponee.io Received: from mail.apache.org (hermes.apache.org [140.211.11.3]) by cust-asf.ponee.io (Postfix) with SMTP id EE915160B5C for ; Fri, 24 Feb 2017 15:08:32 +0100 (CET) Received: (qmail 52682 invoked by uid 500); 24 Feb 2017 14:08:32 -0000 Mailing-List: contact user-help@flink.apache.org; run by ezmlm Precedence: bulk List-Help: List-Unsubscribe: List-Post: List-Id: Reply-To: user@flink.apache.org Delivered-To: mailing list user@flink.apache.org Received: (qmail 52672 invoked by uid 99); 24 Feb 2017 14:08:31 -0000 Received: from pnap-us-west-generic-nat.apache.org (HELO spamd2-us-west.apache.org) (209.188.14.142) by apache.org (qpsmtpd/0.29) with ESMTP; Fri, 24 Feb 2017 14:08:31 +0000 Received: from localhost (localhost [127.0.0.1]) by spamd2-us-west.apache.org (ASF Mail Server at spamd2-us-west.apache.org) with ESMTP id 8CDAB1A00B6 for ; Fri, 24 Feb 2017 14:08:31 +0000 (UTC) X-Virus-Scanned: Debian amavisd-new at spamd2-us-west.apache.org X-Spam-Flag: NO X-Spam-Score: 3.133 X-Spam-Level: *** X-Spam-Status: No, score=3.133 tagged_above=-999 required=6.31 tests=[DKIM_SIGNED=0.1, DKIM_VALID=-0.1, HTML_MESSAGE=2, RCVD_IN_DNSWL_NONE=-0.0001, RCVD_IN_MSPIKE_H3=-0.01, RCVD_IN_MSPIKE_WL=-0.01, RCVD_IN_SORBS_SPAM=0.5, SPF_NEUTRAL=0.652, URIBL_BLOCKED=0.001] autolearn=disabled Authentication-Results: spamd2-us-west.apache.org (amavisd-new); dkim=pass (2048-bit key) header.d=kpibench-com.20150623.gappssmtp.com Received: from mx1-lw-us.apache.org ([10.40.0.8]) by localhost (spamd2-us-west.apache.org [10.40.0.9]) (amavisd-new, port 10024) with ESMTP id L16vP90ohPZf for ; Fri, 24 Feb 2017 14:08:29 +0000 (UTC) Received: from mail-lf0-f47.google.com (mail-lf0-f47.google.com [209.85.215.47]) by mx1-lw-us.apache.org (ASF Mail Server at mx1-lw-us.apache.org) with ESMTPS id 180915F647 for ; Fri, 24 Feb 2017 14:08:29 +0000 (UTC) Received: by mail-lf0-f47.google.com with SMTP id l12so9285105lfe.0 for ; Fri, 24 Feb 2017 06:08:28 -0800 (PST) DKIM-Signature: v=1; a=rsa-sha256; c=relaxed/relaxed; d=kpibench-com.20150623.gappssmtp.com; s=20150623; h=mime-version:in-reply-to:references:from:date:message-id:subject:to; bh=LmEqLqRYqrTlboyCbSVniUCng8chP/0KiMu8RrB19RE=; b=oDUZ9ETPPiLvtUXt+GrOvMP4o9f39MJFPpdmrX3lSTtO834Uf4eB5WQA2rljkbvqCq JKCrz7CgGNugJ89gE/vbuFmx6AbE/XFWkpxJgsCJ1R9i1c5HZ4E8XRTdMDqXWX9Puim5 /6KK3Ojse5AAdh6WvWItcgtJ8jtpgLZz1idfiTimXEj0ojsjXEsdgxPT4ve/toL0jpMa WzlnNenPa5OFlFRfOw1WRkOejjijCE7G64RwxcG5Ajt/kSHaMELm7wL8ukkm2H0lBNLL aTvgCXfVSmEWZghl1CSq/j40d5OfoGFSSct39jm4sVqmSQWXeLYNzbe4CJrMn+yXYUy7 GJ3A== X-Google-DKIM-Signature: v=1; a=rsa-sha256; c=relaxed/relaxed; d=1e100.net; s=20161025; h=x-gm-message-state:mime-version:in-reply-to:references:from:date :message-id:subject:to; bh=LmEqLqRYqrTlboyCbSVniUCng8chP/0KiMu8RrB19RE=; b=Kjuvh1O1krroy+a54dLcze4aeKjTq8FtWVec4cVM2TX1lKqhoHtLT/9s+Uadx8PQdo 8McfNfWDNU/1t8bkPMAYUtrKcswEQazwt/hE2u4K4knr/drYou4oB9Wpx8DDl6nZO+B/ DxPrPuT6GlWfDPsmS7pv7maAVrarDT99VpHMDUbr8IQ9m5rhh3CxPBoQtJpAjDWCik0n Lys0FxW/OSHsQK4Tn4dE8s22kpIvaI0YhaPyDM3zDulF5LZhygNkdSL2EpD3ZaNyL/zI b2C5nf2y2AFfW5NBZiScoR+IOqMXqopowhUMdAttwYWXevKOCE4cLoXTcb4r/CWaLFsh kO8Q== X-Gm-Message-State: AMke39nwJfCmMD4izf4M0a7RV94UldVkHyU+fD6JawJuauaQRUZwZbLrfK4mx0vy2v/tbyCkOYOlkFT4pw+0Xw== X-Received: by 10.25.21.214 with SMTP id 83mr921696lfv.66.1487945307520; Fri, 24 Feb 2017 06:08:27 -0800 (PST) MIME-Version: 1.0 Received: by 10.25.211.207 with HTTP; Fri, 24 Feb 2017 06:08:26 -0800 (PST) X-Originating-IP: [81.10.146.108] In-Reply-To: References: From: Patrick Brunmayr Date: Fri, 24 Feb 2017 15:08:26 +0100 Message-ID: Subject: Re: Difference between partition and groupBy To: user@flink.apache.org Content-Type: multipart/alternative; boundary=001a114076b080164805494743d4 archived-at: Fri, 24 Feb 2017 14:08:34 -0000 --001a114076b080164805494743d4 Content-Type: text/plain; charset=UTF-8 Thank you for that answer. Helped me a lot 2017-02-23 22:10 GMT+01:00 Fabian Hueske : > Hi Patrick, > > as Robert said, partitionBy() shuffles the data such that all records with > the same key end up in the same partition. That's all it does. > groupBy() also prepares the data in each partition to be processed per > key. For example, if you run a groupReduce after a groupBy(), the data is > first shuffled (just like partitionBy()) and then in each partition sorted > to organize it by key. So groupBy() does more than partitionBy() because it > organizes the data in each partition to be processed by key. > > Moreover, groupBy() alone is not a complete operation but just "prepares" > a following operation. It must be called with a reduce or combine operator. > In contrast partitionBy() is by itself complete. > So the difference between partitionBy() and groupBy() is more than just an > API thing. > > Hope that helps, > Fabian > > 2017-02-23 21:51 GMT+01:00 Robert Metzger : > >> Hi Patrick, >> >> I think (but I'm not 100% sure) its not a difference in what the engine >> does in the end, its more of an API thing. When you are grouping, you can >> perform operations such as reducing afterwards. >> On a partitioned dataset, you can do stuff like processing each partition >> in parallel, or sort them. >> >> The parallelism is independent of the partitioning or grouping. Usually >> there are more partitions than parallel instances, so each instance will >> take care of multiple partitions. >> >> >> >> On Thu, Feb 23, 2017 at 6:16 PM, Patrick Brunmayr >> wrote: >> >>> What is the basic difference between partitioning datasets by key or >>> grouping them by key ? >>> >>> Does it make a difference in terms of paralellism ? >>> >>> Thx >>> >> >> > --001a114076b080164805494743d4 Content-Type: text/html; charset=UTF-8 Content-Transfer-Encoding: quoted-printable
Thank you for that answer. Helped me a lot

2017-02-23 22:10 GMT+01:00 = Fabian Hueske <fhueske@gmail.com>:
Hi Patrick,

as Robert= said, partitionBy() shuffles the data such that all records with the same = key end up in the same partition. That's all it does.
groupBy(= ) also prepares the data in each partition to be processed per key. For exa= mple, if you run a groupReduce after a groupBy(), the data is first shuffle= d (just like partitionBy()) and then in each partition sorted to organize i= t by key. So groupBy() does more than partitionBy() because it organizes th= e data in each partition to be processed by key.

Moreover= , groupBy() alone is not a complete operation but just "prepares"= a following operation. It must be called with a reduce or combine operator= . In contrast partitionBy() is by itself complete.
So the difference between partitionBy() and groupBy() is more than just an= API thing.

Hope that helps,
Fabian
=
2017-02-23 21:51 GMT+01:00 Robert Metzger <rmetzger@apache.org>:
Hi Patrick,

I think (but I'm not 100% = sure) its not a difference in what the engine does in the end, its more of = an API thing. When you are grouping, you can perform operations such as red= ucing afterwards.
On a partitioned dataset, you can do stuff like= processing each partition in parallel, or sort them.

<= div>The parallelism is independent of the partitioning or grouping. Usually= there are more partitions than parallel instances, so each instance will t= ake care of multiple partitions.


<= div class=3D"m_-4269627110185800565HOEnZb">

On Thu, F= eb 23, 2017 at 6:16 PM, Patrick Brunmayr <jay@kpibench.com> w= rote:
What is the basic = difference between partitioning datasets by key or grouping them by key ?
Does it make a difference in terms of paralellism ?
=

Thx



--001a114076b080164805494743d4--