Date: Tue, 3 Apr 2018 07:56:46 -0700 (MST)
From: "au.fp2018" <au.fp2018@gmail.com>
To: user@flink.apache.org
Message-ID: <1522767406819-0.post@n4.nabble.com>
In-Reply-To: <62071ad1-1b83-c4f3-fec6-7dfc0273bb20@apache.org>
References: <1522715453815-0.post@n4.nabble.com> <C9D4409A-AFE3-4DC8-9257-AF3BB61BA53B@gmail.com> <62071ad1-1b83-c4f3-fec6-7dfc0273bb20@apache.org>
Subject: Re: Multiple (non-consecutive) keyBy operators in a dataflow

Thanks Timo/LiYue, your responses were helpful.

I was worried about the network shuffle with the second keyBy. The first
keyBy is indeed spreading the load evenly across the nodes. As I mentioned,
my concern was about the amount of state in each key. Maybe I am trying to
optimize prematurely here.

My follow-up question is: how much state per key is considered big enough to
cause performance overhead? If I am within this limit after the first
keyBy, I wouldn't need the second keyBy and could thus avoid the network
shuffle.

Thanks,
Arun


Timo Walther wrote
> Hi Andre,
>
> every keyBy is a shuffle over the network and thus introduces some
> overhead, especially the serialization of records between operators
> when object reuse is disabled (the default). If you think that the slots
> (and thus the nodes) are not evenly occupied by the first keyBy operation
> (e.g. if your key space is just 2 values), then it makes sense to have a
> second keyBy to do the heavy computation on the more granular key to
> have as much parallelism as possible. It really depends on your job.
>
> I hope this helps.
>
> Regards,
> Timo
>
>
> On 03.04.18 at 03:22, 李玥 wrote:
>> Hello,
>> In my opinion, it would be meaningful only in this situation:
>> 1. The total size of all your state is large enough, e.g. 1 GB+.
>> 2. Splitting your job into multiple keyBy steps would reduce the size
>> of your state.
>>
>> This is because the operation of saving state is synchronous, and all
>> working threads are blocked until it has finished.
>> Our team is trying to make the process of saving state asynchronous;
>> please refer to:
>> http://apache-flink-user-mailing-list-archive.2336050.n4.nabble.com/Slow-flink-checkpoint-td18946.html
>>
>> LiYue
>> http://tig.jd.com
>> liyue2008@
>>
>>> On April 3, 2018, at 8:30 AM, au.fp2018 <au.fp2018@> wrote:
>>>
>>> Hello Flink Community,
>>>
>>> I am relatively new to Flink. In the project I am currently working on,
>>> I have a dataflow with a keyBy() operator, which I want to convert to a
>>> dataflow with multiple keyBy() operators, like this:
>>>
>>>
>>>  Source -->
>>>  KeyBy() -->
>>>  Stateful process() function that generates a more granular key -->
>>>  KeyBy(<id generated in the previous step>) -->
>>>  More stateful computation(s) -->
>>>  Sink
>>>
>>> Are there any downsides to this approach?
>>> My reasoning behind the second keyBy() is to reduce the amount of state
>>> and hence improve the processing speed.
>>>
>>> Thanks,
>>> Andre
>>>
>>>
>>>
>>>
>>> --
>>> Sent from: http://apache-flink-user-mailing-list-archive.2336050.n4.nabble.com/
>>
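
For reference, here is a rough sketch of Timo's point in the quoted reply that a key space of just 2 values cannot occupy all slots, while a more granular key can. It is plain Java, not Flink code. The `subtaskFor` helper and its hash-modulo routing are assumptions made for illustration only (Flink itself routes keys through key groups), and the parallelism of 4 and the key names "A", "B", "A-0", ... are made up.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;
import java.util.Set;
import java.util.TreeSet;

public class KeySpread {
    // Hypothetical stand-in for key routing: hash of the key modulo the parallelism.
    static int subtaskFor(String key, int parallelism) {
        return Math.floorMod(key.hashCode(), parallelism);
    }

    // Returns the distinct subtasks that receive at least one of the given keys.
    static Set<Integer> busySubtasks(Iterable<String> keys, int parallelism) {
        Set<Integer> busy = new TreeSet<>();
        for (String key : keys) {
            busy.add(subtaskFor(key, parallelism));
        }
        return busy;
    }

    // Derives a more granular key space: each coarse key is split into perKey sub-keys.
    static List<String> granularKeys(List<String> coarseKeys, int perKey) {
        List<String> keys = new ArrayList<>();
        for (String coarse : coarseKeys) {
            for (int i = 0; i < perKey; i++) {
                keys.add(coarse + "-" + i);
            }
        }
        return keys;
    }

    public static void main(String[] args) {
        int parallelism = 4;
        List<String> coarse = Arrays.asList("A", "B");

        // Two keys can keep at most two of the four subtasks busy.
        System.out.println("coarse:   " + busySubtasks(coarse, parallelism));
        // prints "coarse:   [1, 2]"

        // Fifty sub-keys per coarse key reach every subtask.
        System.out.println("granular: " + busySubtasks(granularKeys(coarse, 50), parallelism));
        // prints "granular: [0, 1, 2, 3]"
    }
}
```

The sketch only covers how keys spread over subtasks. It says nothing about how much state per key is too much, which is the open question in this message.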


--
Sent from: http://apache-flink-user-mailing-list-archive.2336050.n4.nabble.com/