Date: Tue, 3 Apr 2018 07:56:46 -0700 (MST)
From: "au.fp2018"
To: user@flink.apache.org
Message-ID: <1522767406819-0.post@n4.nabble.com>
In-Reply-To: <62071ad1-1b83-c4f3-fec6-7dfc0273bb20@apache.org>
References: <1522715453815-0.post@n4.nabble.com> <62071ad1-1b83-c4f3-fec6-7dfc0273bb20@apache.org>
Subject: Re: Multiple (non-consecutive) keyBy operators in a dataflow

Thanks Timo/LiYue, your responses were helpful.

I was worried about the network shuffle with the second keyBy. The first keyBy is indeed spreading the load evenly across the nodes. As I mentioned, my concern was around the amount of state in each key. Maybe I am trying to optimize prematurely here.

My follow-up question is: how much state per key is considered big enough to cause performance overhead? If I am within this limit after the first keyBy, I wouldn't need the second keyBy and could thus avoid the network shuffle.

Thanks,
Arun

Timo Walther wrote
> Hi Andre,
>
> every keyBy is a shuffle over the network and thus introduces some
> overhead, especially serialization of records between operators if object
> reuse is disabled (the default). If you think that not all slots (and thus
> not all nodes) are occupied evenly by the first keyBy operation
> (e.g. if your key space is just 2 values), then it makes sense to have a
> second keyBy to do the heavy computation on the more granular key, to
> have as much parallelism as possible. It really depends on your job.
>
> I hope this helps.
>
> Regards,
> Timo
>
>
> On 03.04.18 at 03:22, 李玥 wrote:
>> Hello,
>> In my opinion, it would be meaningful only in this situation:
>> 1. The total size of all your state is large enough, e.g. 1GB+.
>> 2. Splitting your job into multiple KeyBy processes would reduce the size
>> of your state.
>>
>> Because the operation of saving state is synchronous, all working
>> threads are blocked until the save has finished.
>> Our team is trying to make the process of saving state async; please
>> refer to:
>> http://apache-flink-user-mailing-list-archive.2336050.n4.nabble.com/Slow-flink-checkpoint-td18946.html
>>
>> LiYue
>> http://tig.jd.com
>> liyue2008@
>>
>>
>>> On April 3, 2018, at 8:30 AM, au.fp2018 <au.fp2018@> wrote:
>>>
>>> Hello Flink Community,
>>>
>>> I am relatively new to Flink. In the project I am currently working
>>> on, I have a dataflow with a keyBy() operator, which I want to convert
>>> to a dataflow with multiple keyBy() operators like this:
>>>
>>>
>>>  Source -->
>>>  KeyBy() -->
>>>  Stateful process() function that generates a more granular key -->
>>>  KeyBy( ) -->
>>>  More stateful computation(s) -->
>>>  Sink
>>>
>>> Are there any downsides to this approach?
>>> My reasoning behind the second keyBy() is to reduce the amount of
>>> state and hence improve the processing speed.
>>>
>>> Thanks,
>>> Andre
>>>
>>>
>>> --
>>> Sent from:
>>> http://apache-flink-user-mailing-list-archive.2336050.n4.nabble.com/
>>

--
Sent from: http://apache-flink-user-mailing-list-archive.2336050.n4.nabble.com/
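
For readers of the archive, the two-stage dataflow discussed in this thread can be modelled in plain Java, without Flink on the classpath. This is only a sketch under stated assumptions: the `Event` class, the `"/"`-joined granular key, and the `subtaskFor` routing (hash modulo parallelism) are all hypothetical stand-ins and not Flink's API or its real key-to-subtask assignment. It shows two things from the thread: state is scoped per key at each stage, and a key space with only 2 values can keep at most 2 subtasks busy regardless of parallelism.

```java
import java.util.Arrays;
import java.util.HashMap;
import java.util.HashSet;
import java.util.List;
import java.util.Map;
import java.util.Set;

public class TwoStageKeyBy {

    /** Hypothetical input record: a coarse key plus a detail field used to refine it. */
    public static final class Event {
        final String coarseKey;
        final String detail;
        final long value;

        public Event(String coarseKey, String detail, long value) {
            this.coarseKey = coarseKey;
            this.detail = detail;
            this.value = value;
        }
    }

    /** Simplified stand-in for keyBy routing (an assumption, not Flink's real assignment). */
    public static int subtaskFor(String key, int parallelism) {
        return Math.floorMod(key.hashCode(), parallelism);
    }

    /** Number of distinct subtasks that receive at least one of the given keys. */
    public static int busySubtasks(Set<String> keys, int parallelism) {
        Set<Integer> busy = new HashSet<>();
        for (String key : keys) {
            busy.add(subtaskFor(key, parallelism));
        }
        return busy.size();
    }

    /**
     * Runs both stages over the events. stageOneState receives the per-coarse-key
     * event counts; the returned map holds the per-granular-key running sums.
     */
    public static Map<String, Long> run(List<Event> events, Map<String, Long> stageOneState) {
        Map<String, Long> stageTwoState = new HashMap<>();
        for (Event e : events) {
            // Stage 1 (after the first keyBy): state is scoped to the coarse key.
            stageOneState.merge(e.coarseKey, 1L, Long::sum);
            // Stage 1 emits a more granular key for the second keyBy.
            String granularKey = e.coarseKey + "/" + e.detail;
            // Stage 2 (after the second keyBy): state is scoped to the granular key.
            stageTwoState.merge(granularKey, e.value, Long::sum);
        }
        return stageTwoState;
    }

    public static void main(String[] args) {
        List<Event> events = Arrays.asList(
            new Event("a", "x", 1), new Event("a", "y", 2),
            new Event("a", "x", 3), new Event("b", "z", 4));
        Map<String, Long> stageOneState = new HashMap<>();
        Map<String, Long> stageTwoState = run(events, stageOneState);

        int parallelism = 4;
        System.out.println("stage 1 state: " + stageOneState);
        System.out.println("stage 2 state: " + stageTwoState);
        System.out.println("busy subtasks after first keyBy:  "
            + busySubtasks(stageOneState.keySet(), parallelism));
        System.out.println("busy subtasks after second keyBy: "
            + busySubtasks(stageTwoState.keySet(), parallelism));
    }
}
```

In a real job each stage would run on separate parallel subtasks, and every record would be serialized and shipped over the network between them, which is the overhead Timo describes. The model leaves that cost out, so it says nothing about whether the second keyBy pays off for a given workload.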