Mailing-List: contact user-help@flink.apache.org; run by ezmlm
Reply-To: user@flink.apache.org
Date: Tue, 13 Sep 2016 18:14:16 +0000 (UTC)
From: amir bahmanyari
To: "user@flink.apache.org"
Message-ID: <1791313168.1968798.1473790456784@mail.yahoo.com>
References: <485614424.1525380.1473741336909.ref@mail.yahoo.com> <485614424.1525380.1473741336909@mail.yahoo.com>
Subject: Fw: Flink Cluster Load Distribution Question
MIME-Version: 1.0
Content-Type: multipart/alternative; boundary="----=_Part_1968797_820934192.1473790456772"
archived-at: Tue, 13 Sep 2016 18:14:33 -0000

------=_Part_1968797_820934192.1473790456772
Content-Type: text/html; charset=UTF-8
Content-Transfer-Encoding: quoted-printable

Hi Robert,
Sure, I am forwarding it to user. Sorry about that. I followed the "robot's" instructions :))
Topology: 4 Azure A11 CentOS 7 nodes (16 cores, 110 GB). Let's call them node1, node2, node3 and node4.
The Flink cluster has node1 running the JM and a TM. Three more TMs run on node2, node3 and node4 respectively.
I have a Beam app running with the Flink Runner underneath.
The input data is received by Beam TextIO(), reading 1.6 GB of data containing roughly 22 million tuples.
All nodes have identical flink-conf.yaml, masters and slaves contents, as follows:

flink-conf.yaml:
        jobmanager.rpc.address: node1
        jobmanager.rpc.port: 6123
        jobmanager.heap.mb: 1024
        taskmanager.heap.mb: 102400
        taskmanager.numberOfTaskSlots: 16
        taskmanager.memory.preallocate: false
        parallelism.default: 64
        jobmanager.web.port: 8081
        taskmanager.network.numberOfBuffers: 4096



masters:
node1:8081

slaves:
node1
node2
node3
node4
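
As a quick sanity check on the settings above, the sketch below multiplies out the slot count and compares it with parallelism.default. The values simply restate the flink-conf.yaml and slaves file above; the helper name `slot_summary` is made up for illustration and is not part of Flink.

```python
# Values restated from the flink-conf.yaml and slaves file above.
task_managers = ["node1", "node2", "node3", "node4"]  # one TM per slaves entry
slots_per_tm = 16          # taskmanager.numberOfTaskSlots
default_parallelism = 64   # parallelism.default


def slot_summary(num_tms, slots_per_tm, parallelism):
    """Hypothetical helper: relate configured parallelism to available slots."""
    total_slots = num_tms * slots_per_tm
    return {
        "total_slots": total_slots,
        # The cluster needs at least as many slots as the requested parallelism.
        "parallelism_fits": parallelism <= total_slots,
        # Slots left over once the default parallelism is in use.
        "spare_slots": max(total_slots - parallelism, 0),
    }


summary = slot_summary(len(task_managers), slots_per_tm, default_parallelism)
print(summary)
```

With these numbers the default parallelism of 64 matches the 64 configured slots exactly, so on paper the slot settings alone would not explain work landing on a single node.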

Everything looks normal at ./start-cluster.sh, and all daemons start on all nodes.
JM and TM log files get generated on all nodes.
The Dashboard shows how all slots are being used.
I deploy the Beam app to the cluster on node1, where the JM is running.
A *.out file gets generated as data is being processed. There is no *.out on the other nodes, just on node1, where I deployed the fat jar.
I tail -f the *.out log on node1 (master). It starts fine but slowly degrades and becomes extremely slow.
As we speak, I started the Beam app 13 hrs ago and it's still running.
How can I prove that ALL NODES are involved in processing the data at the same time, i.e. clustered?
Do the above configurations look OK for reasonable performance?
Given the above parameters, how can I improve the performance in this cluster?
What other information and/or dashboard screenshots are needed to clarify this issue?
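
One way to check whether all nodes take part is to look at which host each parallel subtask runs on. The sketch below counts subtasks per host from a job-vertex description such as the JSON the Flink web monitor serves. The field names (`subtasks`, `host`), the port number and the sample data are assumptions for illustration only; verify them against the actual response of the Flink version in use.

```python
from collections import Counter


def subtasks_per_host(vertex):
    """Count parallel subtasks per host in a job-vertex description.

    `vertex` is assumed to be a dict with a "subtasks" list whose items
    carry a "host" string such as "node2:41234" (assumed shape).
    """
    hosts = (s["host"].split(":")[0] for s in vertex.get("subtasks", []))
    return Counter(hosts)


# Made-up sample: 8 subtasks spread over the four nodes from this thread.
sample_vertex = {
    "name": "hypothetical-operator",
    "subtasks": [
        {"subtask": i, "host": "node%d:41234" % (i % 4 + 1)} for i in range(8)
    ],
}

counts = subtasks_per_host(sample_vertex)
for host in sorted(counts):
    print(host, counts[host])

# Hosts from the slaves file that received no subtask at all.
expected_hosts = {"node1", "node2", "node3", "node4"}
idle = sorted(expected_hosts - set(counts))
print("idle hosts:", idle)
```

If the real data showed every subtask on node1 and the other three hosts in the idle list, that would support the suspicion that the load is not being distributed.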
I used these websites to do the configuration:
Apache Flink: Cluster Setup
Apache Flink: Configuration

In the second link, there is a config recommendation for the following, but this parameter is not in the configuration file out of the box:
  • taskmanager.network.bufferSizeInBytes
Should I include it manually? Does it make any difference if the default value, i.e. 32 KB, doesn't get picked up?
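
To put the 32 KB figure in context, the arithmetic below multiplies the number of network buffers configured above by the buffer size. It assumes that 32 KB means 32768 bytes, and the sizing rule of thumb (slots-per-TM^2 * #TMs * 4) is recalled from the Flink configuration page rather than quoted from it; treat both as assumptions to verify.

```python
number_of_buffers = 4096        # taskmanager.network.numberOfBuffers (from above)
buffer_size_bytes = 32 * 1024   # assumed default of taskmanager.network.bufferSizeInBytes

slots_per_tm = 16               # taskmanager.numberOfTaskSlots
num_tms = 4                     # entries in the slaves file

# Memory each TaskManager sets aside for network buffers, in MiB.
network_memory_mib = number_of_buffers * buffer_size_bytes // (1024 * 1024)

# Rule of thumb (assumption): slots-per-TM^2 * #TMs * 4
recommended_buffers = slots_per_tm ** 2 * num_tms * 4

print(network_memory_mib, recommended_buffers)
```

Under these assumptions the 4096 buffers take about 128 MiB per TaskManager and match the rule of thumb for 16 slots on 4 TaskManagers. If the default really is 32768 bytes, leaving the key out of flink-conf.yaml should behave the same as setting it explicitly.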
Sorry for so many questions.
Please let me know.
I appreciate your help.
Cheers,
Amir-

----- Forwarded Message -----
From: Robert Metzger <rmetzger@apache.org>
To: "dev@flink.apache.org" <dev@flink.apache.org>; amir bahmanyari <amirtousa@yahoo.com>
Sent: Tuesday, September 13, 2016 1:15 AM
Subject: Re: Flink Cluster Load Distribution Question

Hi Amir,

I would recommend posting such questions to the user@flink mailing list in
the future. This list is meant for development-related topics.

I think we need more details to understand why your application is not
running properly. Can you quickly describe what your topology is doing?
Are you setting the parallelism to a value >= 1?

Regards,
Robert


On Tue, Sep 13, 2016 at 6:35 AM, amir bahmanyari <
amirtousa@yahoo.com.invalid> wrote:

> Hi Colleagues, Just joined this forum. I have done everything possible to
> get a 4-node Flink cluster to work properly & run a Beam app. It always
> generates system-output logs (*.out) on only one node. It's sooooooooo slow
> for 4 nodes being there. Seems like the load is not distributed amongst all
> 4 nodes but only one node. Most of the time the one where JM runs. I
> ran/tested it on a single node, and it ran the same load even faster.
> Not sure what's not being configured right.
> 1- Why am I getting the SystemOut .out log on only one server? All nodes
> get their TaskManager log files updated though.
> 2- Why don't I see load being distributed amongst all 4 nodes, but only
> one all the time?
> 3- Why does the Dashboard show a 0 (zero) for Send/Receive numbers for
> all Task Managers?
> The Dashboard shows all the right stuff. Top shows not much of resources
> being stressed on any of the nodes. I can share its contents if it helps
> diagnosing the issue. Thanks + I appreciate your valuable time, response &
> help. Amir-


------=_Part_1968797_820934192.1473790456772--