Subject: Re: Optimal Configuration for Cluster
From: Welly Tambunan
To: user@flink.apache.org
Date: Tue, 23 Feb 2016 12:36:09 +0700

Hi Fabian,

Previously, when using Flink 0.9-0.10, we started the cluster in either
streaming mode or batch mode. I see that this option is gone in the Flink 1.0
snapshot. So is this now taken care of by Flink and optimized by the runtime?

On Mon, Feb 22, 2016 at 5:26 PM, Fabian Hueske wrote:

> Hi Welly,
>
> sorry for the late response.
>
> The number of network buffers primarily depends on the maximum parallelism
> of your job.
> The given formula assumes a specific cluster configuration (1 task manager
> per machine, one parallel task per CPU).
> The formula can be translated to:
>
> taskmanager.network.numberOfBuffers: p ^ 2 * t * 4
>
> where p is the maximum parallelism of the job and t is the number of task
> managers.
> You can process more than one parallel task per TM if you configure more
> than one processing slot per machine (taskmanager.numberOfTaskSlots).
> The TM will divide its memory among all its slots. So it would be possible
> to start one TM for each machine with 100GB+ memory and 48 slots each.
>
> We can compute the number of network buffers if you give a few more
> details about your setup:
> - How many task managers do you start? I assume more than one TM per
>   machine given that you assign only 4GB of memory out of 128GB to each TM.
> - What is the maximum parallelism of your program?
> - How many processing slots do you configure for each TM?
>
> In general, pipelined shuffles with a high parallelism require a lot of
> memory.
> If you configure batch instead of pipelined transfer, the memory
> requirement goes down
> (ExecutionConfig.setExecutionMode(ExecutionMode.BATCH)).
>
> Eventually, we want to merge the network buffer and the managed memory
> pools. So the "taskmanager.network.numberOfBuffers" configuration will
> hopefully disappear at some point in the future.
>
> Best, Fabian
>
> 2016-02-19 9:34 GMT+01:00 Welly Tambunan:
>
>> Hi All,
>>
>> We are trying to run our job on a cluster with the following specification:
>>
>> 1. # of machines: 16
>> 2. memory: 128 GB
>> 3. # of cores: 48
>>
>> However, when we try to run it we get an exception:
>>
>> "insufficient number of network buffers. 48 required but only 10
>> available. the total number of network buffers is currently set to 2048"
>>
>> After looking at the documentation, we set the configuration based on the
>> docs:
>>
>> taskmanager.network.numberOfBuffers: # core ^ 2 * # machine * 4
>>
>> However, we then hit another error from the JVM:
>>
>> java.io.IOException: Cannot allocate network buffer pool: Could not
>> allocate enough memory segments for NetworkBufferPool (required (Mb): 2304,
>> allocated (Mb): 698, missing (Mb): 1606). Cause: Java heap space
>>
>> We then adjusted taskmanager.heap.mb: 4096
>>
>> Finally the cluster is running.
>>
>> However, I'm still not sure about the configuration, or whether adjusting
>> the task manager heap is really the right way to fine-tune it. So my
>> questions are:
>>
>> 1. Am I doing it right for numberOfBuffers?
>> 2. How much should we allocate to taskmanager.heap.mb given the
>>    information above?
>> 3. Any suggestions on which configuration options we need to set to make
>>    it optimal for the cluster?
>> 4. Is there any chance that this will be resolved automatically by the
>>    memory/network buffer manager?
>>
>> Thanks a lot for the help
>>
>> Cheers
>>
>> --
>> Welly Tambunan
>> Triplelands
>>
>> http://weltam.wordpress.com
>> http://www.triplelands.com

--
Welly Tambunan
Triplelands

http://weltam.wordpress.com
http://www.triplelands.com
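
For reference, the rule of thumb quoted in this thread (p ^ 2 * t * 4) can be worked through with a small sketch. The function names are hypothetical, and the 32 KiB buffer size is an assumption about the segment size, not something stated in the thread. The inputs p = 48 and t = 16 simply mirror the "# core" and "# machine" values from the original question; as Fabian notes, the real values depend on the job's maximum parallelism and the number of task managers actually started.

```python
def network_buffers(max_parallelism: int, task_managers: int) -> int:
    """Rule of thumb from the thread: p^2 * t * 4."""
    return max_parallelism ** 2 * task_managers * 4


def buffer_memory_mib(num_buffers: int, buffer_size_bytes: int = 32 * 1024) -> float:
    """Memory needed for the buffers, in MiB.

    buffer_size_bytes defaults to 32 KiB, which is an assumption here.
    """
    return num_buffers * buffer_size_bytes / (1024 * 1024)


# Illustrative inputs only: p = 48 (cores per machine), t = 16 (machines).
buffers = network_buffers(max_parallelism=48, task_managers=16)
print(buffers)                     # 147456
print(buffer_memory_mib(buffers))  # 4608.0
```

Under these assumptions the buffers alone would need more memory than a 4096 MB task manager heap, which is one reason the answers to Fabian's three questions matter before settling on a value.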