Subject: Re: Performance insights
From: Stephan Ewen <ewenstephan@gmail.com>
To: user@flink.apache.org
Date: Fri, 5 Feb 2016 16:09:06 +0100

Yes, that is definitely one possible explanation.

Another one could be data skew: increased parallelism does not take work off the most overloaded partition (but it does reduce the memory available to that partition). The web dashboard should help you check that.

On Fri, Feb 5, 2016 at 3:34 PM, Flavio Pompermaier <pompermaier@okkam.it> wrote:
> Sorry, I forgot to say that the numberOfTaskSlots is always 6.
>
> On Fri, Feb 5, 2016 at 3:32 PM, Flavio Pompermaier <pompermaier@okkam.it> wrote:
>> Hi to all,
>>
>> I'm testing how to speed up my Flink job, and I ran into the following
>> situations on my 6-node cluster (each node has 8 CPUs; one node also
>> runs the job manager):
>>
>> Scenario 1:
>> - # of network buffers: 4096
>> - parallelism: 36
>> - The job fails because there are not enough network buffers
>>
>> Scenario 2:
>> - # of network buffers: 8192
>> - parallelism: 36
>> - The job completes successfully in about 20 minutes
>>
>> Scenario 3:
>> - # of network buffers: 4096
>> - 6 nodes
>> - parallelism: 6
>> - The job completes successfully in about 11 minutes
>>
>> What can I infer from these results? That my job is I/O-bound, so having
>> more threads on the same machine accessing the disk simultaneously
>> degrades the performance of the pipeline?
>>
>> Best,
>> Flavio
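One way to check the data-skew hypothesis, besides the web dashboard, is a small diagnostic job that counts records per grouping key: if a few keys dominate, raising the parallelism cannot spread their work across more slots. A minimal sketch against the DataSet API (the input path and the one-key-per-line format are placeholder assumptions, not part of the original job):

    import org.apache.flink.api.common.functions.MapFunction;
    import org.apache.flink.api.java.DataSet;
    import org.apache.flink.api.java.ExecutionEnvironment;
    import org.apache.flink.api.java.tuple.Tuple2;

    public class SkewCheck {
        public static void main(String[] args) throws Exception {
            ExecutionEnvironment env = ExecutionEnvironment.getExecutionEnvironment();

            // Placeholder input: one grouping key per line, extracted from
            // the real job's data.
            DataSet<String> keys = env.readTextFile("hdfs:///path/to/keys");

            // Count records per key. A handful of very large counts means the
            // work is concentrated on a few partitions, and raising the
            // parallelism will not spread that work out.
            DataSet<Tuple2<String, Long>> counts = keys
                    .map(new MapFunction<String, Tuple2<String, Long>>() {
                        @Override
                        public Tuple2<String, Long> map(String key) {
                            return new Tuple2<>(key, 1L);
                        }
                    })
                    .groupBy(0)
                    .sum(1);

            counts.print();
        }
    }

If the top few counts dwarf the rest, the 11-minute Scenario 3 run is probably limited by those overloaded partitions rather than by the total parallelism.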
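On the network-buffer side, the number of buffers a TaskManager needs grows roughly with the square of the number of slots it runs, because each slot may exchange data with every parallel sender. The Flink documentation from this period suggests roughly #slots-per-TM^2 * #TMs * 4 as a starting point for taskmanager.network.numberOfBuffers; Scenario 1 shows a shuffle-heavy job can need considerably more, so treat the formula as a lower bound. A sketch of the relevant flink-conf.yaml entries, using the Scenario 2 values purely for illustration:

    # flink-conf.yaml -- values taken from Scenario 2, for illustration only
    taskmanager.numberOfTaskSlots: 6
    # Total network buffers per TaskManager; Scenario 1 failed with 4096
    # at parallelism 36, while Scenario 2 succeeded with 8192.
    taskmanager.network.numberOfBuffers: 8192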