Date: Wed, 30 Sep 2015 17:09:26 +0200
Subject: Re: All but one TMs connect when JM has more than 16G of memory
From: Robert Schmidtke
To: user@flink.apache.org

I should say I'm running the current Flink master branch.

On Wed, Sep 30, 2015 at 5:02 PM, Robert Schmidtke wrote:

> It's me again. This is a strange issue, I hope I managed to find the right
> keywords. I got 8 machines, 1 for the JM, the other 7 are TMs with 64G of
> memory each.
>
> When running my job like so:
>
> $FLINK_HOME/bin/flink run -m yarn-cluster -yjm 16384 -ytm 40960 -yn 7 .....
>
> The job completes without any problems. When running it like so:
>
> $FLINK_HOME/bin/flink run -m yarn-cluster -yjm 16385 -ytm 40960 -yn 7 .....
>
> (note the one more M of memory for the JM), the execution stalls,
> continuously reporting:
>
> .....
> TaskManager status (6/7)
> TaskManager status (6/7)
> TaskManager status (6/7)
> .....
>
> I did some poking around, but I couldn't find any direct correlation with
> the code.
>
> The JM log says:
>
> .....
> 16:49:01,893 INFO  org.apache.flink.yarn.ApplicationMaster$  - JVM Options:
> 16:49:01,893 INFO  org.apache.flink.yarn.ApplicationMaster$  -     -Xmx12289M
> .....
>
> but then continues to report
>
> .....
> 16:52:59,311 INFO  org.apache.flink.yarn.ApplicationMaster$$anonfun$2$$anon$1  - The user requested 7 containers, 6 running. 1 containers missing
> 16:52:59,831 INFO  org.apache.flink.yarn.ApplicationMaster$$anonfun$2$$anon$1  - The user requested 7 containers, 6 running. 1 containers missing
> 16:53:00,351 INFO  org.apache.flink.yarn.ApplicationMaster$$anonfun$2$$anon$1  - The user requested 7 containers, 6 running. 1 containers missing
> .....
>
> forever until I cancel the job.
>
> If you have any ideas I'm happy to try them out. Thanks in advance for any
> hints! Cheers.
>
> Robert
> --
> My GPG Key ID: 336E2680

--
My GPG Key ID: 336E2680
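
For reference, the -Xmx12289M value in the JM log is consistent with a cutoff of roughly 25% being subtracted from the -yjm value before the JVM is started. Below is a minimal sketch of that arithmetic. The cutoff ratio (0.25), the minimum cutoff (384 MB), and the function name are assumptions chosen for illustration; they are not taken from the log or confirmed against the Flink source.

```python
def jvm_heap_mb(container_mb: int,
                cutoff_ratio: float = 0.25,   # assumed ratio, not from the log
                min_cutoff_mb: int = 384) -> int:  # assumed lower bound
    """Heap left after reserving a cutoff of the container for non-heap memory.

    Hypothetical helper: the cutoff is the larger of a fixed fraction of the
    container size (truncated to whole MB) and a minimum value.
    """
    cutoff_mb = max(int(container_mb * cutoff_ratio), min_cutoff_mb)
    return container_mb - cutoff_mb


if __name__ == "__main__":
    # The two -yjm values from the commands in the report.
    for yjm in (16384, 16385):
        print(f"-yjm {yjm} -> -Xmx{jvm_heap_mb(yjm)}M")
    # -yjm 16384 -> -Xmx12288M
    # -yjm 16385 -> -Xmx12289M
```

Under these assumptions the failing run's heap setting differs from the working run's by a single megabyte, which suggests the heap value itself is probably not what keeps the seventh container from starting.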