From: Xintong Song <tonysong820@gmail.com>
Date: Wed, 10 Jul 2019 14:20:31 +0800
Subject: Re: YarnResourceManager unresponsive under heavy containers allocations
To: qi luo
Cc: Haibo Sun, user@flink.apache.org, Anyang Hu

Thanks for the kind offer, Qi.

I think this work should not take much time, so I can take care of it. It's just that the community is currently under a feature freeze for release 1.9, so we need to wait until the release branch has been cut.

Thank you~

Xintong Song


On Wed, Jul 10, 2019 at 1:55 PM qi luo wrote:

> Thanks Xintong and Haibo, I've found the fix in the blink branch.
>
> We're also glad to help contribute this patch to the community version, in
> case you don't have time.
>
> Regards,
> Qi
>
> On Jul 10, 2019, at 11:51 AM, Haibo Sun wrote:
>
> Hi, Qi
>
> Sorry, after talking to Xintong Song offline, I realized that some of what
> I said before is wrong. Please refer to Xintong's answer.
>
> Best,
> Haibo
>
> At 2019-07-10 11:37:16, "Xintong Song" wrote:
>
> Hi Qi,
>
> Thanks for reporting this problem. I think you are right. Starting a large
> number of TMs in the main thread on YARN can take a relatively long time,
> causing the RM to become unresponsive. In our enterprise version, Blink,
> we actually have a thread pool for starting TMs. I think we should
> contribute this feature to the community version as well. I created a JIRA
> ticket from which you can track the progress of this issue.
>
> For solving your problem at the moment, I agree with Haibo that configuring
> a larger registration timeout could be a workaround.
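[The registration timeout Haibo and Xintong refer to is set in `flink-conf.yaml`. A minimal sketch; the 10-minute value below is only an illustration, not a recommendation:]

```yaml
# flink-conf.yaml
# How long a TaskManager keeps trying to register with the ResourceManager
# before giving up and terminating (default: 5 min).
taskmanager.registration.timeout: 10 min
```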
>
> Thank you~
>
> Xintong Song
>
>
> On Wed, Jul 10, 2019 at 10:37 AM Haibo Sun wrote:
>
>> Hi, Qi
>>
>> In our experience, there is no problem allocating more than 1000
>> containers when the registration timeout is set to 5 minutes. Perhaps
>> there are other reasons? You can also try increasing the value of
>> `taskmanager.registration.timeout`. As for allocating containers using
>> multiple threads, I personally think it's going to get very complicated;
>> the more recommended way is to move the waiting work into asynchronous
>> processing, so as to free up the main thread.
>>
>> Best,
>> Haibo
>>
>>
>> At 2019-07-09 21:05:51, "qi luo" wrote:
>>
>> Hi guys,
>>
>> We're using the latest version of Flink's YarnResourceManager, but our
>> job startup occasionally hangs when allocating many YARN containers
>> (e.g. >1000). I checked the related code in YarnResourceManager as below:
>>
>> <PastedGraphic-1.tiff>
>>
>> It seems that it handles all allocated containers and starts TMs in the
>> main thread. Thus, when container allocations are heavy, the RM thread
>> becomes unresponsive (e.g. no response to TM heartbeats; see TM log below).
>>
>> Any idea on how to better handle such a case (e.g. multi-threading to
>> handle allocated containers) would be much appreciated. Thanks!
>>
>> Regards,
>> Qi
>>
>>
>> ------------------------
>> TM log:
>>
>> 2019-07-09 13:56:59,110 INFO  org.apache.flink.runtime.taskexecutor.TaskExecutor  - Connecting to ResourceManager akka.tcp://flink@xxx/user/resourcemanager(00000000000000000000000000000000).
>> 2019-07-09 14:00:01,138 INFO  org.apache.flink.runtime.taskexecutor.TaskExecutor  - Could not resolve ResourceManager address akka.tcp://flink@xxx/user/resourcemanager, retrying in 10000 ms: Ask timed out on [ActorSelection[Anchor(akka.tcp://flink@xxx/), Path(/user/resourcemanager)]] after [182000 ms]. Sender[null] sent message of type "akka.actor.Identify"..
>> 2019-07-09 14:01:59,137 ERROR org.apache.flink.runtime.taskexecutor.TaskExecutor  - Fatal error occurred in TaskExecutor akka.tcp://flink@xxx/user/taskmanager_0.
>> org.apache.flink.runtime.taskexecutor.exceptions.RegistrationTimeoutException: Could not register at the ResourceManager within the specified maximum registration duration 300000 ms. This indicates a problem with this instance. Terminating now.
>> 	at org.apache.flink.runtime.taskexecutor.TaskExecutor.registrationTimeout(TaskExecutor.java:1023)
>> 	at org.apache.flink.runtime.taskexecutor.TaskExecutor.lambda$startRegistrationTimeout$3(TaskExecutor.java:1009)
>> 	at org.apache.flink.runtime.rpc.akka.AkkaRpcActor.handleRunAsync(AkkaRpcActor.java:332)
>> 	at org.apache.flink.runtime.rpc.akka.AkkaRpcActor.handleRpcMessage(AkkaRpcActor.java:158)
>> 	at org.apache.flink.runtime.rpc.akka.AkkaRpcActor.onReceive(AkkaRpcActor.java:142)
>> 	at akka.actor.UntypedActor$$anonfun$receive$1.applyOrElse(UntypedActor.scala:165)
>> 	at akka.actor.Actor$class.aroundReceive(Actor.scala:502)
>> 	at akka.actor.UntypedActor.aroundReceive(UntypedActor.scala:95)
>> 	at akka.actor.ActorCell.receiveMessage(ActorCell.scala:526)
>> 	at akka.actor.ActorCell.invoke(ActorCell.scala:495)
>> 	at akka.dispatch.Mailbox.processMailbox(Mailbox.scala:257)
>> 	at akka.dispatch.Mailbox.run(Mailbox.scala:224)
>> 	at akka.dispatch.Mailbox.exec(Mailbox.scala:234)
>> 	at scala.concurrent.forkjoin.ForkJoinTask.doExec(ForkJoinTask.java:260)
>> 	at scala.concurrent.forkjoin.ForkJoinPool$WorkQueue.runTask(ForkJoinPool.java:1339)
>> 	at scala.concurrent.forkjoin.ForkJoinPool.runWorker(ForkJoinPool.java:1979)
>> 	at scala.concurrent.forkjoin.ForkJoinWorkerThread.run(ForkJoinWorkerThread.java:107)
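[The thread-pool approach Xintong describes for Blink could look roughly like the following. This is a hypothetical sketch, not Flink's actual code: the class, method names, and pool size are invented for illustration. The point is that the RM main thread only does a cheap submit() per container, while the slow TM launch runs in worker threads.]

```java
import java.util.List;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.TimeUnit;
import java.util.concurrent.atomic.AtomicInteger;

// Hypothetical sketch: offload per-container TaskManager startup to a worker
// pool so the ResourceManager's main (RPC) thread stays responsive.
class ContainerLauncher {
    // Pool size of 8 is an arbitrary example value.
    private final ExecutorService startupPool = Executors.newFixedThreadPool(8);
    private final AtomicInteger started = new AtomicInteger();

    // Called from the RM main thread with the containers YARN just allocated.
    // Only the cheap submit() happens here; the slow launch runs in the pool.
    void onContainersAllocated(List<String> containerIds) {
        for (String id : containerIds) {
            startupPool.submit(() -> launchTaskManager(id));
        }
    }

    // Placeholder for the expensive part: building the launch context and
    // asking the NodeManager to start the TM process.
    private void launchTaskManager(String containerId) {
        started.incrementAndGet();
    }

    // Drains the pool and returns how many launches completed.
    int awaitAll() {
        startupPool.shutdown();
        try {
            startupPool.awaitTermination(1, TimeUnit.MINUTES);
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
        }
        return started.get();
    }
}
```

Real code would also need to marshal any RM-state updates back onto the main RPC thread, which is where Haibo's note about asynchronous processing getting complicated comes in.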