flink-user mailing list archives

From qi luo <luoqi...@gmail.com>
Subject Re: YarnResourceManager unresponsive under heavy containers allocations
Date Wed, 10 Jul 2019 11:44:10 GMT
Got it, thank you!

> On Jul 10, 2019, at 2:20 PM, Xintong Song <tonysong820@gmail.com> wrote:
> 
> Thanks for the kind offer, Qi.
> 
> I think this work should not take much time, so I can take care of it. It's just that the community is currently under feature freeze for release 1.9, so we need to wait until the release branch has been cut.
> 
> Thank you~
> Xintong Song
> 
> 
> On Wed, Jul 10, 2019 at 1:55 PM qi luo <luoqi.bd@gmail.com> wrote:
> Thanks Xintong and Haibo, I’ve found the fix in the blink branch. 
> 
> We’d also be glad to help contribute this patch to the community version, in case you don’t have time.
> 
> Regards,
> Qi
> 
>> On Jul 10, 2019, at 11:51 AM, Haibo Sun <sunhaibotb@163.com> wrote:
>> 
>> Hi, Qi
>> 
>> Sorry, after talking to Xintong Song offline, I realized that some of what I said before was wrong. Please refer to Xintong's answer.
>> 
>> Best,
>> Haibo
>> 
>> At 2019-07-10 11:37:16, "Xintong Song" <tonysong820@gmail.com> wrote:
>> Hi Qi,
>> 
>> Thanks for reporting this problem. I think you are right. Starting a large number of TMs in the main thread on YARN can take a relatively long time, causing the RM to become unresponsive. In our enterprise version Blink, we actually have a thread pool for starting TMs. I think we should contribute this feature to the community version as well. I created a JIRA ticket <https://issues.apache.org/jira/browse/FLINK-13184> from which you can track the progress of this issue.
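>> Roughly, the idea is something like the sketch below (illustrative only, not the actual Blink or FLINK-13184 change; the pool size and the startTaskExecutorInContainer() helper are made-up names):
>> 
>>     import java.util.List;
>>     import java.util.concurrent.ExecutorService;
>>     import java.util.concurrent.Executors;
>>     import org.apache.hadoop.yarn.api.records.Container;
>> 
>>     // Hypothetical fragment of a resource manager class: hand the per-container
>>     // TaskExecutor launch off to a dedicated pool instead of doing it all in the
>>     // RM's main (RPC) thread.
>>     private final ExecutorService containerStartExecutor = Executors.newFixedThreadPool(32);
>> 
>>     public void onContainersAllocated(List<Container> containers) {
>>         for (Container container : containers) {
>>             // only cheap bookkeeping stays in the main thread; the slow, blocking
>>             // launch of the TaskExecutor runs in the pool
>>             containerStartExecutor.execute(() -> startTaskExecutorInContainer(container));
>>         }
>>     }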
>> 
>> To work around the problem for now, I agree with Haibo that configuring a larger registration timeout could help.
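>> For example, in flink-conf.yaml (the 10-minute value is only an illustration; choose what fits your cluster):
>> 
>>     taskmanager.registration.timeout: 10 min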
>> 
>> Thank you~
>> Xintong Song
>> 
>> 
>> On Wed, Jul 10, 2019 at 10:37 AM Haibo Sun <sunhaibotb@163.com> wrote:
>> Hi, Qi
>> 
>> According to our experience, it is not a problem to allocate more than 1000 containers when the registration timeout is set to 5 minutes. Perhaps there are other reasons? You can also try increasing the value of `taskmanager.registration.timeout`. As for allocating containers with multiple threads, I personally think that would get very complicated; the more recommended way is to move the waiting work into asynchronous processing, so as to free up the main thread.
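>> As a sketch of what I mean by asynchronous processing, using java.util.concurrent.CompletableFuture (illustrative only; ioExecutor, getMainThreadExecutor(), launchTaskExecutor() and markContainerStarted() are stand-in names, not actual Flink APIs):
>> 
>>     // Run the slow, blocking launch work off the main thread, then hop back onto
>>     // the main thread only for the state updates that must stay single-threaded.
>>     CompletableFuture
>>         .runAsync(() -> launchTaskExecutor(container), ioExecutor)
>>         .thenRunAsync(() -> markContainerStarted(container), getMainThreadExecutor());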
>> 
>> Best,
>> Haibo
>> 
>> 
>> At 2019-07-09 21:05:51, "qi luo" <luoqi.bd@gmail.com> wrote:
>> Hi guys,
>> 
>> We’re using the latest version of Flink’s YarnResourceManager, but our job startup occasionally hangs when allocating many YARN containers (e.g. >1000). I checked the related code in YarnResourceManager, as below:
>> 
>> [attached screenshot: YarnResourceManager container-allocation code]
>> 
>> It seems that it handles all allocated containers and starts TMs in the main thread. Thus, when container allocations are heavy, the RM thread becomes unresponsive (e.g. no response to TM heartbeats; see the TM log below).
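>> In case the screenshot does not come through, the handling I mean is roughly of this shape (a simplified paraphrase, not the exact Flink source; names are approximate):
>> 
>>     // Everything below runs in the ResourceManager's single main (RPC) thread.
>>     public void onContainersAllocated(List<Container> containers) {
>>         for (Container container : containers) {
>>             try {
>>                 // building the launch context and asking the YARN NodeManager to
>>                 // start the TaskManager are comparatively slow, blocking steps
>>                 nodeManagerClient.startContainer(container, createTaskExecutorLaunchContext(container));
>>             } catch (Throwable t) {
>>                 // on failure the container is released and a new one requested
>>             }
>>         }
>>     }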
>> 
>> Any ideas on how to better handle such a case (e.g. multi-threading to handle allocated containers) would be much appreciated. Thanks!
>> 
>> Regards,
>> Qi 
>> 
>> 
>> ———————————————————————— 
>> TM log:
>> 
>> 2019-07-09 13:56:59,110 INFO  org.apache.flink.runtime.taskexecutor.TaskExecutor           - Connecting to ResourceManager akka.tcp://flink@xxx/user/resourcemanager(00000000000000000000000000000000).
>> 2019-07-09 14:00:01,138 INFO  org.apache.flink.runtime.taskexecutor.TaskExecutor           - Could not resolve ResourceManager address akka.tcp://flink@xxx/user/resourcemanager, retrying in 10000 ms: Ask timed out on [ActorSelection[Anchor(akka.tcp://flink@xxx/), Path(/user/resourcemanager)]] after [182000 ms]. Sender[null] sent message of type "akka.actor.Identify"..
>> 2019-07-09 14:01:59,137 ERROR org.apache.flink.runtime.taskexecutor.TaskExecutor           - Fatal error occurred in TaskExecutor akka.tcp://flink@xxx/user/taskmanager_0.
>> org.apache.flink.runtime.taskexecutor.exceptions.RegistrationTimeoutException: Could not register at the ResourceManager within the specified maximum registration duration 300000 ms. This indicates a problem with this instance. Terminating now.
>> 	at org.apache.flink.runtime.taskexecutor.TaskExecutor.registrationTimeout(TaskExecutor.java:1023)
>> 	at org.apache.flink.runtime.taskexecutor.TaskExecutor.lambda$startRegistrationTimeout$3(TaskExecutor.java:1009)
>> 	at org.apache.flink.runtime.rpc.akka.AkkaRpcActor.handleRunAsync(AkkaRpcActor.java:332)
>> 	at org.apache.flink.runtime.rpc.akka.AkkaRpcActor.handleRpcMessage(AkkaRpcActor.java:158)
>> 	at org.apache.flink.runtime.rpc.akka.AkkaRpcActor.onReceive(AkkaRpcActor.java:142)
>> 	at akka.actor.UntypedActor$$anonfun$receive$1.applyOrElse(UntypedActor.scala:165)
>> 	at akka.actor.Actor$class.aroundReceive(Actor.scala:502)
>> 	at akka.actor.UntypedActor.aroundReceive(UntypedActor.scala:95)
>> 	at akka.actor.ActorCell.receiveMessage(ActorCell.scala:526)
>> 	at akka.actor.ActorCell.invoke(ActorCell.scala:495)
>> 	at akka.dispatch.Mailbox.processMailbox(Mailbox.scala:257)
>> 	at akka.dispatch.Mailbox.run(Mailbox.scala:224)
>> 	at akka.dispatch.Mailbox.exec(Mailbox.scala:234)
>> 	at scala.concurrent.forkjoin.ForkJoinTask.doExec(ForkJoinTask.java:260)
>> 	at scala.concurrent.forkjoin.ForkJoinPool$WorkQueue.runTask(ForkJoinPool.java:1339)
>> 	at scala.concurrent.forkjoin.ForkJoinPool.runWorker(ForkJoinPool.java:1979)
>> 	at scala.concurrent.forkjoin.ForkJoinWorkerThread.run(ForkJoinWorkerThread.java:107)
>> 
> 

