flink-user mailing list archives

From Wenrui Meng <wenruim...@gmail.com>
Subject Re: Netty channel closed at AKKA gated status
Date Tue, 23 Apr 2019 00:13:59 GMT
Thanks. We found the relevant NodeManager log and figured out that the lost
task manager was killed by YARN due to exceeding its memory limit. @zhijiang
<wangzhijiang999@aliyun.com> @Biao Liu <mmyy1110@gmail.com> Thanks for your
help.
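For readers hitting the same symptom: the YARN kill can usually be confirmed by scanning the NodeManager log for its standard memory-limit message. A minimal Python sketch; the log wording follows Hadoop YARN's usual format, and the sample lines are illustrative, not taken from this incident:

```python
import re

# Typical NodeManager message when a container exceeds its memory limit
# (wording per Hadoop YARN; exact phrasing may vary by version).
KILL_PATTERN = re.compile(
    r"Container \[pid=(\d+),containerID=(\S+)\] is running beyond "
    r"(physical|virtual) memory limits"
)

def find_memory_kills(log_lines):
    """Return (pid, container_id, kind) tuples for YARN memory kills."""
    hits = []
    for line in log_lines:
        m = KILL_PATTERN.search(line)
        if m:
            hits.append((m.group(1), m.group(2), m.group(3)))
    return hits

# Illustrative sample lines, not from the actual incident.
nm_sample = [
    "2019-04-12 12:49:10 INFO  ContainersMonitor: usage ok",
    "2019-04-12 12:49:13 WARN  ContainersMonitor: Container "
    "[pid=4242,containerID=container_e01_1555000000000_0001_01_000007] "
    "is running beyond physical memory limits. Killing container.",
]
print(find_memory_kills(nm_sample))
```

Running this over the NodeManager log of the host that lost the task manager should surface the kill, including which limit (physical vs. virtual) was exceeded.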

On Sun, Apr 21, 2019 at 11:45 PM zhijiang <wangzhijiang999@aliyun.com>
wrote:

> Hi Wenrui,
>
> I think you could trace the log of the node manager, which contains the
> lifecycle of this task executor. Maybe this task executor was killed by the
> node manager because of memory overuse.
>
> Best,
> Zhijiang
>
> ------------------------------------------------------------------
> From:Wenrui Meng <wenruimeng@gmail.com>
> Send Time:Sat, Apr 20, 2019 09:48
> To:zhijiang <wangzhijiang999@aliyun.com>
> Cc:Biao Liu <mmyy1110@gmail.com>; user <user@flink.apache.org>; tzulitai
> <tzulitai@apache.org>
> Subject:Re: Netty channel closed at AKKA gated status
>
> Attached the last 10,000 lines of the lost task manager's log. Can anyone
> help take a look?
>
> Thanks,
> Wenrui
>
> On Fri, Apr 19, 2019 at 6:32 PM Wenrui Meng <wenruimeng@gmail.com> wrote:
> Looked at a few similar instances. The lost task manager was indeed no
> longer active, since no log lines for that task manager were printed after
> the timestamp of the connection issue. I guess that task manager somehow
> died silently, without any exception or termination-related information
> being logged. I double-checked the lost task manager's host: GC, CPU,
> memory, network, and disk I/O all look good, without any spikes. Is there
> any other way the task manager could have been terminated? We run our jobs
> in a YARN cluster.
>
> On Mon, Apr 15, 2019 at 10:47 PM zhijiang <wangzhijiang999@aliyun.com>
> wrote:
> Hi Wenrui,
>
> You might further check whether there is a network connection issue
> between the job master and the target task executor, if you have confirmed
> that the target task executor is still alive.
>
> Best,
> Zhijiang
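A coarse way to follow this advice is a simple TCP reachability probe against the task manager's data port (e.g. athena592-phx2:44177 in this thread). A hedged sketch; it only tests whether the port accepts connections, not whether the process is healthy. The example probes a throwaway local listener rather than a real task manager:

```python
import socket

def can_connect(host, port, timeout=3.0):
    """Best-effort TCP reachability probe: True if the port accepts
    a connection within the timeout, False otherwise."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        return False

# Example against a listener we control; the real target would be the
# lost task manager's host and data port.
server = socket.socket()
server.bind(("127.0.0.1", 0))
server.listen(1)
port = server.getsockname()[1]
print(can_connect("127.0.0.1", port))  # True while the listener is up
server.close()
```

If the probe fails from the job master's host but succeeds locally on the task manager's host, that points at a network issue rather than a dead process.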
> ------------------------------------------------------------------
> From:Biao Liu <mmyy1110@gmail.com>
> Send Time:Tue, Apr 16, 2019 10:14
> To:Wenrui Meng <wenruimeng@gmail.com>
> Cc:zhijiang <wangzhijiang999@aliyun.com>; user <user@flink.apache.org>;
> tzulitai <tzulitai@apache.org>
> Subject:Re: Netty channel closed at AKKA gated status
>
> Hi Wenrui,
> If a task manager is killed (kill -9), it has no chance to log anything.
> If the task manager exited because of a connection timeout, there would
> be something in the log file. So it was probably killed by another user or
> by the operating system. Please check the operating system's log. BTW, I
> don't think the DEBUG log level would help.
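One concrete operating-system cause worth checking is the kernel OOM killer, which logs its victims to the system log. A small sketch for scanning those lines; the message wording varies by kernel version, and the sample line is illustrative, not from this incident:

```python
import re

# Classic kernel OOM-killer log line, e.g.:
#   "Out of memory: Kill process 4242 (java) score 900 or sacrifice child"
# (wording varies across kernel versions).
OOM_PATTERN = re.compile(r"Out of memory: Kill(ed)? process (\d+) \((\S+)\)")

def find_oom_kills(syslog_lines):
    """Return (pid, process_name) pairs for OOM-killer victims."""
    hits = []
    for line in syslog_lines:
        m = OOM_PATTERN.search(line)
        if m:
            hits.append((m.group(2), m.group(3)))
    return hits

# Illustrative sample line, not from the actual incident.
syslog_sample = [
    "Apr 12 12:49:13 athena592-phx2 kernel: Out of memory: "
    "Kill process 4242 (java) score 900 or sacrifice child",
]
print(find_oom_kills(syslog_sample))
```

In this thread the kill turned out to be YARN's, logged by the NodeManager rather than the kernel, but the two checks are complementary when a JVM dies without writing anything to its own log.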
>
> Wenrui Meng <wenruimeng@gmail.com> wrote on Tue, Apr 16, 2019 at 9:16 AM:
> There is no exception or any warning in the task manager
> `'athena592-phx2/10.80.118.166:44177'` log. In addition, the host was not
> shut down either, according to the cluster monitoring dashboard. It
> probably requires turning on DEBUG logging to get more useful information.
> If the task manager gets killed, I assume there will be termination-related
> entries in the task manager log. If not, I don't know how to determine
> whether the task manager was killed or it was just a connection timeout.
>
>
>
> On Sun, Apr 14, 2019 at 7:22 PM zhijiang <wangzhijiang999@aliyun.com>
> wrote:
> Hi Wenrui,
>
> I think the Akka gated issue and the inactive Netty channel are both caused
> by some task manager exiting or being killed. You should double-check the
> status and reason of this task manager `'athena592-phx2/10.80.118.166:44177'`.
>
> Best,
> Zhijiang
> ------------------------------------------------------------------
> From:Wenrui Meng <wenruimeng@gmail.com>
> Send Time:Sat, Apr 13, 2019 01:01
> To:user <user@flink.apache.org>
> Cc:tzulitai <tzulitai@apache.org>
> Subject:Netty channel closed at AKKA gated status
>
> We encountered the Netty channel inactive issue while Akka had gated that
> task manager. I'm wondering whether the channel was closed because of the
> Akka gated status, since all messages to the task manager will be dropped
> at that moment, which might cause a Netty channel exception. If so, should
> we have coordination between Akka and Netty? The gated status is not
> intended to fail the system. Here is the stack trace of the exception:
>
> 2019-04-12 12:46:38.413 [flink-akka.actor.default-dispatcher-90] INFO
> org.apache.flink.runtime.checkpoint.CheckpointCoordinator  - Completed
> checkpoint 3758 (3788228399 bytes in 5967 ms).
> 2019-04-12 12:49:14.175 [flink-akka.actor.default-dispatcher-65] WARN
> akka.remote.ReliableDeliverySupervisor
> flink-akka.remote.default-remote-dispatcher-25 - Association with remote
> system [akka.tcp://flink@athena592-phx2:44487] has failed, address is now
> gated for [5000] ms. Reason: [Disassociated]
> 2019-04-12 12:49:14.175 [flink-akka.actor.default-dispatcher-65] WARN
> akka.remote.ReliableDeliverySupervisor
> flink-akka.remote.default-remote-dispatcher-25 - Association with remote
> system [akka.tcp://flink@athena592-phx2:44487] has failed, address is now
> gated for [5000] ms. Reason: [Disassociated]
> 2019-04-12 12:49:14.230 [flink-akka.actor.default-dispatcher-65] INFO
> org.apache.flink.runtime.executiongraph.ExecutionGraph  - id (14/96)
> (93fcbfc535a190e1edcfd913d5f304fe) switched from RUNNING to FAILED.
> org.apache.flink.runtime.io.network.netty.exception.RemoteTransportException:
> Connection unexpectedly closed by remote task manager 'athena592-phx2/
> 10.80.118.166:44177'. This might indicate that the remote task manager
> was lost.
>         at
> org.apache.flink.runtime.io.network.netty.PartitionRequestClientHandler.channelInactive(PartitionRequestClientHandler.java:117)
>         at
> org.apache.flink.shaded.netty4.io.netty.channel.AbstractChannelHandlerContext.invokeChannelInactive(AbstractChannelHandlerContext.java:237)
>         at
> org.apache.flink.shaded.netty4.io.netty.channel.AbstractChannelHandlerContext.fireChannelInactive(AbstractChannelHandlerContext.java:223)
>         at
> org.apache.flink.shaded.netty4.io.netty.channel.ChannelInboundHandlerAdapter.channelInactive(ChannelInboundHandlerAdapter.java:75)
>         at
> org.apache.flink.shaded.netty4.io.netty.channel.AbstractChannelHandlerContext.invokeChannelInactive(AbstractChannelHandlerContext.java:237)
>         at
> org.apache.flink.shaded.netty4.io.netty.channel.AbstractChannelHandlerContext.fireChannelInactive(AbstractChannelHandlerContext.java:223)
>         at
> org.apache.flink.shaded.netty4.io.netty.handler.codec.ByteToMessageDecoder.channelInactive(ByteToMessageDecoder.java:294)
>         at
> org.apache.flink.shaded.netty4.io.netty.channel.AbstractChannelHandlerContext.invokeChannelInactive(AbstractChannelHandlerContext.java:237)
>         at
> org.apache.flink.shaded.netty4.io.netty.channel.AbstractChannelHandlerContext.fireChannelInactive(AbstractChannelHandlerContext.java:223)
>         at
> org.apache.flink.shaded.netty4.io.netty.channel.DefaultChannelPipeline.fireChannelInactive(DefaultChannelPipeline.java:829)
>         at
> org.apache.flink.shaded.netty4.io.netty.channel.AbstractChannel$AbstractUnsafe$7.run(AbstractChannel.java:610)
>         at
> org.apache.flink.shaded.netty4.io.netty.util.concurrent.SingleThreadEventExecutor.runAllTasks(SingleThreadEventExecutor.java:357)
>         at
> org.apache.flink.shaded.netty4.io.netty.channel.nio.NioEventLoop.run(NioEventLoop.java:357)
>         at
> org.apache.flink.shaded.netty4.io.netty.util.concurrent.SingleThreadEventExecutor$2.run(SingleThreadEventExecutor.java:111)
>         at java.lang.Thread.run(Thread.java:748)
>
>
>
>
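Following up on the resolution at the top of the thread (the task manager was killed by YARN for exceeding its memory limit): a common mitigation in Flink-on-YARN deployments of that era was to reserve more container headroom for non-heap memory. A hedged flink-conf.yaml sketch; option names per the Flink 1.7/1.8 documentation, values illustrative, so verify against your Flink version:

```yaml
# Reserve a larger share of the YARN container for non-heap memory
# (native/direct buffers, metaspace), so JVM heap plus overhead stays
# under the container limit that YARN enforces.
containerized.heap-cutoff-ratio: 0.3   # fraction cut off the container size (default 0.25)
containerized.heap-cutoff-min: 768     # minimum cutoff in MB (default 600)
```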
