Thanks. We found the relevant NodeManager log and figured out that the lost task manager was killed by YARN due to the memory limit. @zhijiang @Biao Liu Thanks for your help.

On Sun, Apr 21, 2019 at 11:45 PM zhijiang <wangzhijiang999@aliyun.com> wrote:
Hi Wenrui,

I think you could trace the log of the node manager, which contains the lifecycle of this task executor. Maybe this task executor was killed by the node manager because of memory overuse.
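A quick way to check this is to grep the NodeManager log for the container-kill message. This is only a sketch: the log path and exact wording vary by Hadoop version, so the excerpt below is a hypothetical sample written to /tmp, not output from this cluster.

```shell
#!/bin/sh
# Hypothetical NodeManager log excerpt; on a real host the log usually lives
# under /var/log/hadoop-yarn/ (path varies by distribution).
cat > /tmp/nodemanager-sample.log <<'EOF'
2019-04-20 09:30:01 INFO  ContainerManagerImpl: Start request for container_e05_1555000000000_0042_01_000007
2019-04-20 09:41:17 WARN  ContainersMonitorImpl: Container [pid=12345,containerID=container_e05_1555000000000_0042_01_000007] is running beyond physical memory limits. Current usage: 8.1 GB of 8 GB physical memory used; Killing container.
EOF

# "beyond physical memory limits" is the phrase YARN's ContainersMonitor
# logs right before it kills a container for memory overuse.
grep "beyond physical memory limits" /tmp/nodemanager-sample.log
```

If that grep matches, the NodeManager killed the container, and the task manager itself never had a chance to log a termination message.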

Best,
Zhijiang
------------------------------------------------------------------
From:Wenrui Meng <wenruimeng@gmail.com>
Send Time: Saturday, April 20, 2019 09:48
To:zhijiang <wangzhijiang999@aliyun.com>
Cc:Biao Liu <mmyy1110@gmail.com>; user <user@flink.apache.org>; tzulitai <tzulitai@apache.org>
Subject:Re: Netty channel closed at AKKA gated status

Attached are the last 10000 lines of the lost task manager's log. Can anyone help take a look?

Thanks,
Wenrui

On Fri, Apr 19, 2019 at 6:32 PM Wenrui Meng <wenruimeng@gmail.com> wrote:
Looked at a few similar instances. The lost task manager was indeed no longer active, since no log was printed for that task manager after the timestamp of the connection issue. I guess the task manager somehow died silently, without any exception or termination-related information being logged. I double-checked the lost task manager's host; GC, CPU, memory, network, and disk I/O all look good, without any spikes. Is there any other way the task manager could have been terminated? We run our jobs in a YARN cluster.

On Mon, Apr 15, 2019 at 10:47 PM zhijiang <wangzhijiang999@aliyun.com> wrote:
Hi Wenrui,

If you confirm the target task executor is still alive, you might further check whether there is a network connection issue between the job master and that task executor.

Best,
Zhijiang
------------------------------------------------------------------
From:Biao Liu <mmyy1110@gmail.com>
Send Time: Tuesday, April 16, 2019 10:14
To:Wenrui Meng <wenruimeng@gmail.com>
Subject:Re: Netty channel closed at AKKA gated status

Hi Wenrui,
If a task manager is killed (kill -9), it would have no chance to log anything. If the task manager exits since connection timeout, there would be something in log file. So it is probably killed by other user or operating system. Please check the log of operating system. BTW, I don't think "DEBUG log level" would help.

Wenrui Meng <wenruimeng@gmail.com> wrote on Tuesday, April 16, 2019 at 9:16 AM:
There is no exception or warning in the log of task manager `'athena592-phx2/10.80.118.166:44177'`. In addition, the cluster monitoring dashboard shows the host was not shut down either. It probably requires turning on DEBUG logging to get more useful information. If the task manager had been killed, I would expect a termination message in its log. If not, I don't know how to determine whether it was killed or just hit a connection timeout.



On Sun, Apr 14, 2019 at 7:22 PM zhijiang <wangzhijiang999@aliyun.com> wrote:
Hi Wenrui,

I think the AKKA gated issue and the inactive Netty channel are both caused by a task manager exiting or being killed. You should double-check the status and exit reason of task manager `'athena592-phx2/10.80.118.166:44177'`.

Best,
Zhijiang
------------------------------------------------------------------
From:Wenrui Meng <wenruimeng@gmail.com>
Send Time: Saturday, April 13, 2019 01:01
Cc:tzulitai <tzulitai@apache.org>
Subject:Netty channel closed at AKKA gated status

We encountered the Netty channel inactive issue while AKKA had gated that task manager. I'm wondering whether the channel was closed because of the AKKA gated status, since all messages to the task manager are dropped at that moment, which might cause a Netty channel exception. If so, should there be coordination between AKKA and Netty? The gated status is not intended to fail the system. Here is the stack trace for the exception:

2019-04-12 12:46:38.413 [flink-akka.actor.default-dispatcher-90] INFO  org.apache.flink.runtime.checkpoint.CheckpointCoordinator  - Completed checkpoint 3758 (3788228399 bytes in 5967 ms).
2019-04-12 12:49:14.175 [flink-akka.actor.default-dispatcher-65] WARN  akka.remote.ReliableDeliverySupervisor flink-akka.remote.default-remote-dispatcher-25 - Association with remote system [akka.tcp://flink@athena592-phx2:44487] has failed, address is now gated for [5000] ms. Reason: [Disassociated] 
2019-04-12 12:49:14.175 [flink-akka.actor.default-dispatcher-65] WARN  akka.remote.ReliableDeliverySupervisor flink-akka.remote.default-remote-dispatcher-25 - Association with remote system [akka.tcp://flink@athena592-phx2:44487] has failed, address is now gated for [5000] ms. Reason: [Disassociated] 
2019-04-12 12:49:14.230 [flink-akka.actor.default-dispatcher-65] INFO  org.apache.flink.runtime.executiongraph.ExecutionGraph  - id (14/96) (93fcbfc535a190e1edcfd913d5f304fe) switched from RUNNING to FAILED.
org.apache.flink.runtime.io.network.netty.exception.RemoteTransportException: Connection unexpectedly closed by remote task manager 'athena592-phx2/10.80.118.166:44177'. This might indicate that the remote task manager was lost.
        at org.apache.flink.runtime.io.network.netty.PartitionRequestClientHandler.channelInactive(PartitionRequestClientHandler.java:117)
        at org.apache.flink.shaded.netty4.io.netty.channel.AbstractChannelHandlerContext.invokeChannelInactive(AbstractChannelHandlerContext.java:237)
        at org.apache.flink.shaded.netty4.io.netty.channel.AbstractChannelHandlerContext.fireChannelInactive(AbstractChannelHandlerContext.java:223)
        at org.apache.flink.shaded.netty4.io.netty.channel.ChannelInboundHandlerAdapter.channelInactive(ChannelInboundHandlerAdapter.java:75)
        at org.apache.flink.shaded.netty4.io.netty.channel.AbstractChannelHandlerContext.invokeChannelInactive(AbstractChannelHandlerContext.java:237)
        at org.apache.flink.shaded.netty4.io.netty.channel.AbstractChannelHandlerContext.fireChannelInactive(AbstractChannelHandlerContext.java:223)
        at org.apache.flink.shaded.netty4.io.netty.handler.codec.ByteToMessageDecoder.channelInactive(ByteToMessageDecoder.java:294)
        at org.apache.flink.shaded.netty4.io.netty.channel.AbstractChannelHandlerContext.invokeChannelInactive(AbstractChannelHandlerContext.java:237)
        at org.apache.flink.shaded.netty4.io.netty.channel.AbstractChannelHandlerContext.fireChannelInactive(AbstractChannelHandlerContext.java:223)
        at org.apache.flink.shaded.netty4.io.netty.channel.DefaultChannelPipeline.fireChannelInactive(DefaultChannelPipeline.java:829)
        at org.apache.flink.shaded.netty4.io.netty.channel.AbstractChannel$AbstractUnsafe$7.run(AbstractChannel.java:610)
        at org.apache.flink.shaded.netty4.io.netty.util.concurrent.SingleThreadEventExecutor.runAllTasks(SingleThreadEventExecutor.java:357)
        at org.apache.flink.shaded.netty4.io.netty.channel.nio.NioEventLoop.run(NioEventLoop.java:357)
        at org.apache.flink.shaded.netty4.io.netty.util.concurrent.SingleThreadEventExecutor$2.run(SingleThreadEventExecutor.java:111)
        at java.lang.Thread.run(Thread.java:748)
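If the root cause does turn out to be container kills rather than a Netty/Akka bug, the usual mitigations on the Flink side are to reserve more off-heap headroom within the YARN container and to loosen the Akka timeouts. A sketch for flink-conf.yaml (option names as in Flink 1.x; all values below are illustrative assumptions to tune for your job, not recommendations from this thread):

```yaml
# Reserve a fraction of the container for non-heap memory (network buffers,
# RocksDB, JVM overhead) so the process stays under the YARN memory limit.
containerized.heap-cutoff-ratio: 0.3   # illustrative value

# Give Akka more slack before remote systems are gated or quarantined.
akka.ask.timeout: 60 s                 # illustrative value
akka.watch.heartbeat.interval: 10 s    # illustrative value
akka.watch.heartbeat.pause: 120 s      # illustrative value
```

Raising the heap cutoff shrinks the JVM heap rather than the container, so the trade-off is less heap for the job in exchange for staying under the NodeManager's physical-memory check.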