flink-issues mailing list archives

From "Charith Dhanushka Wickramarachchi (JIRA)" <j...@apache.org>
Subject [jira] [Commented] (FLINK-4650) Frequent task manager disconnects from JobManager
Date Tue, 15 Aug 2017 14:04:00 GMT

    [ https://issues.apache.org/jira/browse/FLINK-4650?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16127281#comment-16127281 ]

Charith Dhanushka Wickramarachchi commented on FLINK-4650:
----------------------------------------------------------

I am seeing similar behavior on 1.3.1. Here is the stack trace. I could not find anything
specific in the individual task manager logs that might have caused this issue.

Caused by: java.io.IOException: Thread 'SortMerger Reading Thread' terminated due to an exception: Connecting the channel failed: Connecting to remote task manager + 'worker/127.0.1.1:44352' has failed. This might indicate that the remote task manager has been lost.
	at org.apache.flink.runtime.operators.sort.UnilateralSortMerger$ThreadBase.run(UnilateralSortMerger.java:800)
Caused by: java.io.IOException: Connecting the channel failed: Connecting to remote task manager + 'worker/127.0.1.1:44352' has failed. This might indicate that the remote task manager has been lost.
	at org.apache.flink.runtime.io.network.netty.PartitionRequestClientFactory$ConnectingChannel.waitForChannel(PartitionRequestClientFactory.java:196)
	at org.apache.flink.runtime.io.network.netty.PartitionRequestClientFactory$ConnectingChannel.access$000(PartitionRequestClientFactory.java:131)
	at org.apache.flink.runtime.io.network.netty.PartitionRequestClientFactory.createPartitionRequestClient(PartitionRequestClientFactory.java:83)
	at org.apache.flink.runtime.io.network.netty.NettyConnectionManager.createPartitionRequestClient(NettyConnectionManager.java:59)
	at org.apache.flink.runtime.io.network.partition.consumer.RemoteInputChannel.requestSubpartition(RemoteInputChannel.java:112)
	at org.apache.flink.runtime.io.network.partition.consumer.SingleInputGate.requestPartitions(SingleInputGate.java:433)
	at org.apache.flink.runtime.io.network.partition.consumer.SingleInputGate.getNextBufferOrEvent(SingleInputGate.java:455)
	at org.apache.flink.runtime.io.network.api.reader.AbstractRecordReader.getNextRecord(AbstractRecordReader.java:86)
	at org.apache.flink.runtime.io.network.api.reader.MutableRecordReader.next(MutableRecordReader.java:42)
	at org.apache.flink.runtime.operators.util.ReaderIterator.next(ReaderIterator.java:59)
	at org.apache.flink.runtime.operators.sort.UnilateralSortMerger$ReadingThread.go(UnilateralSortMerger.java:973)
	at org.apache.flink.runtime.operators.sort.UnilateralSortMerger$ThreadBase.run(UnilateralSortMerger.java:796)
Caused by: org.apache.flink.runtime.io.network.netty.exception.RemoteTransportException: Connecting to remote task manager + 'worker/127.0.1.1:44352' has failed. This might indicate that the remote task manager has been lost.
	at org.apache.flink.runtime.io.network.netty.PartitionRequestClientFactory$ConnectingChannel.operationComplete(PartitionRequestClientFactory.java:219)
	at org.apache.flink.runtime.io.network.netty.PartitionRequestClientFactory$ConnectingChannel.operationComplete(PartitionRequestClientFactory.java:131)
	at io.netty.util.concurrent.DefaultPromise.notifyListener0(DefaultPromise.java:680)
	at io.netty.util.concurrent.DefaultPromise.notifyListeners0(DefaultPromise.java:603)
	at io.netty.util.concurrent.DefaultPromise.notifyListeners(DefaultPromise.java:563)
	at io.netty.util.concurrent.DefaultPromise.tryFailure(DefaultPromise.java:424)
	at io.netty.channel.nio.AbstractNioChannel$AbstractNioUnsafe.fulfillConnectPromise(AbstractNioChannel.java:268)
	at io.netty.channel.nio.AbstractNioChannel$AbstractNioUnsafe.finishConnect(AbstractNioChannel.java:284)
	at io.netty.channel.nio.NioEventLoop.processSelectedKey(NioEventLoop.java:528)
	at io.netty.channel.nio.NioEventLoop.processSelectedKeysOptimized(NioEventLoop.java:468)
	at io.netty.channel.nio.NioEventLoop.processSelectedKeys(NioEventLoop.java:382)
	at io.netty.channel.nio.NioEventLoop.run(NioEventLoop.java:354)
	at io.netty.util.concurrent.SingleThreadEventExecutor$2.run(SingleThreadEventExecutor.java:111)
	at java.lang.Thread.run(Thread.java:745)
Caused by: java.net.ConnectException: Connection refused: worker/127.0.1.1:44352
	at sun.nio.ch.SocketChannelImpl.checkConnect(Native Method)
	at sun.nio.ch.SocketChannelImpl.finishConnect(SocketChannelImpl.java:739)
	at io.netty.channel.socket.nio.NioSocketChannel.doFinishConnect(NioSocketChannel.java:224)
	at io.netty.channel.nio.AbstractNioChannel$AbstractNioUnsafe.finishConnect(AbstractNioChannel.java:281)

> Frequent task manager disconnects from JobManager
> -------------------------------------------------
>
>                 Key: FLINK-4650
>                 URL: https://issues.apache.org/jira/browse/FLINK-4650
>             Project: Flink
>          Issue Type: Bug
>          Components: Distributed Coordination, Network
>    Affects Versions: 1.2.0
>            Reporter: Nagarjun Guraja
>
> Not sure of the exact reason but we observe more frequent task manager disconnects while
> using 1.2 snapshot build as compared to 1.1.2 release build



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)
