flink-user mailing list archives

Site index · List index
Message view « Date » · « Thread »
Top « Date » · « Thread »
From Jamie Grier <jgr...@lyft.com>
Subject Re: Getting RemoteTransportException
Date Thu, 17 Jan 2019 19:39:35 GMT
Avi,

The stack trace there is pretty much a red herring.  That happens whenever
a job shuts down for any reason and is not a root cause.  To diagnose this
you will want to look at all the TaskManager logs as well as the JobManager
logs.  If you have a way to easily grep these (all of them at once) I would
search for a string like "to FAILED" on the taskmanagers and look at those
error lines and stacktraces.

Don't be misled by the exception reported in the Flink UI.  It OFTEN isn't
the true root cause but it's a hard problem to solve.  You have to look at
the TaskManager logs to really be sure.

The taskmanager.network.netty.server.numThreads
<https://ci.apache.org/projects/flink/flink-docs-stable/ops/config.html#taskmanager-network-netty-server-numthreads>
is
also a red herring.  I would leave that alone.

Finally, if you have HA and checkpointing setup correctly you will not lose
any state even in the case of losing a JobManager.  The job will
auto-recover as soon as a new JobManager becomes available.

I hope that helps.

-Jamie



On Thu, Jan 17, 2019 at 7:10 AM Dominik Wosiński <wossyn@gmail.com> wrote:

> *Hey,*
> As for the question about  taskmanager.network.netty.server.numThreads
> <https://ci.apache.org/projects/flink/flink-docs-stable/ops/config.html#taskmanager-network-netty-server-numthreads>.
> It is the size of the thread pool that will be used by the netty server.
> The default value is -1, which will result in the thread pool with size
> equal to the number of task slots for your JobManager.
>
> Best Regards,
> Dom.
>
> czw., 17 sty 2019 o 00:52 Avi Levi <avi.levi@bluevoyant.com> napisał(a):
>
>> Hi Guys,
>>
>> We done some load tests and we got the exception below, I saw that the
>> JobManager was restarted, If I understood correctly, it will get new job id
>> and the state will lost - is that correct? how the state is handled setting
>> HA as described here
>> <https://ci.apache.org/projects/flink/flink-docs-release-1.7/ops/jobmanager_high_availability.html>,
what
>> actually happens to the state if one of the job manager crashes (keyed
>> state using rocks db) ?
>>
>>
>> One of the property that might be relevant to this exception is
>> taskmanager.network.netty.server.numThreads
>> <https://ci.apache.org/projects/flink/flink-docs-stable/ops/config.html#taskmanager-network-netty-server-numthreads>
with
>> a default value of -1 - what is this default value actually means?  and
>> should it be set to different value according to #cores?
>>
>>
>> Thanks for your advice .
>>
>> Avi
>>
>>
>>
>> org.apache.flink.runtime.io.network.netty.exception.RemoteTransportException:
>> Lost connection to task manager 'xxxx:1234'. This indicates that the remote
>> task manager was lost.
>>
>> at
>> org.apache.flink.runtime.io.network.netty.CreditBasedPartitionRequestClientHandler.exceptionCaught(CreditBasedPartitionRequestClientHandler.java:160)
>>
>> at
>> org.apache.flink.shaded.netty4.io.netty.channel.AbstractChannelHandlerContext.invokeExceptionCaught(AbstractChannelHandlerContext.java:285)
>>
>> at
>> org.apache.flink.shaded.netty4.io.netty.channel.AbstractChannelHandlerContext.invokeExceptionCaught(AbstractChannelHandlerContext.java:264)
>>
>> at
>> org.apache.flink.shaded.netty4.io.netty.channel.AbstractChannelHandlerContext.fireExceptionCaught(AbstractChannelHandlerContext.java:256)
>>
>> at
>> org.apache.flink.shaded.netty4.io.netty.channel.ChannelInboundHandlerAdapter.exceptionCaught(ChannelInboundHandlerAdapter.java:131)
>>
>> at
>> org.apache.flink.shaded.netty4.io.netty.channel.AbstractChannelHandlerContext.invokeExceptionCaught(AbstractChannelHandlerContext.java:285)
>>
>> at
>> org.apache.flink.shaded.netty4.io.netty.channel.AbstractChannelHandlerContext.invokeExceptionCaught(AbstractChannelHandlerContext.java:264)
>>
>> at
>> org.apache.flink.shaded.netty4.io.netty.channel.AbstractChannelHandlerContext.fireExceptionCaught(AbstractChannelHandlerContext.java:256)
>>
>> at
>> org.apache.flink.shaded.netty4.io.netty.channel.ChannelHandlerAdapter.exceptionCaught(ChannelHandlerAdapter.java:87)
>>
>> at
>> org.apache.flink.shaded.netty4.io.netty.channel.AbstractChannelHandlerContext.invokeExceptionCaught(AbstractChannelHandlerContext.java:285)
>>
>> at
>> org.apache.flink.shaded.netty4.io.netty.channel.AbstractChannelHandlerContext.invokeExceptionCaught(AbstractChannelHandlerContext.java:264)
>>
>> at
>> org.apache.flink.shaded.netty4.io.netty.channel.AbstractChannelHandlerContext.fireExceptionCaught(AbstractChannelHandlerContext.java:256)
>>
>> at
>> org.apache.flink.shaded.netty4.io.netty.channel.DefaultChannelPipeline$HeadContext.exceptionCaught(DefaultChannelPipeline.java:1401)
>>
>> at
>> org.apache.flink.shaded.netty4.io.netty.channel.AbstractChannelHandlerContext.invokeExceptionCaught(AbstractChannelHandlerContext.java:285)
>>
>> at
>> org.apache.flink.shaded.netty4.io.netty.channel.AbstractChannelHandlerContext.invokeExceptionCaught(AbstractChannelHandlerContext.java:264)
>>
>> at
>> org.apache.flink.shaded.netty4.io.netty.channel.DefaultChannelPipeline.fireExceptionCaught(DefaultChannelPipeline.java:953)
>>
>> at
>> org.apache.flink.shaded.netty4.io.netty.channel.nio.AbstractNioByteChannel$NioByteUnsafe.handleReadException(AbstractNioByteChannel.java:125)
>>
>> at
>> org.apache.flink.shaded.netty4.io.netty.channel.nio.AbstractNioByteChannel$NioByteUnsafe.read(AbstractNioByteChannel.java:174)
>>
>> at
>> org.apache.flink.shaded.netty4.io.netty.channel.nio.NioEventLoop.processSelectedKey(NioEventLoop.java:645)
>>
>> at
>> org.apache.flink.shaded.netty4.io.netty.channel.nio.NioEventLoop.processSelectedKeysOptimized(NioEventLoop.java:580)
>>
>> at
>> org.apache.flink.shaded.netty4.io.netty.channel.nio.NioEventLoop.processSelectedKeys(NioEventLoop.java:497)
>>
>> at
>> org.apache.flink.shaded.netty4.io.netty.channel.nio.NioEventLoop.run(NioEventLoop.java:459)
>>
>> at
>> org.apache.flink.shaded.netty4.io.netty.util.concurrent.SingleThreadEventExecutor$5.run(SingleThreadEventExecutor.java:884)
>>
>> at java.lang.Thread.run(Thread.java:748)
>>
>> Caused by: java.io.IOException: Connection reset by peer
>>
>> at sun.nio.ch.FileDispatcherImpl.read0(Native Method)
>>
>> at sun.nio.ch.SocketDispatcher.read(SocketDispatcher.java:39)
>>
>> at sun.nio.ch.IOUtil.readIntoNativeBuffer(IOUtil.java:223)
>>
>> at sun.nio.ch.IOUtil.read(IOUtil.java:192)
>>
>> at sun.nio.ch.SocketChannelImpl.read(SocketChannelImpl.java:380)
>>
>> at
>> org.apache.flink.shaded.netty4.io.netty.buffer.PooledUnsafeDirectByteBuf.setBytes(PooledUnsafeDirectByteBuf.java:288)
>>
>> at
>> org.apache.flink.shaded.netty4.io.netty.buffer.AbstractByteBuf.writeBytes(AbstractByteBuf.java:1108)
>>
>> at
>> org.apache.flink.shaded.netty4.io.netty.channel.socket.nio.NioSocketChannel.doReadBytes(NioSocketChannel.java:345)
>>
>> at
>> org.apache.flink.shaded.netty4.io.netty.channel.nio.AbstractNioByteChannel$NioByteUnsafe.read(AbstractNioByteChannel.java:148)
>>
>> ... 6 more
>>
>

Mime
View raw message