flink-user mailing list archives

Site index · List index
Message view « Date » · « Thread »
Top « Date » · « Thread »
From Avi Levi <avi.l...@bluevoyant.com>
Subject Re: Getting RemoteTransportException, HA mode
Date Sun, 20 Jan 2019 08:49:40 GMT
Thanks! regarding the HA. So the Job will auto recover after a crash
understood. Is that true also when deploying new version? easy as  simply
canceling one job, updating the version and once it is up & running do the
same to the other one ? is there anything that is should be highlighted
when working in HA mode (e.g configuration, machines setup besides what is
written in the Job manager HA documentation
<https://ci.apache.org/projects/flink/flink-docs-release-1.7/ops/jobmanager_high_availability.html>
) ?

On Thu, Jan 17, 2019 at 9:39 PM Jamie Grier <jgrier@lyft.com> wrote:

> Avi,
>
> The stack trace there is pretty much a red herring.  That happens whenever
> a job shuts down for any reason and is not a root cause.  To diagnose this
> you will want to look at all the TaskManager logs as well as the JobManager
> logs.  If you have a way to easily grep these (all of them at once) I would
> search for a string like "to FAILED" on the taskmanagers and look at those
> error lines and stacktraces.
>
> Don't be misled by the exception reported in the Flink UI.  It OFTEN isn't
> the true root cause but it's a hard problem to solve.  You have to look at
> the TaskManager logs to really be sure.
>
> The taskmanager.network.netty.server.numThreads
> <https://ci.apache.org/projects/flink/flink-docs-stable/ops/config.html#taskmanager-network-netty-server-numthreads>
is
> also a red herring.  I would leave that alone.
>
> Finally, if you have HA and checkpointing setup correctly you will not
> lose any state even in the case of losing a JobManager.  The job will
> auto-recover as soon as a new JobManager becomes available.
>
> I hope that helps.
>
> -Jamie
>
>
>
> On Thu, Jan 17, 2019 at 7:10 AM Dominik Wosiński <wossyn@gmail.com> wrote:
>
>> *Hey,*
>> As for the question about  taskmanager.network.netty.server.numThreads
>> <https://ci.apache.org/projects/flink/flink-docs-stable/ops/config.html#taskmanager-network-netty-server-numthreads>.
>> It is the size of the thread pool that will be used by the netty server.
>> The default value is -1, which will result in the thread pool with size
>> equal to the number of task slots for your JobManager.
>>
>> Best Regards,
>> Dom.
>>
>> czw., 17 sty 2019 o 00:52 Avi Levi <avi.levi@bluevoyant.com> napisał(a):
>>
>>> Hi Guys,
>>>
>>> We done some load tests and we got the exception below, I saw that the
>>> JobManager was restarted, If I understood correctly, it will get new job id
>>> and the state will lost - is that correct? how the state is handled setting
>>> HA as described here
>>> <https://ci.apache.org/projects/flink/flink-docs-release-1.7/ops/jobmanager_high_availability.html>,
what
>>> actually happens to the state if one of the job manager crashes (keyed
>>> state using rocks db) ?
>>>
>>>
>>> One of the property that might be relevant to this exception is
>>> taskmanager.network.netty.server.numThreads
>>> <https://ci.apache.org/projects/flink/flink-docs-stable/ops/config.html#taskmanager-network-netty-server-numthreads>
with
>>> a default value of -1 - what is this default value actually means?  and
>>> should it be set to different value according to #cores?
>>>
>>>
>>> Thanks for your advice .
>>>
>>> Avi
>>>
>>>
>>>
>>> org.apache.flink.runtime.io.network.netty.exception.RemoteTransportException:
>>> Lost connection to task manager 'xxxx:1234'. This indicates that the remote
>>> task manager was lost.
>>>
>>> at
>>> org.apache.flink.runtime.io.network.netty.CreditBasedPartitionRequestClientHandler.exceptionCaught(CreditBasedPartitionRequestClientHandler.java:160)
>>>
>>> at
>>> org.apache.flink.shaded.netty4.io.netty.channel.AbstractChannelHandlerContext.invokeExceptionCaught(AbstractChannelHandlerContext.java:285)
>>>
>>> at
>>> org.apache.flink.shaded.netty4.io.netty.channel.AbstractChannelHandlerContext.invokeExceptionCaught(AbstractChannelHandlerContext.java:264)
>>>
>>> at
>>> org.apache.flink.shaded.netty4.io.netty.channel.AbstractChannelHandlerContext.fireExceptionCaught(AbstractChannelHandlerContext.java:256)
>>>
>>> at
>>> org.apache.flink.shaded.netty4.io.netty.channel.ChannelInboundHandlerAdapter.exceptionCaught(ChannelInboundHandlerAdapter.java:131)
>>>
>>> at
>>> org.apache.flink.shaded.netty4.io.netty.channel.AbstractChannelHandlerContext.invokeExceptionCaught(AbstractChannelHandlerContext.java:285)
>>>
>>> at
>>> org.apache.flink.shaded.netty4.io.netty.channel.AbstractChannelHandlerContext.invokeExceptionCaught(AbstractChannelHandlerContext.java:264)
>>>
>>> at
>>> org.apache.flink.shaded.netty4.io.netty.channel.AbstractChannelHandlerContext.fireExceptionCaught(AbstractChannelHandlerContext.java:256)
>>>
>>> at
>>> org.apache.flink.shaded.netty4.io.netty.channel.ChannelHandlerAdapter.exceptionCaught(ChannelHandlerAdapter.java:87)
>>>
>>> at
>>> org.apache.flink.shaded.netty4.io.netty.channel.AbstractChannelHandlerContext.invokeExceptionCaught(AbstractChannelHandlerContext.java:285)
>>>
>>> at
>>> org.apache.flink.shaded.netty4.io.netty.channel.AbstractChannelHandlerContext.invokeExceptionCaught(AbstractChannelHandlerContext.java:264)
>>>
>>> at
>>> org.apache.flink.shaded.netty4.io.netty.channel.AbstractChannelHandlerContext.fireExceptionCaught(AbstractChannelHandlerContext.java:256)
>>>
>>> at
>>> org.apache.flink.shaded.netty4.io.netty.channel.DefaultChannelPipeline$HeadContext.exceptionCaught(DefaultChannelPipeline.java:1401)
>>>
>>> at
>>> org.apache.flink.shaded.netty4.io.netty.channel.AbstractChannelHandlerContext.invokeExceptionCaught(AbstractChannelHandlerContext.java:285)
>>>
>>> at
>>> org.apache.flink.shaded.netty4.io.netty.channel.AbstractChannelHandlerContext.invokeExceptionCaught(AbstractChannelHandlerContext.java:264)
>>>
>>> at
>>> org.apache.flink.shaded.netty4.io.netty.channel.DefaultChannelPipeline.fireExceptionCaught(DefaultChannelPipeline.java:953)
>>>
>>> at
>>> org.apache.flink.shaded.netty4.io.netty.channel.nio.AbstractNioByteChannel$NioByteUnsafe.handleReadException(AbstractNioByteChannel.java:125)
>>>
>>> at
>>> org.apache.flink.shaded.netty4.io.netty.channel.nio.AbstractNioByteChannel$NioByteUnsafe.read(AbstractNioByteChannel.java:174)
>>>
>>> at
>>> org.apache.flink.shaded.netty4.io.netty.channel.nio.NioEventLoop.processSelectedKey(NioEventLoop.java:645)
>>>
>>> at
>>> org.apache.flink.shaded.netty4.io.netty.channel.nio.NioEventLoop.processSelectedKeysOptimized(NioEventLoop.java:580)
>>>
>>> at
>>> org.apache.flink.shaded.netty4.io.netty.channel.nio.NioEventLoop.processSelectedKeys(NioEventLoop.java:497)
>>>
>>> at
>>> org.apache.flink.shaded.netty4.io.netty.channel.nio.NioEventLoop.run(NioEventLoop.java:459)
>>>
>>> at
>>> org.apache.flink.shaded.netty4.io.netty.util.concurrent.SingleThreadEventExecutor$5.run(SingleThreadEventExecutor.java:884)
>>>
>>> at java.lang.Thread.run(Thread.java:748)
>>>
>>> Caused by: java.io.IOException: Connection reset by peer
>>>
>>> at sun.nio.ch.FileDispatcherImpl.read0(Native Method)
>>>
>>> at sun.nio.ch.SocketDispatcher.read(SocketDispatcher.java:39)
>>>
>>> at sun.nio.ch.IOUtil.readIntoNativeBuffer(IOUtil.java:223)
>>>
>>> at sun.nio.ch.IOUtil.read(IOUtil.java:192)
>>>
>>> at sun.nio.ch.SocketChannelImpl.read(SocketChannelImpl.java:380)
>>>
>>> at
>>> org.apache.flink.shaded.netty4.io.netty.buffer.PooledUnsafeDirectByteBuf.setBytes(PooledUnsafeDirectByteBuf.java:288)
>>>
>>> at
>>> org.apache.flink.shaded.netty4.io.netty.buffer.AbstractByteBuf.writeBytes(AbstractByteBuf.java:1108)
>>>
>>> at
>>> org.apache.flink.shaded.netty4.io.netty.channel.socket.nio.NioSocketChannel.doReadBytes(NioSocketChannel.java:345)
>>>
>>> at
>>> org.apache.flink.shaded.netty4.io.netty.channel.nio.AbstractNioByteChannel$NioByteUnsafe.read(AbstractNioByteChannel.java:148)
>>>
>>> ... 6 more
>>>
>>

Mime
View raw message