From: Dominik Wosiński
Date: Thu, 17 Jan 2019 16:09:45 +0100
Subject: Re: Getting RemoteTransportException
To: Avi Levi
Cc: user

Hey,
As for the question about taskmanager.network.netty.server.numThreads: it is the size of the thread pool used by the Netty server. The default value is -1, which results in a thread pool whose size equals the number of task slots of your TaskManager.
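
A minimal sketch of the fallback rule described above, not Flink's actual source: the class and method names are made up for illustration, and the slot count stands for the value configured per TaskManager via taskmanager.numberOfTaskSlots.

```java
// Hypothetical illustration of how a configured value of -1 falls back
// to the number of task slots. Names here are assumptions, not Flink API.
final class NettyThreadSketch {

    // Sentinel meaning "not set explicitly; derive from the slot count".
    static final int USE_SLOT_COUNT = -1;

    // Returns the thread pool size that would be used for the Netty server.
    static int effectiveServerThreads(int configuredThreads, int numberOfSlots) {
        return configuredThreads == USE_SLOT_COUNT ? numberOfSlots : configuredThreads;
    }

    public static void main(String[] args) {
        // Default (-1) with 4 slots: the pool size follows the slot count.
        System.out.println(effectiveServerThreads(USE_SLOT_COUNT, 4));
        // Explicit value: the configured number is used as-is.
        System.out.println(effectiveServerThreads(8, 4));
    }
}
```

So with the default, increasing the number of slots also increases the number of Netty server threads; setting an explicit value decouples the two.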

Best Regards,
Dom.

On Thu, 17 Jan 2019 at 00:52, Avi Levi <avi.levi@bluevoyant.com> wrote:

Hi Guys,

We ran some load tests and got the exception below. I saw that the JobManager was restarted. If I understood correctly, it will get a new job ID and the state will be lost. Is that correct? How is the state handled when HA is set up as described here? What actually happens to the state (keyed state using RocksDB) if one of the JobManagers crashes?


One of the properties that might be relevant to this exception is taskmanager.network.netty.server.numThreads, with a default value of -1. What does this default value actually mean? And should it be set to a different value according to the number of cores?


Thanks for your advice.

Avi



org.apache.flink.runtime.io.network.netty.exception.RemoteTransportException: Lost connection to task manager 'xxxx:1234'. This indicates that the remote task manager was lost.
    at org.apache.flink.runtime.io.network.netty.CreditBasedPartitionRequestClientHandler.exceptionCaught(CreditBasedPartitionRequestClientHandler.java:160)
    at org.apache.flink.shaded.netty4.io.netty.channel.AbstractChannelHandlerContext.invokeExceptionCaught(AbstractChannelHandlerContext.java:285)
    at org.apache.flink.shaded.netty4.io.netty.channel.AbstractChannelHandlerContext.invokeExceptionCaught(AbstractChannelHandlerContext.java:264)
    at org.apache.flink.shaded.netty4.io.netty.channel.AbstractChannelHandlerContext.fireExceptionCaught(AbstractChannelHandlerContext.java:256)
    at org.apache.flink.shaded.netty4.io.netty.channel.ChannelInboundHandlerAdapter.exceptionCaught(ChannelInboundHandlerAdapter.java:131)
    at org.apache.flink.shaded.netty4.io.netty.channel.AbstractChannelHandlerContext.invokeExceptionCaught(AbstractChannelHandlerContext.java:285)
    at org.apache.flink.shaded.netty4.io.netty.channel.AbstractChannelHandlerContext.invokeExceptionCaught(AbstractChannelHandlerContext.java:264)
    at org.apache.flink.shaded.netty4.io.netty.channel.AbstractChannelHandlerContext.fireExceptionCaught(AbstractChannelHandlerContext.java:256)
    at org.apache.flink.shaded.netty4.io.netty.channel.ChannelHandlerAdapter.exceptionCaught(ChannelHandlerAdapter.java:87)
    at org.apache.flink.shaded.netty4.io.netty.channel.AbstractChannelHandlerContext.invokeExceptionCaught(AbstractChannelHandlerContext.java:285)
    at org.apache.flink.shaded.netty4.io.netty.channel.AbstractChannelHandlerContext.invokeExceptionCaught(AbstractChannelHandlerContext.java:264)
    at org.apache.flink.shaded.netty4.io.netty.channel.AbstractChannelHandlerContext.fireExceptionCaught(AbstractChannelHandlerContext.java:256)
    at org.apache.flink.shaded.netty4.io.netty.channel.DefaultChannelPipeline$HeadContext.exceptionCaught(DefaultChannelPipeline.java:1401)
    at org.apache.flink.shaded.netty4.io.netty.channel.AbstractChannelHandlerContext.invokeExceptionCaught(AbstractChannelHandlerContext.java:285)
    at org.apache.flink.shaded.netty4.io.netty.channel.AbstractChannelHandlerContext.invokeExceptionCaught(AbstractChannelHandlerContext.java:264)
    at org.apache.flink.shaded.netty4.io.netty.channel.DefaultChannelPipeline.fireExceptionCaught(DefaultChannelPipeline.java:953)
    at org.apache.flink.shaded.netty4.io.netty.channel.nio.AbstractNioByteChannel$NioByteUnsafe.handleReadException(AbstractNioByteChannel.java:125)
    at org.apache.flink.shaded.netty4.io.netty.channel.nio.AbstractNioByteChannel$NioByteUnsafe.read(AbstractNioByteChannel.java:174)
    at org.apache.flink.shaded.netty4.io.netty.channel.nio.NioEventLoop.processSelectedKey(NioEventLoop.java:645)
    at org.apache.flink.shaded.netty4.io.netty.channel.nio.NioEventLoop.processSelectedKeysOptimized(NioEventLoop.java:580)
    at org.apache.flink.shaded.netty4.io.netty.channel.nio.NioEventLoop.processSelectedKeys(NioEventLoop.java:497)
    at org.apache.flink.shaded.netty4.io.netty.channel.nio.NioEventLoop.run(NioEventLoop.java:459)
    at org.apache.flink.shaded.netty4.io.netty.util.concurrent.SingleThreadEventExecutor$5.run(SingleThreadEventExecutor.java:884)
    at java.lang.Thread.run(Thread.java:748)
Caused by: java.io.IOException: Connection reset by peer
    at sun.nio.ch.FileDispatcherImpl.read0(Native Method)
    at sun.nio.ch.SocketDispatcher.read(SocketDispatcher.java:39)
    at sun.nio.ch.IOUtil.readIntoNativeBuffer(IOUtil.java:223)
    at sun.nio.ch.IOUtil.read(IOUtil.java:192)
    at sun.nio.ch.SocketChannelImpl.read(SocketChannelImpl.java:380)
    at org.apache.flink.shaded.netty4.io.netty.buffer.PooledUnsafeDirectByteBuf.setBytes(PooledUnsafeDirectByteBuf.java:288)
    at org.apache.flink.shaded.netty4.io.netty.buffer.AbstractByteBuf.writeBytes(AbstractByteBuf.java:1108)
    at org.apache.flink.shaded.netty4.io.netty.channel.socket.nio.NioSocketChannel.doReadBytes(NioSocketChannel.java:345)
    at org.apache.flink.shaded.netty4.io.netty.channel.nio.AbstractNioByteChannel$NioByteUnsafe.read(AbstractNioByteChannel.java:148)
    ... 6 more
