From user-return-25154-archive-asf-public=cust-asf.ponee.io@flink.apache.org Fri Jan 4 14:27:18 2019
From: Till Rohrmann
Date: Fri, 4 Jan 2019 14:26:38 +0100
Subject: Re: ConnectTimeoutException when createPartitionRequestClient
To: Wenrui Meng
Cc: user, Konstantin@data-artisans.com
MIME-Version: 1.0
Content-Type: multipart/alternative; boundary="00000000000060bac3057ea1d624"

--00000000000060bac3057ea1d624
Content-Type: text/plain; charset="UTF-8"

Hi Wenrui,

from the logs I cannot spot anything suspicious. Which configuration
parameters have you changed exactly? Does the JobManager log contain
anything suspicious?

The current Flink version has changed quite a bit compared to 1.4, so it
might be worth running the job with the latest Flink version.

Cheers,
Till

On Thu, Jan 3, 2019 at 3:00 PM Wenrui Meng wrote:

> Hi,
>
> I consistently get a connection timeout when creating the
> partitionRequestClient in Flink 1.4. I tried to ping from the connecting
> host to the connected host, and the ping latency is consistently less than
> 0.1 ms, so it's probably not due to the cluster status. I also tried
> increasing the max backoff, the network timeout, and some other settings,
> but it didn't help.
>
> I enabled Flink's debug log but did not find any suspicious or useful
> information to help me fix the issue. Here is the link
> of the jobManager and taskManager logs. The connecting host is the host
> that throws the exception. The connected host is the host the connecting
> host tries to request the partition from.
>
> Since our platform is not up to date yet, the Flink version I used here
> is 1.4. But I noticed that this code has not changed much on the
> master branch. Any help will be appreciated.
>
> Here is the stack trace of the exception
>
> from RUNNING to FAILED.
> java.io.IOException: Connecting the channel failed: Connecting to remote
> task manager + 'athena485-sjc1/10.70.132.8:34185' has failed. This might
> indicate that the remote task manager has been lost.
>     at org.apache.flink.runtime.io.network.netty.PartitionRequestClientFactory$ConnectingChannel.waitForChannel(PartitionRequestClientFactory.java:197)
>     at org.apache.flink.runtime.io.network.netty.PartitionRequestClientFactory$ConnectingChannel.access$000(PartitionRequestClientFactory.java:132)
>     at org.apache.flink.runtime.io.network.netty.PartitionRequestClientFactory.createPartitionRequestClient(PartitionRequestClientFactory.java:84)
>     at org.apache.flink.runtime.io.network.netty.NettyConnectionManager.createPartitionRequestClient(NettyConnectionManager.java:59)
>     at org.apache.flink.runtime.io.network.partition.consumer.RemoteInputChannel.requestSubpartition(RemoteInputChannel.java:156)
>     at org.apache.flink.runtime.io.network.partition.consumer.SingleInputGate.requestPartitions(SingleInputGate.java:480)
>     at org.apache.flink.runtime.io.network.partition.consumer.SingleInputGate.getNextBufferOrEvent(SingleInputGate.java:502)
>     at org.apache.flink.streaming.runtime.io.BarrierTracker.getNextNonBlocked(BarrierTracker.java:93)
>     at org.apache.flink.streaming.runtime.io.StreamInputProcessor.processInput(StreamInputProcessor.java:214)
>     at org.apache.flink.streaming.runtime.tasks.OneInputStreamTask.run(OneInputStreamTask.java:69)
>     at org.apache.flink.streaming.runtime.tasks.StreamTask.invoke(StreamTask.java:264)
>     at org.apache.flink.runtime.taskmanager.Task.run(Task.java:718)
>     at java.lang.Thread.run(Thread.java:748)
> Caused by: org.apache.flink.runtime.io.network.netty.exception.RemoteTransportException: Connecting to remote task manager + 'athena485-sjc1/10.70.132.8:34185' has failed. This might indicate that the remote task manager has been lost.
>     at org.apache.flink.runtime.io.network.netty.PartitionRequestClientFactory$ConnectingChannel.operationComplete(PartitionRequestClientFactory.java:220)
>     at org.apache.flink.runtime.io.network.netty.PartitionRequestClientFactory$ConnectingChannel.operationComplete(PartitionRequestClientFactory.java:132)
>     at org.apache.flink.shaded.netty4.io.netty.util.concurrent.DefaultPromise.notifyListener0(DefaultPromise.java:680)
>     at org.apache.flink.shaded.netty4.io.netty.util.concurrent.DefaultPromise.notifyListeners0(DefaultPromise.java:603)
>     at org.apache.flink.shaded.netty4.io.netty.util.concurrent.DefaultPromise.notifyListeners(DefaultPromise.java:563)
>     at org.apache.flink.shaded.netty4.io.netty.util.concurrent.DefaultPromise.tryFailure(DefaultPromise.java:424)
>     at org.apache.flink.shaded.netty4.io.netty.channel.nio.AbstractNioChannel$AbstractNioUnsafe$1.run(AbstractNioChannel.java:214)
>     at org.apache.flink.shaded.netty4.io.netty.util.concurrent.PromiseTask$RunnableAdapter.call(PromiseTask.java:38)
>     at org.apache.flink.shaded.netty4.io.netty.util.concurrent.ScheduledFutureTask.run(ScheduledFutureTask.java:120)
>     at org.apache.flink.shaded.netty4.io.netty.util.concurrent.SingleThreadEventExecutor.runAllTasks(SingleThreadEventExecutor.java:357)
>     at org.apache.flink.shaded.netty4.io.netty.channel.nio.NioEventLoop.run(NioEventLoop.java:357)
>     at org.apache.flink.shaded.netty4.io.netty.util.concurrent.SingleThreadEventExecutor$2.run(SingleThreadEventExecutor.java:111)
>     ... 1 common frames omitted
> Caused by: org.apache.flink.shaded.netty4.io.netty.channel.ConnectTimeoutException: connection timed out: athena485-sjc1/10.70.132.8:34185
>     at org.apache.flink.shaded.netty4.io.netty.channel.nio.AbstractNioChannel$AbstractNioUnsafe$1.run(AbstractNioChannel.java:212)
>     ... 6 common frames omitted
>
> Thanks,
> Wenrui
>

--00000000000060bac3057ea1d624
Content-Type: text/html; charset="UTF-8"
Content-Transfer-Encoding: quoted-printable
--00000000000060bac3057ea1d624--
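
A note on the ping check mentioned in the thread: ping uses ICMP, so a low ping latency does not by itself show that a TCP connection to the TaskManager's data port can be opened. The following is only a minimal sketch of a TCP-level check that could be run from the connecting host. The class name TcpProbe, the method canConnect, and the 5 second timeout are hypothetical choices for illustration; none of them are part of Flink.

```java
import java.io.IOException;
import java.net.InetSocketAddress;
import java.net.Socket;

// Hypothetical helper, not part of Flink: checks whether a TCP connection
// to host:port can be established within a given timeout.
final class TcpProbe {

    /** Returns true if a TCP connection to host:port is established within timeoutMillis. */
    static boolean canConnect(String host, int port, int timeoutMillis) {
        try (Socket socket = new Socket()) {
            // Blocks until connected or until the timeout elapses.
            socket.connect(new InetSocketAddress(host, port), timeoutMillis);
            return true;
        } catch (IOException e) {
            // Covers connect timeouts, refused connections and unresolvable hosts.
            return false;
        }
    }

    public static void main(String[] args) {
        String host = args[0];
        int port = Integer.parseInt(args[1]);
        long start = System.nanoTime();
        boolean ok = canConnect(host, port, 5000); // 5 s is an arbitrary choice
        long elapsedMs = (System.nanoTime() - start) / 1_000_000;
        System.out.println(host + ":" + port
                + (ok ? " reachable" : " NOT reachable")
                + " after " + elapsedMs + " ms");
    }
}
```

Run from the connecting host against the address in the exception (for example `java TcpProbe athena485-sjc1 34185`) while the remote TaskManager is still up. If this probe also fails while ping succeeds, that would point towards a firewall, port, or listener problem rather than a Flink setting.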