flink-user mailing list archives

From Dinesh J <dineshj...@gmail.com>
Subject Re: Issue with single job yarn flink cluster HA
Date Fri, 03 Apr 2020 07:45:06 GMT
Hi Andrey,
Sure, we will try Flink 1.10 to see whether the HA issues we are facing are
fixed, and will update this thread.

Thanks,
Dinesh

On Thu, Apr 2, 2020 at 3:22 PM Andrey Zagrebin <azagrebin@apache.org> wrote:

> Hi Dinesh,
>
> Thanks for sharing the logs. There have been a couple of HA fixes since 1.7,
> e.g. [1] and [2].
> I would suggest trying Flink 1.10.
> If the problem persists, could you also find the logs of the failed Job
> Manager from before the failover?
>
> Best,
> Andrey
>
> [1] https://jira.apache.org/jira/browse/FLINK-14316
> [2] https://jira.apache.org/jira/browse/FLINK-11843
>
> On Tue, Mar 31, 2020 at 6:49 AM Dinesh J <dineshj.86@gmail.com> wrote:
>
>> Hi Yang,
>> I am attaching one full jobmanager log for a job which I reran today.
>> This is a job that tries to read from a savepoint.
>> The same "leader election ongoing" error message is displayed, and it stays
>> the same even after 30 minutes. If I leave the job running without a yarn
>> kill, it stays that way forever.
>> Based on your suggestions so far, I guess it might be some ZooKeeper
>> problem. If that is the case, what can I look out for in ZooKeeper to figure
>> out the issue?
>>
>> Thanks,
>> Dinesh
>>
>>
>> On Tue, Mar 31, 2020 at 7:42 AM Yang Wang <danrtsey.wy@gmail.com> wrote:
>>
>>> I do not think your problem is related to the Akka timeouts. Increasing
>>> the timeouts can help in a heavily loaded cluster, especially when the
>>> network is unreliable. However, that is not your case.
>>>
>>> I am not sure what you mean by "never recovers". Do you mean that the
>>> "Connection refused" logs keep repeating and no other logs appear? How
>>> long does it stay in "leader election ongoing"?
>>> Usually it takes at most 60s, because if the old jobmanager crashed, it
>>> loses the leadership after the ZooKeeper session timeout. So if the new
>>> jobmanager is never granted the leadership, it may be caused by some
>>> problem with ZooKeeper.
>>>
>>> Could you share the complete jobmanager logs so that we can see what is
>>> happening in the jobmanager?
>>>
>>>
>>> Best,
>>> Yang
>>>
>>>
>>> On Tue, Mar 31, 2020 at 3:46 AM Dinesh J <dineshj.86@gmail.com> wrote:
>>>
>>>> Hi Yang,
>>>> Thanks for the clarification and suggestion. But my problem is that
>>>> recovery never happens and the "leader election ongoing" message is
>>>> displayed forever.
>>>> Do you think increasing akka.ask.timeout and akka.tcp.timeout will help
>>>> on a heavily loaded cluster, since this issue happens mainly when the
>>>> cluster is under heavy load?
>>>>
>>>> Best,
>>>> Dinesh
>>>>
>>>> On Mon, Mar 30, 2020 at 2:29 PM Yang Wang <danrtsey.wy@gmail.com>
>>>> wrote:
>>>>
>>>>> Hi Dinesh,
>>>>>
>>>>> First, I think the error message you provided is not a problem in
>>>>> itself. It just indicates that the leader election is still ongoing.
>>>>> When it finishes, the new leader will start a new dispatcher to provide
>>>>> the web UI and REST service.
>>>>>
>>>>> From "Connection refused: host1/ipaddress1:28681" in your jobmanager
>>>>> logs, we can tell that the old jobmanager has failed. When a new
>>>>> jobmanager starts, the old jobmanager still holds the lock of the leader
>>>>> latch, so Flink tries to connect to it. After a few attempts, since the
>>>>> old jobmanager's ZooKeeper client no longer updates the leader latch,
>>>>> the new jobmanager wins the election and becomes the active leader.
>>>>> That is just how the leader election works.
>>>>>
>>>>> In a nutshell, the root cause is that the old jobmanager crashed and
>>>>> does not lose the leadership immediately.
>>>>> This behavior is by design.
>>>>>
>>>>> If you really want to make the recovery faster, I think you could
>>>>> decrease "high-availability.zookeeper.client.connection-timeout"
>>>>> and "high-availability.zookeeper.client.session-timeout". Please keep
>>>>> in mind that values that are too small can also cause unexpected
>>>>> failovers when there are network problems.
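>>>>>
>>>>> As a rough sketch, lowering both options in flink-conf.yaml could look
>>>>> like the fragment below. The values are illustrative assumptions, not
>>>>> recommendations. The documented defaults are believed to be 60000 ms for
>>>>> the session timeout and 15000 ms for the connection timeout, so please
>>>>> verify them against the configuration reference for your Flink version.
>>>>> The ZooKeeper server also bounds the session timeout it will accept, so
>>>>> the effective value may differ from the requested one.
>>>>>
>>>>> ```yaml
>>>>> # Illustrative values only (assumptions); both are in milliseconds.
>>>>> # A shorter session timeout lets a crashed leader lose leadership sooner,
>>>>> # at the cost of more spurious failovers on a flaky network.
>>>>> high-availability.zookeeper.client.session-timeout: 30000
>>>>> high-availability.zookeeper.client.connection-timeout: 10000
>>>>> ```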
>>>>>
>>>>>
>>>>> Best,
>>>>> Yang
>>>>>
>>>>> On Wed, Mar 25, 2020 at 4:20 PM Dinesh J <dineshj.86@gmail.com> wrote:
>>>>>
>>>>>> Hi Andrey,
>>>>>> Yes. Sometimes the job does not restart after the current leader
>>>>>> fails.
>>>>>> Below is the message displayed when trying to reach the application
>>>>>> master URL via the YARN UI. The message remains the same even if the
>>>>>> YARN job has been running for 2 days.
>>>>>> During this time, the current YARN application attempt does not fail
>>>>>> either, and no containers are launched for the jobmanager or
>>>>>> taskmanager.
>>>>>>
>>>>>> *{"errors":["Service temporarily unavailable due to an ongoing leader
>>>>>> election. Please refresh."]}*
>>>>>>
>>>>>> Thanks,
>>>>>> Dinesh
>>>>>>
>>>>>> On Tue, Mar 24, 2020 at 6:45 PM Andrey Zagrebin <azagrebin@apache.org>
>>>>>> wrote:
>>>>>>
>>>>>>> Hi Dinesh,
>>>>>>>
>>>>>>> If the current leader crashes (e.g. due to network failures), then
>>>>>>> getting these messages during the leader re-election does not look
>>>>>>> like a problem.
>>>>>>> They look to me just like warnings about what caused the failover.
>>>>>>>
>>>>>>> Do you observe any problem with your application? Does the failover
>>>>>>> not work, e.g. no leader is elected or a job is not restarted after
>>>>>>> the current leader failure?
>>>>>>>
>>>>>>> Best,
>>>>>>> Andrey
>>>>>>>
>>>>>>> On Sun, Mar 22, 2020 at 11:14 AM Dinesh J <dineshj.86@gmail.com>
>>>>>>> wrote:
>>>>>>>
>>>>>>>> Attaching the job manager log for reference.
>>>>>>>>
>>>>>>>> 2020-03-22 11:39:02,693 WARN  org.apache.flink.runtime.webmonitor.retriever.impl.RpcGatewayRetriever - Error while retrieving the leader gateway. Retrying to connect to akka.tcp://flink@host1:28681/user/dispatcher.
>>>>>>>> 2020-03-22 11:39:02,724 WARN  akka.remote.transport.netty.NettyTransport - Remote connection to [null] failed with java.net.ConnectException: Connection refused: host1/ipaddress1:28681
>>>>>>>> 2020-03-22 11:39:02,724 WARN  akka.remote.ReliableDeliverySupervisor - Association with remote system [akka.tcp://flink@host1:28681] has failed, address is now gated for [50] ms. Reason: [Association failed with [akka.tcp://flink@host1:28681]] Caused by: [Connection refused: host1/ipaddress1:28681]
>>>>>>>> 2020-03-22 11:39:02,791 WARN  akka.remote.transport.netty.NettyTransport - Remote connection to [null] failed with java.net.ConnectException: Connection refused: host1/ipaddress1:28681
>>>>>>>> 2020-03-22 11:39:02,792 WARN  akka.remote.ReliableDeliverySupervisor - Association with remote system [akka.tcp://flink@host1:28681] has failed, address is now gated for [50] ms. Reason: [Association failed with [akka.tcp://flink@host1:28681]] Caused by: [Connection refused: host1/ipaddress1:28681]
>>>>>>>> 2020-03-22 11:39:02,861 WARN  akka.remote.transport.netty.NettyTransport - Remote connection to [null] failed with java.net.ConnectException: Connection refused: host1/ipaddress1:28681
>>>>>>>> 2020-03-22 11:39:02,861 WARN  akka.remote.ReliableDeliverySupervisor - Association with remote system [akka.tcp://flink@host1:28681] has failed, address is now gated for [50] ms. Reason: [Association failed with [akka.tcp://flink@host1:28681]] Caused by: [Connection refused: host1/ipaddress1:28681]
>>>>>>>> 2020-03-22 11:39:02,931 WARN  akka.remote.transport.netty.NettyTransport - Remote connection to [null] failed with java.net.ConnectException: Connection refused: host1/ipaddress1:28681
>>>>>>>> 2020-03-22 11:39:02,931 WARN  akka.remote.ReliableDeliverySupervisor - Association with remote system [akka.tcp://flink@host1:28681] has failed, address is now gated for [50] ms. Reason: [Association failed with [akka.tcp://flink@host1:28681]] Caused by: [Connection refused: host1/ipaddress1:28681]
>>>>>>>> 2020-03-22 11:39:03,001 WARN  akka.remote.transport.netty.NettyTransport - Remote connection to [null] failed with java.net.ConnectException: Connection refused: host1/ipaddress1:28681
>>>>>>>> 2020-03-22 11:39:03,002 WARN  akka.remote.ReliableDeliverySupervisor - Association with remote system [akka.tcp://flink@host1:28681] has failed, address is now gated for [50] ms. Reason: [Association failed with [akka.tcp://flink@host1:28681]] Caused by: [Connection refused: host1/ipaddress1:28681]
>>>>>>>> 2020-03-22 11:39:03,071 WARN  akka.remote.transport.netty.NettyTransport - Remote connection to [null] failed with java.net.ConnectException: Connection refused: host1/ipaddress1:28681
>>>>>>>> 2020-03-22 11:39:03,071 WARN  akka.remote.ReliableDeliverySupervisor - Association with remote system [akka.tcp://flink@host1:28681] has failed, address is now gated for [50] ms. Reason: [Association failed with [akka.tcp://flink@host1:28681]] Caused by: [Connection refused: host1/ipaddress1:28681]
>>>>>>>> 2020-03-22 11:39:03,141 WARN  akka.remote.transport.netty.NettyTransport - Remote connection to [null] failed with java.net.ConnectException: Connection refused: host1/ipaddress1:28681
>>>>>>>> 2020-03-22 11:39:03,141 WARN  akka.remote.ReliableDeliverySupervisor - Association with remote system [akka.tcp://flink@host1:28681] has failed, address is now gated for [50] ms. Reason: [Association failed with [akka.tcp://flink@host1:28681]] Caused by: [Connection refused: host1/ipaddress1:28681]
>>>>>>>> 2020-03-22 11:39:03,211 WARN  akka.remote.transport.netty.NettyTransport - Remote connection to [null] failed with java.net.ConnectException: Connection refused: host1/ipaddress1:28681
>>>>>>>> 2020-03-22 11:39:03,211 WARN  akka.remote.ReliableDeliverySupervisor - Association with remote system [akka.tcp://flink@host1:28681] has failed, address is now gated for [50] ms. Reason: [Association failed with [akka.tcp://flink@host1:28681]] Caused by: [Connection refused: host1/ipaddress1:28681]
>>>>>>>> 2020-03-22 11:39:03,281 WARN  akka.remote.transport.netty.NettyTransport - Remote connection to [null] failed with java.net.ConnectException: Connection refused: host1/ipaddress1:28681
>>>>>>>> 2020-03-22 11:39:03,282 WARN  akka.remote.ReliableDeliverySupervisor - Association with remote system [akka.tcp://flink@host1:28681] has failed, address is now gated for [50] ms. Reason: [Association failed with [akka.tcp://flink@host1:28681]] Caused by: [Connection refused: host1/ipaddress1:28681]
>>>>>>>> 2020-03-22 11:39:03,351 WARN  akka.remote.transport.netty.NettyTransport - Remote connection to [null] failed with java.net.ConnectException: Connection refused: host1/ipaddress1:28681
>>>>>>>> 2020-03-22 11:39:03,351 WARN  akka.remote.ReliableDeliverySupervisor - Association with remote system [akka.tcp://flink@host1:28681] has failed, address is now gated for [50] ms. Reason: [Association failed with [akka.tcp://flink@host1:28681]] Caused by: [Connection refused: host1/ipaddress1:28681]
>>>>>>>> 2020-03-22 11:39:03,421 WARN  akka.remote.transport.netty.NettyTransport - Remote connection to [null] failed with java.net.ConnectException: Connection refused: host1/ipaddress1:28681
>>>>>>>> 2020-03-22 11:39:03,421 WARN  akka.remote.ReliableDeliverySupervisor - Association with remote system [akka.tcp://flink@host1:28681] has failed, address is now gated for [50] ms. Reason: [Association failed with [akka.tcp://flink@host1:28681]] Caused by: [Connection refused: host1/ipaddress1:28681]
>>>>>>>>
>>>>>>>> Thanks,
>>>>>>>> Dinesh
>>>>>>>>
>>>>>>>> On Sun, Mar 22, 2020 at 1:25 PM Dinesh J <dineshj.86@gmail.com>
>>>>>>>> wrote:
>>>>>>>>
>>>>>>>>> Hi all,
>>>>>>>>> We have a single-job Flink cluster on YARN set up with High
>>>>>>>>> Availability.
>>>>>>>>> Sometimes, after a job manager failure, the next attempt restarts
>>>>>>>>> successfully from the current checkpoint.
>>>>>>>>> But occasionally we get the error below.
>>>>>>>>>
>>>>>>>>> {"errors":["Service temporarily unavailable due to an ongoing leader election. Please refresh."]}
>>>>>>>>>
>>>>>>>>> Hadoop version: Hadoop 2.7.1.2.4.0.0-169
>>>>>>>>>
>>>>>>>>> Flink version: flink-1.7.2
>>>>>>>>>
>>>>>>>>> Zookeeper version: 3.4.6-169--1
>>>>>>>>>
>>>>>>>>>
>>>>>>>>> *Below is the flink configuration*
>>>>>>>>>
>>>>>>>>> high-availability: zookeeper
>>>>>>>>>
>>>>>>>>> high-availability.zookeeper.quorum: host1:2181,host2:2181,host3:2181
>>>>>>>>>
>>>>>>>>> high-availability.storageDir: hdfs:///flink/ha
>>>>>>>>>
>>>>>>>>> high-availability.zookeeper.path.root: /flink
>>>>>>>>>
>>>>>>>>> yarn.application-attempts: 10
>>>>>>>>>
>>>>>>>>> state.backend: rocksdb
>>>>>>>>>
>>>>>>>>> state.checkpoints.dir: hdfs:///flink/checkpoint
>>>>>>>>>
>>>>>>>>> state.savepoints.dir: hdfs:///flink/savepoint
>>>>>>>>>
>>>>>>>>> jobmanager.execution.failover-strategy: region
>>>>>>>>>
>>>>>>>>> restart-strategy: failure-rate
>>>>>>>>>
>>>>>>>>> restart-strategy.failure-rate.max-failures-per-interval: 3
>>>>>>>>>
>>>>>>>>> restart-strategy.failure-rate.failure-rate-interval: 5 min
>>>>>>>>>
>>>>>>>>> restart-strategy.failure-rate.delay: 10 s
>>>>>>>>>
>>>>>>>>>
>>>>>>>>>
>>>>>>>>> Can someone let me know whether I am missing something, or whether
>>>>>>>>> this is a known issue?
>>>>>>>>>
>>>>>>>>> Could it be related to a hostname/IP mapping issue or a ZooKeeper
>>>>>>>>> version issue?
>>>>>>>>>
>>>>>>>>> Thanks,
>>>>>>>>>
>>>>>>>>> Dinesh