flink-user mailing list archives

From Mu Kong <kong.mu....@gmail.com>
Subject Re: JobManager doesn't recover in HA mode
Date Thu, 01 Feb 2018 07:04:56 GMT
Ah, I think I can just use ./bin/jobmanager.sh
https://ci.apache.org/projects/flink/flink-docs-release-1.4/ops/deployment/cluster_setup.html#adding-a-jobmanager

Thanks!
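For anyone finding this later, a minimal sketch of how the killed JobManager might be brought back by hand, based on the "Adding a JobManager" section linked above. The `FLINK_HOME` default and the `restart_jobmanager` helper name are hypothetical, and the `start cluster` arguments are taken from the 1.4 docs; in an HA setup the script may also need the host and web UI port as extra arguments, so please check `jobmanager.sh` in your own installation before relying on this.

```shell
# Assumption: FLINK_HOME points at the Flink installation on the host
# where the JobManager was killed. /opt/flink is a hypothetical default.
FLINK_HOME="${FLINK_HOME:-/opt/flink}"

# Hypothetical helper: start a JobManager on this host so that it can
# rejoin the cluster as a standby.
restart_jobmanager() {
  script="$FLINK_HOME/bin/jobmanager.sh"
  if [ ! -x "$script" ]; then
    echo "jobmanager.sh not found or not executable under $FLINK_HOME/bin" >&2
    return 1
  fi
  # "start cluster" follows the usage shown in the Flink 1.4 cluster setup docs.
  "$script" start cluster
}
```

Running `restart_jobmanager` on the failed machine should start a new JobManager process there; whether it registers as a standby depends on the HA settings in flink-conf.yaml.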

On Thu, Feb 1, 2018 at 4:00 PM, Mu Kong <kong.mu.biz@gmail.com> wrote:

> Hi Tony,
>
> Thanks for your response!
> I will definitely check out supervisord.
>
> I wonder if there is a way to recover the killed JM and add it back to
> the cluster using one of the scripts in *flink/bin/*.
>
>
> Thanks!
>
>
> Best regards,
> Mu
>
>
> On Thu, Feb 1, 2018 at 3:50 PM, Tony Wei <tony19920430@gmail.com> wrote:
>
>> Hi Mu,
>>
>> AFAIK, that is the expected behavior when you launch your cluster in
>> standalone mode. Flink HA guarantees that the standby JM will take over the
>> whole cluster. The illustration says that a recovered JM becomes another
>> standby, but recovering a failed instance is not Flink HA's
>> responsibility.
>> One option is to launch your JM instance under supervisord [1], which
>> monitors the process and automatically restarts it if it fails
>> unexpectedly. Alternatively, you can run on a YARN cluster, in which case
>> YARN is responsible for recovering the dead JM.
>>
>> Best,
>> Tony Wei
>>
>> [1] http://supervisord.org/
>>
>> 2018-02-01 14:11 GMT+08:00 Mu Kong <kong.mu.biz@gmail.com>:
>>
>>> Hi all,
>>>
>>> I have a Flink HA cluster with 2 job managers and a zookeeper quorum of
>>> 3 nodes.
>>>
>>> My failed job manager was not recovered after I killed it.
>>> Here is how I did it and what I've observed:
>>>
>>> 1. I started the HA cluster with start-cluster.sh
>>> 2. Job manager A got elected.
>>> 3. I killed job manager A with the kill command.
>>> 4. Job manager B got elected.
>>> 5. Job manager B was working well.
>>> 6. But job manager A never recovered since then.
>>>
>>> Am I missing something here, or is it the case that HA cannot handle this
>>> kind of failover (the Flink instance being killed directly)?
>>>
>>> Thanks!
>>>
>>> Best regards,
>>> Mu
>>>
>>
>>
>
