flink-user mailing list archives

From Mu Kong <kong.mu....@gmail.com>
Subject Re: JobManager doesn't recover in HA mode
Date Thu, 01 Feb 2018 07:00:01 GMT
Hi Tony,

Thanks for your response!
I will definitely check out supervisord.

I wonder if there is a way to recover the killed JM and add it back to the
cluster by using one of the scripts in *flink/bin/*.


Thanks!


Best regards,
Mu


On Thu, Feb 1, 2018 at 3:50 PM, Tony Wei <tony19920430@gmail.com> wrote:

> Hi Mu,
>
> AFAIK, that is the expected behavior when you launch your cluster in
> standalone mode. Flink HA guarantees that the standby JM will take over the
> whole cluster. The illustration only says that a recovered JM will become
> another standby, but recovering a failed instance is not Flink HA's
> responsibility.
> One possible way is to use supervisord [1] to launch your JM instance. It
> can monitor the process and automatically restart it when it fails
> unexpectedly. Alternatively, you can run on a YARN cluster, in which case
> YARN is responsible for recovering the dead JM.
>
> Best,
> Tony Wei
>
> [1] http://supervisord.org/
>
> 2018-02-01 14:11 GMT+08:00 Mu Kong <kong.mu.biz@gmail.com>:
>
>> Hi all,
>>
>> I have a Flink HA cluster with 2 job managers and a zookeeper quorum of 3
>> nodes.
>>
>> My failed job manager didn't get recovered after I killed it.
>> Here is how I did it and what I've observed:
>>
>> 1. I started the HA cluster with start-cluster.sh
>> 2. Job manager A got elected.
>> 3. I killed job manager A with kill command.
>> 4. Job manager B got elected.
>> 5. Job manager B was working well.
>> 6. But job manager A never recovered since then.
>>
>> Am I missing something here, or is it the case that HA cannot handle such
>> a failover (the Flink instance gets killed directly)?
>>
>> Thanks!
>>
>> Best regards,
>> Mu
>>
>
>
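
For readers of this thread: the supervisord suggestion quoted above comes down to running the JM in the foreground under a parent process that restarts it whenever it dies. The following is only a minimal sketch of that idea, not a replacement for supervisord and not anything shipped with Flink. The function name `supervise` and its parameters are hypothetical and chosen for illustration.

```python
import subprocess
import time


def supervise(command, max_restarts=3, delay_seconds=1.0):
    """Run `command` in the foreground and restart it after an abnormal exit.

    Returns the number of restarts performed before a clean exit.
    Raises RuntimeError once `max_restarts` restarts have been used up.
    """
    restarts = 0
    while True:
        # Blocks until the child exits. The child must stay in the
        # foreground: a script that daemonizes returns immediately and
        # would look like a clean exit to this loop.
        exit_code = subprocess.call(command)

        # Exit code 0 is treated as an intentional shutdown.
        if exit_code == 0:
            return restarts

        # A non-zero code (negative on POSIX when killed by a signal,
        # e.g. by the kill command) is treated as a failure.
        if restarts >= max_restarts:
            raise RuntimeError(
                f"giving up after {restarts} restarts "
                f"(last exit code {exit_code})"
            )
        restarts += 1
        time.sleep(delay_seconds)
```

A call might look like `supervise(["/opt/flink/bin/jobmanager.sh", "start-foreground"])`. Both the install path and the exact arguments are assumptions here; the arguments accepted by `jobmanager.sh` differ between Flink versions, so check the usage text of the script in your own *flink/bin/*. Once restarted, the JM would be expected to rejoin as a standby, as described in the quoted reply.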
