flink-user mailing list archives

From Maximilian Michels <...@apache.org>
Subject Re: Issues testing Flink HA w/ ZooKeeper
Date Mon, 15 Feb 2016 11:51:15 GMT
Hi Stefano,

The job should stop temporarily but then be resumed by the new
JobManager. Have you increased the number of execution retries? AFAIK,
it is set to 0 by default, and with 0 retries the job is not re-run,
even in HA mode. You can set it on the StreamExecutionEnvironment.

Otherwise, you have probably already found the documentation:


On Mon, Feb 15, 2016 at 12:35 PM, Stefano Baghino
<stefano.baghino@radicalbit.io> wrote:
> Hello everyone,
> last week I ran some tests with Apache ZooKeeper to get a grip on Flink's
> HA features. My tests have gone badly so far and I can't work out why.
> My latest tests involved Flink 0.10.2, run as a standalone cluster with 3
> masters and 4 slaves. The 3 masters are also the ZooKeeper (3.4.6) ensemble.
> I started ZooKeeper on each machine, tested its availability and then
> started the Flink cluster. Since there's no reliable distributed filesystem
> on the cluster, I had to use the local file system as the state backend.
> I then submitted a very simple streaming job that writes the timestamp to a
> text file on the local file system each second and then killed the
> process running the job manager to verify that another job manager takes
> over. However, the job just stopped. I still have to perform some checks on
> the handover to the new job manager, but before digging deeper I wanted to
> ask whether my expectation that the job keeps running despite the job
> manager failure is unreasonable.
> Thanks in advance.
> --
> BR,
> Stefano Baghino
> Software Engineer @ Radicalbit
