flink-user mailing list archives

From Maximilian Michels <...@apache.org>
Subject Re: Issues testing Flink HA w/ ZooKeeper
Date Mon, 15 Feb 2016 12:45:40 GMT
Hi Stefano,

That's true, the documentation doesn't mention it. I just wanted to
point you to the documentation in case anything else needs to be
configured. We will update it.

Instead of setting the number of execution retries on the
StreamExecutionEnvironment, you may also set
"execution-retries.default" in the flink-conf.yaml. Let us know if
that fixes your setup.
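For illustration, a minimal sketch of that configuration entry (the retry count of 3 is an arbitrary example; this key applies to the Flink 0.10.x line discussed here):

```yaml
# flink-conf.yaml
# Number of times a failed job is re-executed before it is declared failed.
# The default of 0 means the job is never re-run, even under HA.
# The value 3 below is only an example.
execution-retries.default: 3
```

The same effect can also be set per job in code via `env.setNumberOfExecutionRetries(3)` on the StreamExecutionEnvironment, as mentioned above.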

Cheers,
Max

On Mon, Feb 15, 2016 at 1:41 PM, Stefano Baghino
<stefano.baghino@radicalbit.io> wrote:
> Hi Maximilian,
>
> thank you for the reply. I had checked out the documentation before running
> my tests (I'm not expert enough to skip the docs ;)) but it doesn't
> mention any specific requirement regarding the execution retries. I'll
> check it out, thanks!
>
> On Mon, Feb 15, 2016 at 12:51 PM, Maximilian Michels <mxm@apache.org> wrote:
>>
>> Hi Stefano,
>>
>> The job should stop temporarily but then be resumed by the new
>> JobManager. Have you increased the number of execution retries? AFAIK
>> it is set to 0 by default, which means the job will not be re-run,
>> even in HA mode. You can enable retries on the StreamExecutionEnvironment.
>>
>> Otherwise, you have probably already found the documentation:
>>
>> https://ci.apache.org/projects/flink/flink-docs-master/setup/jobmanager_high_availability.html#configuration
>>
>> Cheers,
>> Max
>>
>> On Mon, Feb 15, 2016 at 12:35 PM, Stefano Baghino
>> <stefano.baghino@radicalbit.io> wrote:
>> > Hello everyone,
>> >
>> > last week I ran some tests with Apache ZooKeeper to get a grip on
>> > Flink's HA features. My tests have gone badly so far and I can't sort
>> > out the reason.
>> >
>> > My latest tests involved Flink 0.10.2, run as a standalone cluster with
>> > 3 masters and 4 slaves. The 3 masters also form the ZooKeeper (3.4.6)
>> > ensemble. I started ZooKeeper on each machine, tested its availability,
>> > and then started the Flink cluster. Since there's no reliable
>> > distributed filesystem on the cluster, I had to use the local file
>> > system as the state backend.
>> >
>> > I then submitted a very simple streaming job that writes the timestamp
>> > to a text file on the local file system every second, and then killed
>> > the process running the job manager to verify that another job manager
>> > takes over. However, the job just stopped. I still have to perform some
>> > checks on the handover to the new job manager, but before digging
>> > deeper I wanted to ask whether my expectation that the job keeps
>> > running despite the job manager failure is unreasonable.
>> >
>> > Thanks in advance.
>> >
>> > --
>> > BR,
>> > Stefano Baghino
>> >
>> > Software Engineer @ Radicalbit
>
>
>
>
> --
> BR,
> Stefano Baghino
>
> Software Engineer @ Radicalbit
