flink-user mailing list archives

From Stefano Baghino <stefano.bagh...@radicalbit.io>
Subject Re: Issues testing Flink HA w/ ZooKeeper
Date Mon, 15 Feb 2016 12:40:07 GMT
Hi Ufuk, thanks for replying.

Regarding the masters file: yes, I've specified all the masters and checked
that they were actually running after running start-cluster.sh. I'll gladly
share the logs as soon as I can get to them.

Regarding the state backend: how does using non-distributed storage as the
state backend affect the HA features? I thought it would mean that the job
state couldn't be restored, but that the job itself could still be restarted
once the standby job manager took over. Does not having a reliable
distributed storage service as the state backend mean that the HA features
don't work at all?
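
To make the constraint from the reply quoted below concrete: with the local
file system as state backend, a masters file is only valid if every entry
names the same host. Here is a minimal Python sketch of that check, assuming
one `host` or `host:port` entry per line. The function name and the sample
host names are hypothetical and not part of Flink.

```python
def masters_share_one_host(masters_text):
    """Return True if every non-empty line names the same host."""
    hosts = set()
    for line in masters_text.splitlines():
        entry = line.strip()
        if not entry:
            continue  # ignore blank lines
        # Drop an optional ":port" suffix so "host-X:8081" and "host-X" match.
        hosts.add(entry.split(":", 1)[0])
    return len(hosts) == 1


# Hypothetical masters files, for illustration only.
same_host = "host-X\nhost-X\nhost-X\n"
three_hosts = "host-A\nhost-B\nhost-C\n"

print(masters_share_one_host(same_host))    # all entries on one machine
print(masters_share_one_host(three_hosts))  # entries spread over three machines
```

My setup corresponds to the second case (three different master machines),
so by this rule it would not be a valid configuration for a local file
system state backend.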

Again, thank you very much.

On Mon, Feb 15, 2016 at 12:48 PM, Ufuk Celebi <uce@apache.org> wrote:

> Using the local file system as state backend only works if all job
> managers run on the same machine. Is that the case?
>
> Have you specified all job managers in the masters file? With the
> local file system state backend only something like
>
> host-X
> host-X
> host-X
>
> will be a valid masters configuration.
>
> Can you please share the job manager logs of all started job managers?
>
> – Ufuk
>
>
> On Mon, Feb 15, 2016 at 12:35 PM, Stefano Baghino
> <stefano.baghino@radicalbit.io> wrote:
> > Hello everyone,
> >
> > last week I ran some tests with Apache ZooKeeper to get a grip on Flink's
> > HA features. My tests have gone badly so far and I can't work out why.
> >
> > My latest tests involved Flink 0.10.2, run as a standalone cluster with 3
> > masters and 4 slaves. The 3 masters also form the ZooKeeper (3.4.6)
> > ensemble. I started ZooKeeper on each machine, tested its availability and
> > then started the Flink cluster. Since there's no reliable distributed
> > filesystem on the cluster, I had to use the local file system as the state
> > backend.
> >
> > I then submitted a very simple streaming job that writes the timestamp to
> > a text file on the local file system each second, and then killed the
> > process running the job manager to verify that another job manager takes
> > over. However, the job just stopped. I still have to perform some checks
> > on the handover to the new job manager, but before digging deeper I wanted
> > to ask whether my expectation that the job keeps running despite the job
> > manager failure is unreasonable.
> >
> > Thanks in advance.
> >
> > --
> > BR,
> > Stefano Baghino
> >
> > Software Engineer @ Radicalbit
>



-- 
BR,
Stefano Baghino

Software Engineer @ Radicalbit
