flink-user mailing list archives

From Simone Robutti <simone.robu...@radicalbit.io>
Subject Re: Flink Checkpoint on yarn
Date Wed, 16 Mar 2016 15:12:47 GMT
This is the log filtered to check messages from
ZooKeeperCompletedCheckpointStore.

https://gist.github.com/chobeat/0222b31b87df3fa46a23

It looks like it finds only one checkpoint, but I'm not sure whether the
different hashes and IDs of the checkpoints are meaningful.
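
For anyone who wants to produce a similar extract from their own JobManager
log, here is a minimal sketch. It assumes only the two message shapes quoted
further down in this thread; the rest of the log line layout, and the helper
names store_lines and found_counts, are hypothetical.

```python
import re

# Message shape quoted in this thread; the log layout around it is assumed.
FOUND = re.compile(r"Found (\d+) checkpoints? in ZooKeeper")
STORE = "ZooKeeperCompletedCheckpointStore"


def store_lines(lines):
    """Return only the lines that mention the checkpoint store."""
    return [line.rstrip("\n") for line in lines if STORE in line]


def found_counts(lines):
    """Extract X from every 'Found X checkpoints in ZooKeeper' message."""
    counts = []
    for line in store_lines(lines):
        match = FOUND.search(line)
        if match:
            counts.append(int(match.group(1)))
    return counts
```

Passing an open log file to found_counts returns one number per recovery,
which makes it easy to see whether more than one checkpoint was ever found.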



2016-03-16 15:33 GMT+01:00 Ufuk Celebi <uce@apache.org>:

> Can you please have a look into the JobManager log file and report
> which checkpoints are restored? You should see messages from
> ZooKeeperCompletedCheckpointStore like:
> - Found X checkpoints in ZooKeeper
> - Initialized with X. Removing all older checkpoints
>
> You can share the complete job manager log file as well if you like.
>
> – Ufuk
>
> On Wed, Mar 16, 2016 at 2:50 PM, Simone Robutti
> <simone.robutti@radicalbit.io> wrote:
> > Hello,
> >
> > I'm testing the checkpointing functionality with hdfs as a backend.
> >
> > From what I can see, it uses different checkpoint files and resumes the
> > computation from different points rather than from the latest available
> > one. To me this is unexpected behaviour.
> >
> > I log every second, for every worker, a counter that is increased by 1 at
> > each step.
> >
> > So for example, on node-1 the count goes up to 5, then I kill a job
> > manager or task manager and it resumes from 5 or 4, which is fine. The
> > next time I kill a job manager the count is at 15 and it resumes at 14
> > or 15. Sometimes, at a third kill, the work resumes at 4 or 5, as if the
> > checkpoint used for the second resume weren't there.
> >
> > Once I even saw it jump forward: the first kill is at 10 and it resumes
> > at 9, the second kill is at 70 and it resumes at 9, the third kill is at
> > 15 but it resumes at 69, as if it resumed from the second kill's
> > checkpoint.
> >
> > This is clearly inconsistent.
> >
> > Also, in the logs I can see that it sometimes uses a checkpoint file
> > different from the one used by the previous, consistent resume.
> >
> > What am I doing wrong? Is it a known bug?
>
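
The inconsistency described in the quoted message can be checked
mechanically. The sketch below is only an illustration: the function name
and the tolerance of one step are assumptions, the latter taken from the
report that resuming at 4 or 5 after a kill at 5 is fine. The sample pairs
are the "jump forward" observations from the message.

```python
def inconsistent_resumes(events, tolerance=1):
    """Return the (kill_count, resume_count) pairs where the job did not
    resume within `tolerance` steps at or below the count seen at the kill."""
    return [
        (killed_at, resumed_at)
        for killed_at, resumed_at in events
        if not (0 <= killed_at - resumed_at <= tolerance)
    ]


# Observations from the quoted message: (count at kill, count after resume)
observed = [(10, 9), (70, 9), (15, 69)]
print(inconsistent_resumes(observed))  # prints [(70, 9), (15, 69)]
```

Only the first resume is consistent with restoring the latest checkpoint.
The second falls far behind the count at the kill, and the third resumes
ahead of it.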
