flink-user mailing list archives

From Simone Robutti <simone.robu...@radicalbit.io>
Subject Re: Flink Checkpoint on yarn
Date Wed, 16 Mar 2016 17:55:23 GMT
Actually, the test was intended for a single job. The fact that more jobs
show up is unexpected, and it will be the first thing we verify. Given
these problems, we will also run deeper tests with multiple jobs.

The logs were collected with "yarn logs", but log aggregation is not
properly configured, so I wouldn't rely too much on them. Before running
the tests tomorrow I will clear all existing logs, just to be sure.
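For reference, a hedged sketch of fetching and filtering per-application logs once YARN log aggregation is properly configured. The application id is a placeholder, not one from this thread; the grep pattern matches the checkpoint-store messages discussed below.

```shell
# Placeholder application id -- find the real one with `yarn application -list`.
APP_ID=application_0000000000000_0000

# Fetch the aggregated logs for that application. This only works after the
# application finishes and with yarn.log-aggregation-enable=true in yarn-site.xml.
yarn logs -applicationId "$APP_ID" > "${APP_ID}.log"

# Filter for the checkpoint recovery messages from the job manager.
grep "ZooKeeperCompletedCheckpointStore" "${APP_ID}.log"
```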

2016-03-16 18:19 GMT+01:00 Ufuk Celebi <uce@apache.org>:

> OK, so you are submitting multiple jobs, but you submit them with -m
> yarn-cluster and therefore expect them to start separate YARN
> clusters. Makes sense and I would expect the same.
>
> I think you can check, in the client logs printed to stdout, which
> cluster the job was submitted to.
>
> PS: The logs you have shared are out of order; how did you gather
> them? Do you have an idea why? Maybe something is mixed up in the way
> we gather the logs, and we only think something is wrong because of
> this.
>
>
> On Wed, Mar 16, 2016 at 6:11 PM, Simone Robutti
> <simone.robutti@radicalbit.io> wrote:
> > I didn't resubmit the job. Also, the jobs are submitted one by one with
> > -m yarn-cluster, not with a long-running YARN session, so I don't really
> > know if they could mix up.
> >
> > I will repeat the test from a clean state, because we saw that killing
> > the job with "yarn application -kill" left the "flink run" process
> > alive, so that may be the problem. We only noticed this a few minutes
> > ago.
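[The clean-state check described above can be sketched as follows, assuming a Unix shell on the submitting host; the application id is a placeholder.]

```shell
# Kill the YARN application (placeholder id; take the real one from
# `yarn application -list`).
yarn application -kill application_0000000000000_0000

# Check whether the client-side "flink run" process survived the kill.
# pgrep prints matching PIDs, or nothing if the client exited cleanly.
pgrep -f "flink run" || echo "no leftover flink run client"
```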
> >
> > If the problem persists, I will eventually come back with a full log.
> >
> > Thanks for now,
> >
> > Simone
> >
> > 2016-03-16 18:04 GMT+01:00 Ufuk Celebi <uce@apache.org>:
> >>
> >> Hey Simone,
> >>
> >> from the logs it looks like multiple jobs have been submitted to the
> >> cluster, not just one. The different files correspond to different
> >> jobs recovering. The filtered logs show three jobs running/recovering
> >> (with IDs 10d8ccae6e87ac56bf763caf4bc4742f,
> >> 124f29322f9026ac1b35435d5de9f625, 7f280b38065eaa6335f5c3de4fc82547).
> >>
> >> Did you manually re-submit the job after killing a job manager?
> >>
> >> Regarding the counts: they can be rolled back to a previous consistent
> >> state if the latest checkpoint was not yet completed (including the
> >> write to ZooKeeper).
> >>
> >> Can you please share the complete job manager logs of your program?
> >> The most helpful thing will be to have a log for each started job
> >> manager container. I don't know if that is easily possible.
> >>
> >> – Ufuk
> >>
> >> On Wed, Mar 16, 2016 at 4:12 PM, Simone Robutti
> >> <simone.robutti@radicalbit.io> wrote:
> >> > This is the log filtered to check messages from
> >> > ZooKeeperCompletedCheckpointStore.
> >> >
> >> > https://gist.github.com/chobeat/0222b31b87df3fa46a23
> >> >
> >> > It looks like it finds only one checkpoint, but I'm not sure whether
> >> > the different hashes and IDs of the checkpoints are meaningful.
> >> >
> >> >
> >> >
> >> > 2016-03-16 15:33 GMT+01:00 Ufuk Celebi <uce@apache.org>:
> >> >>
> >> >> Can you please have a look into the JobManager log file and report
> >> >> which checkpoints are restored? You should see messages from
> >> >> ZooKeeperCompletedCheckpointStore like:
> >> >> - Found X checkpoints in ZooKeeper
> >> >> - Initialized with X. Removing all older checkpoints
> >> >>
> >> >> You can share the complete job manager log file as well if you like.
> >> >>
> >> >> – Ufuk
> >> >>
> >> >> On Wed, Mar 16, 2016 at 2:50 PM, Simone Robutti
> >> >> <simone.robutti@radicalbit.io> wrote:
> >> >> > Hello,
> >> >> >
> >> >> > I'm testing the checkpointing functionality with hdfs as a backend.
> >> >> >
> >> >> > From what I can see, it uses different checkpoint files and resumes
> >> >> > the computation from different points, not from the latest available
> >> >> > one. This is unexpected behaviour to me.
> >> >> >
> >> >> > I log, every second and for every worker, a counter that is
> >> >> > increased by 1 at each step.
> >> >> >
> >> >> > So, for example, on node-1 the count goes up to 5; then I kill a job
> >> >> > manager or a task manager and it resumes from 5 or 4, which is OK.
> >> >> > The next time I kill a job manager the count is at 15 and it resumes
> >> >> > at 14 or 15. Sometimes it may happen that at a third kill the work
> >> >> > resumes at 4 or 5, as if the checkpoint restored the second time
> >> >> > wasn't there.
> >> >> >
> >> >> > Once I even saw it jump forward: the first kill is at 10 and it
> >> >> > resumes at 9, the second kill is at 70 and it resumes at 9, and the
> >> >> > third kill is at 15 but it resumes at 69, as if it resumed from the
> >> >> > second kill's checkpoint.
> >> >> >
> >> >> > This is clearly inconsistent.
> >> >> >
> >> >> > Also, in the logs I can see that it sometimes uses a checkpoint file
> >> >> > different from the one of the previous, consistent resume.
> >> >> >
> >> >> > What am I doing wrong? Is it a known bug?
> >> >
> >> >
> >
> >
>
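[Editor's note: the rollback behaviour Ufuk describes in this thread — resuming at 14 or 15 after a kill at 15 — can be sketched framework-independently. The following is plain Python illustrating restore-from-latest-completed-checkpoint semantics, not Flink API code; class and method names are invented for illustration.]

```python
# Sketch of "restore from the latest *completed* checkpoint": an in-flight
# checkpoint that was never acknowledged (e.g. not yet written to ZooKeeper)
# is not eligible for recovery, so the state rolls back past it.

class CheckpointedCounter:
    def __init__(self):
        self.count = 0
        self.completed = []  # acknowledged checkpoint states, oldest first

    def step(self):
        self.count += 1

    def checkpoint(self, acknowledged):
        # Only acknowledged checkpoints are stored for recovery.
        if acknowledged:
            self.completed.append(self.count)

    def recover(self):
        # Roll back to the newest completed checkpoint (0 if none exists).
        self.count = self.completed[-1] if self.completed else 0
        return self.count

counter = CheckpointedCounter()
for _ in range(4):
    counter.step()
counter.checkpoint(acknowledged=True)   # completed at count 4
counter.step()                          # count is now 5
counter.checkpoint(acknowledged=False)  # in-flight, never acknowledged
print(counter.recover())                # restores 4, not 5
```

This matches the "kill at 15, resume at 14 or 15" observation: whether the job resumes at the killed count or slightly earlier depends on whether the last checkpoint completed before the kill.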
