flink-user mailing list archives

From Ufuk Celebi <...@apache.org>
Subject Re: Flink Application on YARN failed on losing Job Manager | No recovery | Need help debug the cause from logs
Date Mon, 07 Nov 2016 10:46:41 GMT



On 4 November 2016 at 17:09:25, Josh (jofo90@gmail.com) wrote:
> Thanks, I didn't know about the -z flag!
>  
> I haven't been able to get it to work though (using yarn-cluster, with a
> zookeeper root configured to /flink in my flink-conf.yaml)
>  
> I can see my job directory in ZK under
> /flink/application_1477475694024_0015 and I've tried a few ways to restore
> the job:
>  
> ./bin/flink run -m yarn-cluster -yz /application_1477475694024_0015 ....
> ./bin/flink run -m yarn-cluster -yz application_1477475694024_0015 ....
> ./bin/flink run -m yarn-cluster -yz /flink/application_1477475694024_0015/
> ....
> ./bin/flink run -m yarn-cluster -yz /flink/application_1477475694024_0015
> ....
>  
> The job starts from scratch each time, without restored state.
>  
> Am I doing something wrong? I've also tried with -z instead of -yz but I'm
> using yarn-cluster to run a single job, so I think it should be -yz.

Can you please check the JobManager logs of the initial job that you want to resume and look
for a line like this:


Using '.../flink/application_...' as Zookeeper namespace.

You then need to set the part after 'flink/' as the namespace, probably "application_1477475694024_0015"
(from your last message).

The flag should be just -z. You can also set it in the Flink config file:

high-availability.cluster-id: application_1477475694024_0015
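
For context, here is a rough sketch of how that entry might sit next to the other ZooKeeper settings in flink-conf.yaml. The key names assume the Flink 1.2 naming scheme, the quorum address is a made-up placeholder, and the fragment is not a complete HA configuration; only the root path and the cluster id come from this thread.

```yaml
# flink-conf.yaml (sketch, not a complete HA setup)
high-availability: zookeeper
# Hypothetical quorum address; replace with your own ZooKeeper hosts.
high-availability.zookeeper.quorum: zk-host:2181
# Root path as described in Josh's message.
high-availability.zookeeper.path.root: /flink
# Namespace of the job to resume, i.e. the part after 'flink/' in the log line.
high-availability.cluster-id: application_1477475694024_0015
```

With this in place, the namespace would be picked up from the config file rather than from the command-line flag.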

Does this help?

---

There is also a new feature in Flink 1.2 allowing you to persist every checkpoint externally.
The feature is already in, but the configuration will be adjusted (https://github.com/apache/flink/pull/2752).

Currently you can configure it by specifying a checkpoint directory manually via:

state.checkpoints.dir: hdfs:///flink/checkpoints

In the CheckpointConfig you enable it via

CheckpointConfig config = env.getCheckpointConfig();
config.enableExternalizedCheckpoints(CheckpointConfig.ExternalizedCheckpointCleanup.RETAIN_ON_CANCELLATION);


– Ufuk





