flink-user mailing list archives

From Richard Deurwaarder <rich...@xeli.eu>
Subject Flink Zookeeper HA: FileNotFoundException blob - Jobmanager not starting up
Date Wed, 17 Jul 2019 17:49:43 GMT

Hello,

I've got a problem with our Flink cluster: the jobmanager no longer starts
up, because it tries to download a nonexistent blob file from the
ZooKeeper storage directory.

We're running Flink 1.8.0 on a Kubernetes cluster and use the Google Cloud
Storage connector [1] to store checkpoints, savepoints and ZooKeeper data.

When I noticed the jobmanager was having problems, it was in a crash loop,
throwing FileNotFoundExceptions [2]:
Caused by: java.io.FileNotFoundException: Item not found:
some-project-flink-state/recovery/hunch/blob/job_e6ad857af7f09b56594e95fe273e9eff/blob_p-486d68fa98fa05665f341d79302c40566b81034e-306d493f5aa810b5f4f7d8d63f5b18b5.
If you enabled STRICT generation consistency, it is possible that the live
version is still available but the intended generation is deleted.

I looked in the blob directory and can only find:
/recovery/hunch/blob/job_1dccee15d84e1d2cededf89758ac2482

I've tried poking around in ZooKeeper to see if I could find anything [3],
but I do not really know what to look for.
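
To make the mismatch explicit, here is a minimal sketch that pulls the job
ID out of the path in the exception and compares it with what is present in
the blob directory. It is not part of Flink: the helper job_id_of and the
regular expression are assumptions based only on the two paths quoted in
this message.

```python
import re

# Assumed layout, taken from the paths above: .../blob/job_<32 hex digits>/blob_<key>
BLOB_PATH = re.compile(r"blob/job_(?P<job_id>[0-9a-f]{32})")


def job_id_of(path):
    """Return the job ID embedded in a blob path, or None if there is none."""
    match = BLOB_PATH.search(path)
    return match.group("job_id") if match else None


# Path from the FileNotFoundException (trailing period of the message dropped).
missing = (
    "some-project-flink-state/recovery/hunch/blob/"
    "job_e6ad857af7f09b56594e95fe273e9eff/"
    "blob_p-486d68fa98fa05665f341d79302c40566b81034e-"
    "306d493f5aa810b5f4f7d8d63f5b18b5"
)

# The only entry I can see in the blob directory.
present = ["/recovery/hunch/blob/job_1dccee15d84e1d2cededf89758ac2482"]

wanted = job_id_of(missing)
found = {job_id_of(p) for p in present}

print("job ID the jobmanager asks for:", wanted)
print("job IDs present in storage:", sorted(found))
print("requested job has a blob directory:", wanted in found)
```

So the jobmanager is asking for a blob that belongs to job
e6ad857af7f09b56594e95fe273e9eff, while storage only holds a directory for
job 1dccee15d84e1d2cededf89758ac2482.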

How could this have happened and how should I recover the job from this
situation?

Thanks,

Richard

[1]
https://ci.apache.org/projects/flink/flink-docs-stable/dev/batch/connectors.html#using-hadoop-file-system-implementations
[2] https://gist.github.com/Xeli/0321031655e47006f00d38fc4bc08e16
[3] https://gist.github.com/Xeli/04f6d861c5478071521ac6d2c582832a
