Regarding your last reply to http://apache-flink-user-mailing-list-archive.2336050.n4.nabble.com/blob-store-defaults-to-tmp-and-files-get-deleted-td11720.html
You mention: "Flink (via the user code class loader) actually holds a reference to the JAR files in "/tmp", so even if "/tmp" get wiped, the JAR file remains usable by the class loader". In my understanding, even if that's true, it doesn't survive a failure of the JobManager/TaskManager process, because the handle would be lost and the file would be gone.
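To illustrate the distinction I'm drawing, here is a hypothetical sketch (not Flink code) of the open-handle behavior: on Linux, a file unlinked from the directory tree stays readable through an existing handle, which is why a wiped /tmp doesn't immediately break a class loader that already opened the jar.

```python
import os
import tempfile

# POSIX semantics demo: create a file, open a handle, delete the file,
# and show the handle can still read it.
fd, path = tempfile.mkstemp()
with os.fdopen(fd, "w") as f:
    f.write("jar bytes")

handle = open(path)   # stand-in for the user-code class loader's open handle
os.remove(path)       # stand-in for a tmp cleaner wiping /tmp
data = handle.read()  # still succeeds while this process is alive
handle.close()
print(data)
```

But that only holds while the process holding the handle is alive; if the process dies (a JobManager failure), the handle is gone and so is the file, which is exactly the case I'm worried about.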
We're still running Flink 1.2.1, so maybe we're missing out on some of the improvements that have been made since. However, we recently had a batch (DataSet) job fail to restart successfully, apparently after a JobManager failure. This particular job runs in AWS EMR (on YARN), which means that only one JobManager runs at a time, and when it fails it gets restarted.
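For context, our HA setup is along these lines (an illustrative flink-conf.yaml fragment using Flink 1.2.x key names; the hostnames and paths here are placeholders, not our real values):

```yaml
# Illustrative fragment only -- values are placeholders.
high-availability: zookeeper
high-availability.zookeeper.quorum: zk1:2181,zk2:2181,zk3:2181
high-availability.zookeeper.path.root: /flink
# Recovery metadata is supposed to live on durable storage (e.g. HDFS/S3),
# not on a local disk that disappears with the JobManager host:
high-availability.storageDir: hdfs:///flink/recovery
# Local blob cache directory (we point this away from /tmp):
blob.storage.directory: /var/flink/blobs
```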
Here's what I can see from the logs. When the job restarts, it goes from CREATED -> RUNNING state, and then logs:
23:23:56,798 INFO org.apache.flink.runtime.executiongraph.ExecutionGraph (flink-akka.actor.default-dispatcher-55): Job com.expedia…MyJob (c58185a78dd64cfc9f12374bd1f9a679) switched from state RUNNING to SUSPENDED.
java.lang.Exception: JobManager is no longer the leader.
    at org.apache.flink.runtime.jobmanager.JobManager$$anonfun$handleMessage$1.applyOrElse(JobManager.scala:319)
I assume that's normal/expected, because the JobManager was restarted but some portion of the job state is still referring to the old one. Next, YarnJobManager logs: "Attempting to recover job c58185a78dd64cfc9f12374bd1f9a679." However, it subsequently fails:
2017-08-03 00:09:18,991 WARN org.apache.flink.yarn.YarnJobManager (flink-akka.actor.default-dispatcher-96): Failed to recover job c58185a78dd64cfc9f12374bd1f9a679.
java.lang.Exception: Failed to retrieve the submitted job graph from state handle.
    at org.apache.flink.runtime.jobmanager.ZooKeeperSubmittedJobGraphStore.recoverJobGraph(ZooKeeperSubmittedJobGraphStore.java:180)
…
Caused by: java.lang.RuntimeException: Unable to instantiate the hadoop input format
    at org.apache.flink.api.java.hadoop.mapreduce.HadoopInputFormatBase.readObject(HadoopInputFormatBase.java:319)
…
    at org.apache.flink.util.InstantiationUtil.deserializeObject(InstantiationUtil.java:305)
    at org.apache.flink.runtime.state.RetrievableStreamStateHandle.retrieveState(RetrievableStreamStateHandle.java:58)
    at org.apache.flink.runtime.jobmanager.ZooKeeperSubmittedJobGraphStore.recoverJobGraph(ZooKeeperSubmittedJobGraphStore.java:178)
    ... 15 more
Caused by: java.lang.ClassNotFoundException: org.apache.parquet.avro.AvroParquetInputFormat
    at java.net.URLClassLoader.findClass(URLClassLoader.java:381)
    at java.lang.ClassLoader.loadClass(ClassLoader.java:424)
    at sun.misc.Launcher$AppClassLoader.loadClass(Launcher.java:331)
    at java.lang.ClassLoader.loadClass(ClassLoader.java:357)
    at java.lang.Class.forName0(Native Method)
    at java.lang.Class.forName(Class.java:348)
    at org.apache.flink.api.java.hadoop.mapreduce.HadoopInputFormatBase.readObject(HadoopInputFormatBase.java:317)
    ... 69 more
The missing class comes from our job, so it seems like the job jar isn't present on the classpath of the JobManager. When I look at the contents of our configured blob storage directory (we're not using /tmp), I see two subfolders, but only one of them has a JAR in it, so it looks like a new directory is created for each new JobManager. When I look in ZooKeeper at nodes such as "/flink/main/jobgraphs/c58185a78dd64cfc9f12374bd1f9a679", I don't see those directories mentioned. Can someone explain how Flink knows where to retrieve the job jar for a job retry when the JobManager has failed? Are we running into a Flink bug here?
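For what it's worth, the ClassNotFoundException above matches what I'd expect from deserializing the job graph without the user jar on the classpath. Here's a hypothetical sketch of that failure mode in miniature, using Python's pickle, which resolves classes by name much like Java deserialization does (all names below are made up for illustration):

```python
import pickle
import sys
import types

# Define a class inside a throwaway module, standing in for user code
# (such as an input format) that lives only in the job jar.
mod = types.ModuleType("user_job_code")
exec("class MyInputFormat:\n    pass\n", mod.__dict__)
sys.modules["user_job_code"] = mod

blob = pickle.dumps(mod.MyInputFormat())  # the "submitted job graph"

del sys.modules["user_job_code"]  # the "jar" is no longer available

err = None
try:
    pickle.loads(blob)  # like recoverJobGraph: resolves the class by name
except Exception as e:
    err = type(e).__name__
print(err)  # the class can no longer be found
```

The serialized bytes are fine; deserialization fails purely because the class definition can't be located at recovery time, which is why I'm focused on where the job jar lives after a JobManager restart.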
Thanks for the info,
Shannon