flink-commits mailing list archives

From se...@apache.org
Subject [1/2] flink git commit: [FLINK-5459] [docs] Add troubleshooting guide for classloading issues
Date Fri, 20 Jan 2017 18:07:15 GMT
Repository: flink
Updated Branches:
  refs/heads/master 6dffe282b -> 3b6267555

[FLINK-5459] [docs] Add troubleshooting guide for classloading issues

Project: http://git-wip-us.apache.org/repos/asf/flink/repo
Commit: http://git-wip-us.apache.org/repos/asf/flink/commit/3b626755
Tree: http://git-wip-us.apache.org/repos/asf/flink/tree/3b626755
Diff: http://git-wip-us.apache.org/repos/asf/flink/diff/3b626755

Branch: refs/heads/master
Commit: 3b626755588a461c8f195a48c4777fd533bf54ec
Parents: 24ff9eb
Author: Stephan Ewen <sewen@apache.org>
Authored: Fri Jan 20 18:50:21 2017 +0100
Committer: Stephan Ewen <sewen@apache.org>
Committed: Fri Jan 20 18:54:30 2017 +0100

 docs/monitoring/debugging_classloading.md | 101 ++++++++++++++++++++++---
 1 file changed, 92 insertions(+), 9 deletions(-)

diff --git a/docs/monitoring/debugging_classloading.md b/docs/monitoring/debugging_classloading.md
index e4e908e..40822c0 100644
--- a/docs/monitoring/debugging_classloading.md
+++ b/docs/monitoring/debugging_classloading.md
@@ -27,19 +27,102 @@ under the License.
 ## Overview of Classloading in Flink
-  - What is in the Application Classloader for different deployment techs
-  - What is in the user code classloader
+When running Flink applications, the JVM will load various classes over time.
+These classes can be divided into two domains:
-  - Access to the user code classloader for applications
+  - The **Flink Framework** domain: This includes all code in the `/lib` directory in the Flink directory.
+    By default these are the classes of Apache Flink and its core dependencies.
-## Classpath Setups
+  - The **User Code** domain: These are all classes that are included in the JAR file submitted via the CLI or web interface.
+    That includes the job's classes, and all libraries and connectors that are put into the uber JAR.
+Class loading behaves slightly differently across the various Flink setups:
+When starting a standalone Flink cluster, the JobManagers and TaskManagers are started with the Flink framework classes in the
+classpath. The classes from all jobs that are submitted against the cluster are loaded *dynamically*.
+YARN classloading differs between single job deployments and sessions:
+  - When submitting a Flink job directly to YARN (via `bin/flink run -m yarn-cluster ...`), dedicated TaskManagers and
+    JobManagers are started for that job. Those JVMs have both Flink framework classes and user code classes in their classpath.
+    That means that there is *no dynamic classloading* involved in that case.
+  - When starting a YARN session, the JobManagers and TaskManagers are started with the Flink framework classes in the
+    classpath. The classes from all jobs that are submitted against the session are loaded dynamically.
+Mesos setups following [this documentation](../setup/mesos.html) currently behave very much like a
+YARN session: The TaskManager and JobManager processes are started with the Flink framework classes in the classpath, and job
+classes are loaded dynamically when the jobs are submitted.
+## Avoiding Dynamic Classloading
+All components (JobManager, TaskManager, Client, ApplicationMaster, ...) log their classpath setting on startup.
+These settings can be found as part of the environment information at the beginning of the log.
+When running a setup where the Flink JobManager and TaskManagers are exclusive to one particular job, one can put JAR files
+directly into the `/lib` folder to make sure they are part of the classpath and not loaded dynamically.
+It usually works to put the job's JAR file into the `/lib` directory. The JAR will be part of both the classpath
+(the *AppClassLoader*) and the dynamic class loader (*FlinkUserCodeClassLoader*).
+Because the AppClassLoader is the parent of the FlinkUserCodeClassLoader (and Java loads parent-first), this should
+result in classes being loaded only once.
+For setups where the job's JAR file cannot be put into the `/lib` folder (for example because the setup is a session that is
+used by multiple jobs), it may still be possible to put common libraries into the `/lib` folder, and avoid dynamic class loading
+for those.
+## Manual Classloading in the Job
+In some cases, a transformation function, source, or sink needs to manually load classes (dynamically via reflection).
+To do that, it needs the classloader that has access to the job's classes.
+In that case, the functions (or sources or sinks) can be made a `RichFunction` (for example `RichMapFunction` or `RichWindowFunction`)
+and access the user code class loader via `getRuntimeContext().getUserCodeClassLoader()`.
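As an illustration (not part of this commit), a minimal sketch of the manual loading step described above. The class `ReflectiveLoading` and the helper `instantiate` are hypothetical names; the sketch only assumes that a suitable `ClassLoader` is handed in. Inside a Flink `RichFunction`, that loader would be the one returned by `getRuntimeContext().getUserCodeClassLoader()`; the `main` method substitutes the class's own loader so the example runs without Flink.

```java
import java.util.List;

final class ReflectiveLoading {

    /**
     * Loads className through the given loader and creates an instance
     * via the public no-argument constructor.
     */
    static <T> T instantiate(String className, Class<T> expectedType, ClassLoader loader)
            throws ReflectiveOperationException {
        // Pass the loader explicitly instead of relying on the caller's loader.
        Class<?> raw = Class.forName(className, true, loader);
        Class<? extends T> typed = raw.asSubclass(expectedType);
        return typed.getDeclaredConstructor().newInstance();
    }

    public static void main(String[] args) throws Exception {
        // Assumption: in a Flink job, this would be the user code class loader.
        ClassLoader loader = ReflectiveLoading.class.getClassLoader();
        List<?> list = instantiate("java.util.ArrayList", List.class, loader);
        System.out.println(list.getClass().getName());
    }
}
```

Passing the loader explicitly avoids the single-argument `Class.forName(String)`, which resolves against the loader of the calling class and may therefore not see the job's classes.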
+## X cannot be cast to X exceptions
+When you see an exception in the style `com.foo.X cannot be cast to com.foo.X`, it means that multiple versions of the class
+`com.foo.X` have been loaded by different class loaders, and instances of one version are being assigned to variables of another.
+The reason is in most cases that an object of the `com.foo.X` class loaded from a previous execution attempt is still cached somewhere,
+and picked up by a restarted task/operator that reloaded the code. Note that this is again only possible in deployments that use
+dynamic class loading.
+Common causes of cached object instances:
+  - When using *Apache Avro*: The *SpecificDatumReader* caches instances of records. Avoid using `SpecificData.INSTANCE`. See also
+    [this discussion](http://apache-flink-user-mailing-list-archive.2336050.n4.nabble.com/How-to-get-help-on-ClassCastException-when-re-submitting-a-job-tp10972p11133.html)
+  - Using certain serialization frameworks for cloning objects (such as *Apache Avro*)
+  - Interning objects (for example via Guava's Interners)
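As an illustration (not part of this commit), the effect can be reproduced outside Flink with plain Java. The sketch below is an assumption-laden stand-in: `DuplicateClassDemo`, `Payload`, and `IsolatingLoader` are hypothetical names, and the loader is a simplified child-first loader rather than Flink's actual user code class loader. It defines the same class twice and shows that an instance of the first copy cannot be cast to the second.

```java
import java.io.ByteArrayOutputStream;
import java.io.IOException;
import java.io.InputStream;
import java.lang.reflect.Constructor;

final class DuplicateClassDemo {

    /** Stand-in for a user code class such as com.foo.X. */
    public static class Payload {
        public Payload() {}
    }

    /** Defines its own copy of Payload; every other class is delegated to the parent. */
    static final class IsolatingLoader extends ClassLoader {
        private final String target = Payload.class.getName();

        IsolatingLoader(ClassLoader parent) {
            super(parent);
        }

        @Override
        protected Class<?> loadClass(String name, boolean resolve) throws ClassNotFoundException {
            if (!name.equals(target)) {
                return super.loadClass(name, resolve);
            }
            synchronized (getClassLoadingLock(name)) {
                Class<?> c = findLoadedClass(name);
                if (c == null) {
                    byte[] bytes = readClassBytes(name);
                    c = defineClass(name, bytes, 0, bytes.length);
                }
                if (resolve) {
                    resolveClass(c);
                }
                return c;
            }
        }

        /** Reads the compiled class file from the parent loader's resources. */
        private byte[] readClassBytes(String name) throws ClassNotFoundException {
            String resource = name.replace('.', '/') + ".class";
            try (InputStream in = getParent().getResourceAsStream(resource)) {
                if (in == null) {
                    throw new ClassNotFoundException(name);
                }
                ByteArrayOutputStream out = new ByteArrayOutputStream();
                byte[] buf = new byte[4096];
                for (int n; (n = in.read(buf)) != -1; ) {
                    out.write(buf, 0, n);
                }
                return out.toByteArray();
            } catch (IOException e) {
                throw new ClassNotFoundException(name, e);
            }
        }
    }

    static Object newInstance(Class<?> c) throws ReflectiveOperationException {
        Constructor<?> ctor = c.getDeclaredConstructor();
        ctor.setAccessible(true);
        return ctor.newInstance();
    }

    public static void main(String[] args) throws Exception {
        ClassLoader parent = DuplicateClassDemo.class.getClassLoader();
        String name = Payload.class.getName();
        // Two "execution attempts", each with a fresh loader.
        Class<?> firstAttempt = new IsolatingLoader(parent).loadClass(name);
        Class<?> secondAttempt = new IsolatingLoader(parent).loadClass(name);

        Object cached = newInstance(firstAttempt); // object surviving in some cache
        try {
            secondAttempt.cast(cached); // same class name, different defining loader
        } catch (ClassCastException e) {
            System.out.println("cast failed: " + e.getMessage());
        }
    }
}
```

Both `Class` objects report the same name, yet the JVM treats them as unrelated types because a class's identity is the pair of its name and its defining loader.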
-  - Finding classpaths in logs
-  - Moving libraries and/or user code to the Application Classpath 
 ## Unloading of Dynamically Loaded Classes
-  - Checkpoint statistics overview
-  - Interpret time until checkpoints
-  - Synchronous vs. asynchronous checkpoint time
+All scenarios that involve dynamic class loading (i.e., standalone, sessions, Mesos, ...) rely on classes being *unloaded* again.
+Class unloading means that the Garbage Collector finds that no objects from a class exist any more, and thus removes the class
+(the code, static variables, metadata, etc).
+Whenever a TaskManager starts (or restarts) a task, it will load that specific task's code. Unless classes can be unloaded, this will
+become a memory leak, as new versions of classes are loaded and the total number of loaded classes accumulates over time. This
+typically manifests itself through an **OutOfMemoryError: PermGen**.
+Common causes for class leaks and suggested fixes:
+  - *Lingering Threads*: Make sure the application functions/sources/sinks shut down all threads. Lingering threads cost resources themselves and
+    additionally typically hold references to (user code) objects, preventing garbage collection and unloading of the classes.
+  - *Interners*: Avoid caching objects in special structures that live beyond the lifetime of the functions/sources/sinks. Examples are Guava's
+    interners, or Avro's class/object caches in the serializers.
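As an illustration (not part of this commit), a minimal sketch of the "lingering threads" advice in plain Java. `BackgroundWorker` is a hypothetical helper; the assumption is that a function, source, or sink would call `open()` when it starts and `close()` when it shuts down, for example from the `open`/`close` lifecycle methods of a `RichFunction`.

```java
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.TimeUnit;

final class BackgroundWorker implements AutoCloseable {

    private ExecutorService executor;

    /** Starts a single background thread that runs until it is interrupted. */
    void open() {
        executor = Executors.newSingleThreadExecutor();
        executor.execute(() -> {
            try {
                while (true) {
                    Thread.sleep(100); // placeholder for periodic work
                }
            } catch (InterruptedException e) {
                // Restore the flag and let the thread exit.
                Thread.currentThread().interrupt();
            }
        });
    }

    /** Interrupts the background thread and waits for it to finish. */
    @Override
    public void close() throws InterruptedException {
        if (executor == null) {
            return;
        }
        executor.shutdownNow();
        if (!executor.awaitTermination(5, TimeUnit.SECONDS)) {
            throw new IllegalStateException("background thread did not stop in time");
        }
    }

    /** True once no thread started by this worker can still be running. */
    boolean isStopped() {
        return executor == null || executor.isTerminated();
    }

    public static void main(String[] args) throws Exception {
        BackgroundWorker worker = new BackgroundWorker();
        worker.open();
        worker.close();
        System.out.println("stopped: " + worker.isStopped());
    }
}
```

Waiting in `close()` rather than only requesting shutdown makes it less likely that a thread outlives the task and keeps the task's classes reachable.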
