spark-issues mailing list archives

From "Younos Aboulnaga (JIRA)" <j...@apache.org>
Subject [jira] [Updated] (SPARK-14638) Spark task does not have access to a dependency in the classloader of the executor thread
Date Wed, 11 May 2016 20:23:12 GMT

     [ https://issues.apache.org/jira/browse/SPARK-14638?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Younos Aboulnaga updated SPARK-14638:
-------------------------------------
    Component/s: Spark Core

> Spark task does not have access to a dependency in the classloader of the executor thread
> -----------------------------------------------------------------------------------------
>
>                 Key: SPARK-14638
>                 URL: https://issues.apache.org/jira/browse/SPARK-14638
>             Project: Spark
>          Issue Type: Bug
>          Components: Spark Core
>    Affects Versions: 1.2.1, 1.4.1, 1.6.0, 1.6.1
>         Environment: > uname -a
> Linux HOSTNAME 3.13.0-74-generic #118-Ubuntu SMP Thu Dec 17 22:52:10 UTC 2015 x86_64 x86_64 x86_64 GNU/Linux
> > java -version
> java version "1.8.0_77"
> Java(TM) SE Runtime Environment (build 1.8.0_77-b03)
> Java HotSpot(TM) 64-Bit Server VM (build 25.77-b03, mixed mode)
>            Reporter: Younos Aboulnaga
>
> We have started to frequently see Spark apps failing with a NoClassDefFoundError even though the dependency had been added to the ClassLoader just before the error was thrown.
> The [Executor.run method adds the JAR|https://github.com/apache/spark/blob/v1.6.1/core/src/main/scala/org/apache/spark/executor/Executor.scala#L193] containing the class, yet a NoClassDefFoundError is thrown afterwards.
> We see log messages from [updateDependencies|https://github.com/apache/spark/blob/v1.6.1/core/src/main/scala/org/apache/spark/executor/Executor.scala#L386] indicating that the JAR is fetched and added to the class loader.
> Upon inspection of the worker dir, the JAR is there, it is not corrupted, and it contains the class that could not be found in the class loader.
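> For reference, a minimal sketch of our reading of that step (simplified and approximate, not the exact Spark source): updateDependencies fetches any new JAR into the worker dir and appends its URL to the executor's MutableURLClassLoader, after which the class should be resolvable:
>
>   // Simplified sketch of the dependency-update step; names and signatures
>   // are approximate, not copied from the Spark source.
>   import java.io.File
>   import java.net.{URL, URLClassLoader}
>   import scala.collection.mutable
>
>   class MutableURLClassLoader(urls: Array[URL], parent: ClassLoader)
>       extends URLClassLoader(urls, parent) {
>     override def addURL(url: URL): Unit = super.addURL(url)  // expose the protected method
>   }
>
>   def updateDependencies(newJars: Map[String, Long],
>                          currentJars: mutable.Map[String, Long],
>                          loader: MutableURLClassLoader,
>                          workDir: File): Unit = {
>     for ((name, timestamp) <- newJars if currentJars.getOrElse(name, -1L) < timestamp) {
>       // In Spark this is Utils.fetchFile(...); here we assume the JAR has
>       // already been downloaded into the executor's working directory.
>       val localJar = new File(workDir, new File(name).getName)
>       currentJars(name) = timestamp
>       // The step in question: the fetched JAR is appended to the task class
>       // loader, so classes in it should be visible to tasks that run afterwards.
>       loader.addURL(localJar.toURI.toURL)
>     }
>   }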
> We first saw this when we started writing streaming apps, and we thought it was something specific to streaming apps. However, this was wrong, as the same problem happened with several batch apps.
> We first saw this on a Standalone cluster, and we thought it might be caused by the lack of a resource manager. We have since installed Mesos and the problem still happens.
> I tried to create a POC Spark app that demonstrates the problem, but I couldn't reliably reproduce it. The problem would still happen in other apps, but it didn't happen in the POC app even though I made it structurally the same as any other app we run. The problem seems to be environmental, especially because we found a workaround for it.
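> One diagnostic that could be folded into such a POC (hypothetical code, not taken from the failing apps; assumes a SparkContext named sc and the usual --jars submission) is to probe the context class loader from inside a task on each executor and report whether the missing class resolves there:
>
>   // Hypothetical probe: check, from inside a task, whether the class is
>   // resolvable through the thread's context class loader. Assumes `sc`.
>   val target = "org.apache.hadoop.hbase.protobuf.ProtobufUtil"
>   val report = sc.parallelize(1 to sc.defaultParallelism, sc.defaultParallelism).map { _ =>
>     val loader = Thread.currentThread().getContextClassLoader
>     val visible =
>       try { Class.forName(target, false, loader); "resolvable" }
>       catch { case e: Throwable => s"NOT resolvable (${e})" }
>     s"${java.net.InetAddress.getLocalHost.getHostName}: $target is $visible via $loader"
>   }.collect()
>   report.foreach(println)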
> The workaround we found is setting SPARK_CLASSPATH *on the executor nodes* to a local copy of the dependency. The problem still happens if we set 'spark.executor.extraClassPath' or 'spark.driver.extraClassPath', or set SPARK_CLASSPATH on the driver node. However, if SPARK_CLASSPATH is set on the executor nodes, the problem doesn't happen because the JAR doesn't need to be added to the class loader by Executor#updateDependencies.
> Other symptoms of the problem are the following:
> 1) Even though there is a 'log4j.properties' on the 'spark.executor.extraClassPath', the first line of the worker's stderr says "Using Spark's default log4j profile: org/apache/spark/log4j-defaults.properties". The log4j.properties file that is shipped with the job is ignored entirely.
> 2) Any configuration files on the 'spark.executor.extraClassPath' are ignored. I mention this because log4j.properties is loaded very early on, in a static call, which might steer the troubleshooting in the wrong direction.
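> A related cross-check (again hypothetical, assuming a SparkContext sc) is to ask each executor which log4j.properties its task class loader actually resolves; note this only reflects what tasks see once they run, since log4j itself is initialized much earlier:
>
>   // Hypothetical check: which log4j.properties (if any) does each executor's
>   // task class loader resolve? Assumes `sc`.
>   sc.parallelize(1 to sc.defaultParallelism, sc.defaultParallelism).map { _ =>
>     val url = Thread.currentThread().getContextClassLoader.getResource("log4j.properties")
>     s"${java.net.InetAddress.getLocalHost.getHostName}: log4j.properties -> $url"
>   }.distinct().collect().foreach(println)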
> Here is the specific example in our case:
> > grep NoClassDef workers/app-20160414111328-0043/0/stderr
> Caused by: java.lang.NoClassDefFoundError: Could not initialize class org.apache.hadoop.hbase.protobuf.ProtobufUtil
> Caused by: java.lang.NoClassDefFoundError: Could not initialize class org.apache.hadoop.hbase.protobuf.ProtobufUtil
> .. SEVERAL ATTEMPTS ...
> Caused by: java.lang.NoClassDefFoundError: Could not initialize class org.apache.hadoop.hbase.protobuf.ProtobufUtil

> Yet in the same application's worker dir:
> > for j in workers/app-20160414111328-0043/0/*.jar ; do jar tf $j | grep ProtobufUtil ; done
> org/apache/hadoop/hbase/protobuf/ProtobufUtil$1.class
> org/apache/hadoop/hbase/protobuf/ProtobufUtil.class
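> One more diagnostic that could help here (hypothetical snippet, not output from the runs above; assumes a SparkContext sc) is to force initialization of the class inside a task and print the full cause chain, since the stack traces above show only the final NoClassDefFoundError:
>
>   // Hypothetical diagnostic: force class initialization inside a task and
>   // report the complete cause chain, not just the final error. Assumes `sc`.
>   sc.parallelize(Seq(1), 1).map { _ =>
>     try {
>       Class.forName("org.apache.hadoop.hbase.protobuf.ProtobufUtil",
>                     true, Thread.currentThread().getContextClassLoader)
>       "initialized OK"
>     } catch {
>       case t: Throwable =>
>         var e: Throwable = t
>         val chain = new StringBuilder
>         while (e != null) {
>           chain.append(e.getClass.getName).append(": ").append(e.getMessage).append(" <- ")
>           e = e.getCause
>         }
>         chain.toString.stripSuffix(" <- ")
>     }
>   }.collect().foreach(println)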
> There are other examples, especially configurations not being found. I think SPARK-12279 may also have the same root cause.
> We have been seeing this in several of our clusters, and several engineers have spent days looking into why their applications suffer from this. We rebuilt our infrastructure (always on AWS EC2 nodes) and tested many hypotheses, including some that are nonsensical, and we still can't find anything that reliably reproduces the problem. The only reliable piece of information is that setting SPARK_CLASSPATH *on the executor nodes* prevents the problem from happening, because then the dependencies are included in the -cp parameter of the java command running the CoarseGrainedExecutorBackend.
> We would appreciate it if someone more knowledgeable in Spark internals could take a look; we can help by providing as many details as possible.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)
