spark-issues mailing list archives

From "Staffan Arvidsson (JIRA)" <j...@apache.org>
Subject [jira] [Commented] (SPARK-5350) There are issues when combining Spark and CDK (https://github.com/egonw/cdk).
Date Thu, 22 Jan 2015 14:21:46 GMT

    [ https://issues.apache.org/jira/browse/SPARK-5350?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14287494#comment-14287494 ]

Staffan Arvidsson commented on SPARK-5350:
------------------------------------------

Well, I've tried to fix this for some time now. I'm still running it on my local computer,
so I can't use "Spark + Hadoop (2.x) from your cluster at runtime". I tried adding
"<scope>provided</scope>" to the Maven pom, but that did not change anything.
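
For reference, the provided-scope change I tried looks roughly like this (the version and
Scala suffix are just examples and should match whatever the project actually uses):

{code:xml}
<!-- Sketch only: mark the Spark dependency as "provided" so the Spark/Hadoop
     jars supplied by the cluster are used at runtime instead of being bundled
     with the application. Version and Scala suffix are examples. -->
<dependency>
  <groupId>org.apache.spark</groupId>
  <artifactId>spark-core_2.10</artifactId>
  <version>1.2.0</version>
  <scope>provided</scope>
</dependency>
{code}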

I have gone through some of the basic Spark programs (word count, calculating pi, etc.), which
have all worked fine, until I added the dependency for the CDK library. I created a separate
Maven project in my working directory, copied the pom from the failing project, and only
removed the CDK dependency. Works like a charm! It could be that the issue is located somewhere
else, but I have a hard time finding where that would be.
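
If the problem does turn out to be a clashing transitive dependency (only a guess at this
point), running "mvn dependency:tree -Dincludes=org.apache.hadoop" on the failing project
should show whether the CDK artifacts pull in an older Hadoop. If they do, an exclusion along
these lines might help; the CDK coordinates and the hadoop-core artifact below are
placeholders, not checked against CDK's actual pom:

{code:xml}
<!-- Sketch only: exclude a hypothetical older Hadoop artifact that a library
     like CDK might pull in transitively, so that the Hadoop classes shipped
     with Spark are the ones resolved at runtime. All coordinates below are
     placeholders. -->
<dependency>
  <groupId>org.openscience.cdk</groupId>
  <artifactId>cdk-bundle</artifactId>
  <version>1.5.10</version>
  <exclusions>
    <exclusion>
      <groupId>org.apache.hadoop</groupId>
      <artifactId>hadoop-core</artifactId>
    </exclusion>
  </exclusions>
</dependency>
{code}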

> There are issues when combining Spark and CDK (https://github.com/egonw/cdk). 
> ------------------------------------------------------------------------------
>
>                 Key: SPARK-5350
>                 URL: https://issues.apache.org/jira/browse/SPARK-5350
>             Project: Spark
>          Issue Type: Bug
>          Components: Spark Core
>    Affects Versions: 1.1.1, 1.2.0
>         Environment: Running Spark on a local computer, using both Mac OS X and a VM with Linux Ubuntu.
>            Reporter: Staffan Arvidsson
>
> I'm using Maven and Eclipse to build my project. When I import the CDK (https://github.com/egonw/cdk)
> jar files that I need, set up the SparkContext, and try, for instance, reading a file
> (simply "val lines = sc.textFile(filePath)"), I get the following errors in the log:
> {quote}
> [main] DEBUG org.apache.spark.rdd.HadoopRDD  - SplitLocationInfo and other new Hadoop
classes are unavailable. Using the older Hadoop location info code.
> java.lang.ClassNotFoundException: org.apache.hadoop.mapred.InputSplitWithLocationInfo
> 	at java.net.URLClassLoader$1.run(URLClassLoader.java:366)
> 	at java.net.URLClassLoader$1.run(URLClassLoader.java:355)
> 	at java.security.AccessController.doPrivileged(Native Method)
> 	at java.net.URLClassLoader.findClass(URLClassLoader.java:354)
> 	at java.lang.ClassLoader.loadClass(ClassLoader.java:425)
> 	at sun.misc.Launcher$AppClassLoader.loadClass(Launcher.java:308)
> 	at java.lang.ClassLoader.loadClass(ClassLoader.java:358)
> 	at java.lang.Class.forName0(Native Method)
> 	at java.lang.Class.forName(Class.java:191)
> 	at org.apache.spark.rdd.HadoopRDD$SplitInfoReflections.<init>(HadoopRDD.scala:381)
> 	at org.apache.spark.rdd.HadoopRDD$.liftedTree1$1(HadoopRDD.scala:391)
> 	at org.apache.spark.rdd.HadoopRDD$.<init>(HadoopRDD.scala:390)
> 	at org.apache.spark.rdd.HadoopRDD$.<clinit>(HadoopRDD.scala)
> 	at org.apache.spark.rdd.HadoopRDD.getJobConf(HadoopRDD.scala:159)
> 	at org.apache.spark.rdd.HadoopRDD.getPartitions(HadoopRDD.scala:194)
> 	at org.apache.spark.rdd.RDD$$anonfun$partitions$2.apply(RDD.scala:205)
> 	at org.apache.spark.rdd.RDD$$anonfun$partitions$2.apply(RDD.scala:203)
> 	at scala.Option.getOrElse(Option.scala:120)
> 	at org.apache.spark.rdd.RDD.partitions(RDD.scala:203)
> 	at org.apache.spark.rdd.MappedRDD.getPartitions(MappedRDD.scala:28)
> 	at org.apache.spark.rdd.RDD$$anonfun$partitions$2.apply(RDD.scala:205)
> 	at org.apache.spark.rdd.RDD$$anonfun$partitions$2.apply(RDD.scala:203)
> 	at scala.Option.getOrElse(Option.scala:120)
> 	at org.apache.spark.rdd.RDD.partitions(RDD.scala:203)
> 	at org.apache.spark.SparkContext.runJob(SparkContext.scala:1328)
> 	at org.apache.spark.rdd.RDD.foreach(RDD.scala:765)
> {quote}
> later in the log: 
> {quote}
> [Executor task launch worker-0] DEBUG org.apache.spark.deploy.SparkHadoopUtil  - Couldn't
find method for retrieving thread-level FileSystem input data
> java.lang.NoSuchMethodException: org.apache.hadoop.fs.FileSystem$Statistics.getThreadStatistics()
> 	at java.lang.Class.getDeclaredMethod(Class.java:2009)
> 	at org.apache.spark.util.Utils$.invoke(Utils.scala:1733)
> 	at org.apache.spark.deploy.SparkHadoopUtil$$anonfun$getFileSystemThreadStatistics$1.apply(SparkHadoopUtil.scala:178)
> 	at org.apache.spark.deploy.SparkHadoopUtil$$anonfun$getFileSystemThreadStatistics$1.apply(SparkHadoopUtil.scala:178)
> 	at scala.collection.TraversableLike$$anonfun$map$1.apply(TraversableLike.scala:244)
> 	at scala.collection.TraversableLike$$anonfun$map$1.apply(TraversableLike.scala:244)
> 	at scala.collection.mutable.ResizableArray$class.foreach(ResizableArray.scala:59)
> 	at scala.collection.mutable.ArrayBuffer.foreach(ArrayBuffer.scala:47)
> 	at scala.collection.TraversableLike$class.map(TraversableLike.scala:244)
> 	at scala.collection.AbstractTraversable.map(Traversable.scala:105)
> 	at org.apache.spark.deploy.SparkHadoopUtil.getFileSystemThreadStatistics(SparkHadoopUtil.scala:178)
> 	at org.apache.spark.deploy.SparkHadoopUtil.getFSBytesReadOnThreadCallback(SparkHadoopUtil.scala:138)
> 	at org.apache.spark.rdd.HadoopRDD$$anon$1.<init>(HadoopRDD.scala:220)
> 	at org.apache.spark.rdd.HadoopRDD.compute(HadoopRDD.scala:210)
> 	at org.apache.spark.rdd.HadoopRDD.compute(HadoopRDD.scala:99)
> 	at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:263)
> 	at org.apache.spark.rdd.RDD.iterator(RDD.scala:230)
> 	at org.apache.spark.rdd.MappedRDD.compute(MappedRDD.scala:31)
> 	at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:263)
> 	at org.apache.spark.rdd.RDD.iterator(RDD.scala:230)
> 	at org.apache.spark.scheduler.ResultTask.runTask(ResultTask.scala:61)
> 	at org.apache.spark.scheduler.Task.run(Task.scala:56)
> 	at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:196)
> 	at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1145)
> 	at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:615)
> 	at java.lang.Thread.run(Thread.java:745)
> {quote}
> There have also been issues related to "HADOOP_HOME" not being set, etc., but these seem
> to be intermittent and only occur sometimes.
> After testing different versions of both CDK and Spark, I've found that Spark version 0.9.1
> seems to get things working. This will not solve my problem, though, as I will later need
> functionality from MLlib that is only available in newer versions of Spark.




