mahout-user mailing list archives

From Pat Ferrel <...@occamsmachete.com>
Subject Re: spark-itemsimilarity can't launch on a Spark cluster?
Date Sat, 11 Oct 2014 01:22:16 GMT
Did you stop the 1.6g job or did it fail?

I see task failures but no stage failures.


On Oct 10, 2014, at 8:49 AM, pol <swallow_pulm@163.com> wrote:

Hi Pat,
	Yes, spark-itemsimilarity works OK; it finished the calculation on the 150 MB dataset.

	The problem above is that the 1.6 GB dataset cannot finish the calculation. I have three machines (16 cores and 16 GB of memory each) for this test; is that environment not enough to finish it?
	The dataset was packed into a single file with the hadoop archive tool, which is why only one machine is in the processing state. I did this because, without archiving, some errors occur; details are in the attached screenshots.
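For reference, the archiving step used the hadoop archive tool, roughly like this (the archive name and paths here are placeholders, not the real ones):

    hadoop archive -archiveName views.har -p /raw/views /archived_input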
	<spark1.png>

<spark2.png>

<spark3.png>


	If it would help, I can provide the test dataset to you.

	Thank you again.


On Oct 10, 2014, at 22:07, Pat Ferrel <pat@occamsmachete.com> wrote:

> So it is completing some of the spark-itemsimilarity jobs now? That is better at least.
> 
> Yes. More data means you may need more memory or more nodes in your cluster. This is
how to scale Spark and Hadoop. Spark in particular needs core memory since it tries to avoid
disk read/write.
> 
> Try increasing -sem as far as you can first; then you may need to add machines to your cluster to speed it up. Do you need results faster than 15 hours?
> 
> Remember that the way the Solr recommender works allows you to make recommendations to new users and train less often. The new user data does not have to be in the training/indicator data. You retrain partly based on how many new users there are and partly based on how many new items are added to the catalog.
> 
> On Oct 10, 2014, at 1:47 AM, pol <swallow_pulm@163.com> wrote:
> 
> Hi Pat,
> 	Because of a holiday, I am only replying now.
> 
> 	I changed 1.0.2 back to 1.0.1 in mahout-1.0-SNAPSHOT, and with Spark 1.0.1 and Hadoop 2.4.0, spark-itemsimilarity works OK. But I have a new question:
> 	mahout spark-itemsimilarity -i /view_input,/purchase_input -o /output -os -ma spark://recommend1:7077 -sem 15g -f1 purchase -f2 view -ic 2 -fc 1 -m 36
> 
> 	With "view" data of 1.6 GB and "purchase" data of 60 MB, this command has not finished after 15 hours (the "indicator-matrix" is computed and the "cross-indicator-matrix" is still computing), yet with 100 MB of "view" data it finishes in about 2 minutes. Is the amount of data the reason?
> 
> 
> On Oct 1, 2014, at 01:10, Pat Ferrel <pat@occamsmachete.com> wrote:
> 
>> This will not be fixed in Mahout 1.0 unless we can find a problem in Mahout now.
I am the one who would fix it. At present it looks to me like a Spark version or setup problem.
>> 
>> These errors seem to indicate that the build or setup has a problem. It seems that you cannot use Spark 1.1.0. Set up your cluster to use mahout-1.0-SNAPSHOT with the pom set back to spark-1.0.1, a Spark 1.0.1 build for Hadoop 2.4, and Hadoop 2.4. This is the only combination that is supposed to work together.
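>> Concretely, that means the spark.version property in mahout/pom.xml (the same one shown in the diff further down this thread) set like this:
>> 
>>     <spark.version>1.0.1</spark.version>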
>> 
>> If this still fails it may be a setup problem, since I can run on a cluster just fine with my setup. When you get an error from this config, send it to me and to the Spark user list to see if they can give us a clue.
>> 
>> Question: Do you have mahout-1.0-SNAPSHOT and Spark installed on all your cluster machines, with the correct environment variables and path?
>> 
>> 
>> On Sep 30, 2014, at 12:47 AM, pol <swallow_pulm@163.com> wrote:
>> 
>> Hi Pat, 
>> 	It was a Spark version problem, but spark-itemsimilarity still can't complete normally.
>> 
>> 1. Changing 1.0.1 to 1.1.0 in mahout-1.0-SNAPSHOT/pom.xml: the Spark version compatibility is no longer a problem, but the program fails:
>> --------------------------------------------------------------
>> 14/09/30 11:26:04 WARN scheduler.TaskSetManager: Lost task 1.0 in stage 10.1 (TID
31, Hadoop.Slave1): java.lang.NoClassDefFoundError:  
>>         org/apache/commons/math3/random/RandomGenerator
>>         org.apache.mahout.common.RandomUtils.getRandom(RandomUtils.java:65)
>>         org.apache.mahout.math.cf.SimilarityAnalysis$$anonfun$3.apply(SimilarityAnalysis.scala:228)
>>         org.apache.mahout.math.cf.SimilarityAnalysis$$anonfun$3.apply(SimilarityAnalysis.scala:223)
>>         org.apache.mahout.sparkbindings.blas.MapBlock$$anonfun$1.apply(MapBlock.scala:33)
>>         org.apache.mahout.sparkbindings.blas.MapBlock$$anonfun$1.apply(MapBlock.scala:32)
>>         scala.collection.Iterator$$anon$11.next(Iterator.scala:328)
>>         scala.collection.Iterator$$anon$13.hasNext(Iterator.scala:371)
>>         org.apache.spark.storage.MemoryStore.unrollSafely(MemoryStore.scala:235)
>>         org.apache.spark.CacheManager.putInBlockManager(CacheManager.scala:163)
>>         org.apache.spark.CacheManager.getOrCompute(CacheManager.scala:70)
>>         org.apache.spark.rdd.RDD.iterator(RDD.scala:227)
>>         org.apache.spark.rdd.MappedRDD.compute(MappedRDD.scala:31)
>>         org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:262)
>>         org.apache.spark.rdd.RDD.iterator(RDD.scala:229)
>>         org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:35)
>>         org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:262)
>>         org.apache.spark.rdd.RDD.iterator(RDD.scala:229)
>>         org.apache.spark.scheduler.ResultTask.runTask(ResultTask.scala:62)
>>         org.apache.spark.scheduler.Task.run(Task.scala:54)
>>         org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:177)
>>         java.util.concurrent.ThreadPoolExecutor$Worker.runTask(ThreadPoolExecutor.java:886)
>>         java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:908)
>>         java.lang.Thread.run(Thread.java:662)
>> --------------------------------------------------------------
>> I tried adding commons-math3-3.2.jar to mahout-1.0-SNAPSHOT/lib, but the result is the same. (RandomUtils.java:65 does not use RandomGenerator directly.)
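>> One way to check whether the class actually made it into the job jar that is shipped to the executors (just a sanity check, using the job jar path from my install):
>> 
>>     jar tf /usr/mahout-1.0-SNAPSHOT/spark/target/mahout-spark_2.10-1.0-SNAPSHOT-job.jar | grep org/apache/commons/math3/random/RandomGenerator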
>> 
>> 
>> 2. Changing 1.0.1 to 1.0.2 in mahout-1.0-SNAPSHOT/pom.xml: there are still other errors:
>> --------------------------------------------------------------
>> 14/09/30 14:36:57 WARN scheduler.TaskSetManager: Lost TID 427 (task 7.0:51)
>> 14/09/30 14:36:57 WARN scheduler.TaskSetManager: Loss was due to java.lang.ClassCastException
>> java.lang.ClassCastException: scala.Tuple1 cannot be cast to scala.Tuple2
>>         at org.apache.mahout.drivers.TDIndexedDatasetReader$$anonfun$4.apply(TextDelimitedReaderWriter.scala:75)
>>         at scala.collection.Iterator$$anon$11.next(Iterator.scala:328)
>>         at scala.collection.Iterator$$anon$11.next(Iterator.scala:328)
>>         at org.apache.spark.Aggregator.combineValuesByKey(Aggregator.scala:59)
>>         at org.apache.spark.rdd.PairRDDFunctions$$anonfun$1.apply(PairRDDFunctions.scala:96)
>>         at org.apache.spark.rdd.PairRDDFunctions$$anonfun$1.apply(PairRDDFunctions.scala:95)
>>         at org.apache.spark.rdd.RDD$$anonfun$15.apply(RDD.scala:594)
>>         at org.apache.spark.rdd.RDD$$anonfun$15.apply(RDD.scala:594)
>>         at org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:35)
>>         at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:262)
>>         at org.apache.spark.rdd.RDD.iterator(RDD.scala:229)
>>         at org.apache.spark.scheduler.ShuffleMapTask.runTask(ShuffleMapTask.scala:158)
>>         at org.apache.spark.scheduler.ShuffleMapTask.runTask(ShuffleMapTask.scala:99)
>>         at org.apache.spark.scheduler.Task.run(Task.scala:51)
>>         at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:183)
>>         at java.util.concurrent.ThreadPoolExecutor$Worker.runTask(ThreadPoolExecutor.java:886)
>>         at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:908)
>>         at java.lang.Thread.run(Thread.java:662)
>> --------------------------------------------------------------
>> Please refer to the attachment for full log.
>> <screenlog_bash.log>
>> 
>> 
>> 
>> In addition, I used 66 files on HDFS, each 20 to 30 MB; if necessary I can provide the data.
>> The shell command is: mahout spark-itemsimilarity -i /rec/input/ss/others,/rec/input/ss/weblog -o /rec/output/ss -os -ma spark://recommend1:7077 -sem 4g -f1 purchase -f2 view -ic 2 -fc 1
>> Spark cluster: 8 workers, 32 cores total, 32 GB memory total, on two machines.
>> 
>> If this isn't solved after a few more days, it may be better to wait for the Mahout 1.0 release or to use the existing mahout itemsimilarity job instead.
>> 
>> 
>> Thank you again, Pat.
>> 
>> 
>> On Sep 29, 2014, at 00:02, Pat Ferrel <pat@occamsmachete.com> wrote:
>> 
>>> It looks like the cluster version of spark-itemsimilarity is never accepted by the Spark master. It fails in TextDelimitedReaderWriter.scala because everything uses lazy evaluation, so no actual work is done on the Spark cluster until the write.
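>>> Roughly speaking, the driver only builds up a lazy plan and nothing touches the cluster until the first action; a sketch (parseLine is a stand-in, not the real Mahout code):
>>> 
>>>     val lines  = sc.textFile("hdfs://.../input")   // lazy: nothing runs yet
>>>     val tuples = lines.map(parseLine).distinct()   // still lazy, only the plan grows
>>>     tuples.collect()                               // first action: the cluster is contacted here,
>>>                                                    // so an unreachable master surfaces at this point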
>>> 
>>> However your cluster seems to be working with the Pi example. Therefore there
must be something wrong with the Mahout build or config. Some ideas:
>>> 
>>> 1) Mahout 1.0-SNAPSHOT is targeted for Spark 1.0.1.  However I use 1.0.2 and
it seems to work. You might try changing the version in the pom.xml and do a clean build of
Mahout. Change the version number in mahout/pom.xml
>>> 
>>> mahout/pom.xml
>>> -     <spark.version>1.0.1</spark.version>
>>> +    <spark.version>1.1.0</spark.version>
>>> 
>>> This may not be needed but it is easier than installing Spark 1.0.1.
>>> 
>>> 2) Try installing and building Mahout on all cluster machines. I do this so I can run the Mahout spark-shell on any machine, but it may also be required. The Mahout jars, path setup, and directory structure should be the same on all cluster machines (see the environment sketch after this list).
>>> 
>>> 3) Try making -sem larger. I usually make it as large as I can on the cluster and then try smaller values until it affects performance. The epinions dataset that I use for testing on my cluster requires -sem 6g.
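>>> For 2), a minimal environment sketch (the install paths are taken from your log; the HADOOP_CONF_DIR location is only an assumption, adjust to your layout):
>>> 
>>>     export MAHOUT_HOME=/usr/mahout-1.0-SNAPSHOT
>>>     export SPARK_HOME=/usr/spark-1.1.0-bin-2.4.1
>>>     export HADOOP_CONF_DIR=/etc/hadoop/conf   # assumption: wherever your Hadoop *-site.xml files live
>>>     export PATH=$PATH:$MAHOUT_HOME/bin:$SPARK_HOME/bin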
>>> 
>>> My cluster has 3 machines with Hadoop 1.2.1 and Spark 1.0.2.  I can try running
your data through spark-itemsimilarity on my cluster if you can share it. I will sign an NDA
and destroy it after the test.
>>> 
>>> 
>>> 
>>> On Sep 27, 2014, at 5:28 AM, pol <swallow_pulm@163.com> wrote:
>>> 
>>> Hi Pat,
>>> 	Thanks for your reply. It still does not work normally. I tested it on a Spark standalone cluster; I have not tested it on a YARN cluster.
>>> 
>>> First, I checked that the cluster configuration is correct. http://Hadoop.Master:8080 shows:
>>> -----------------------------------
>>> URL: spark://Hadoop.Master:7077
>>> Workers: 2
>>> Cores: 4 Total, 0 Used
>>> Memory: 2.0 GB Total, 0.0 B Used
>>> Applications: 0 Running, 1 Completed
>>> Drivers: 0 Running, 0 Completed
>>> Status: ALIVE
>>> ----------------------------------
>>> 
>>> Environment
>>> ----------------------------------
>>> OS: CentOS release 6.5 (Final)
>>> JDK: 1.6.0_45
>>> Mahout: mahout-1.0-SNAPSHOT(mvn -Dhadoop2.version=2.4.1 -DskipTests clean package)
>>> Hadoop: 2.4.1
>>> Spark: spark-1.1.0-bin-2.4.1(mvn -Pyarn -Phadoop-2.4 -Dhadoop.version=2.4.1 -Phive
-DskipTests clean package)
>>> ----------------------------------
>>> 
>>> Shell:
>>>      spark-submit --class org.apache.spark.examples.SparkPi --master spark://Hadoop.Master:7077 --executor-memory 1g --total-executor-cores 2 /root/spark-examples_2.10-1.1.0.jar 1000
>>> 
>>> It works OK; part of the log from that command:
>>> ----------------------------------
>>> 14/09/19 19:48:00 INFO scheduler.TaskSetManager: Finished task 995.0 in stage
0.0 (TID 995) in 17 ms on Hadoop.Slave1 (996/1000)
>>> 14/09/19 19:48:00 INFO scheduler.TaskSetManager: Starting task 998.0 in stage
0.0 (TID 998, Hadoop.Slave2, PROCESS_LOCAL, 1225 bytes)
>>> 14/09/19 19:48:00 INFO scheduler.TaskSetManager: Finished task 996.0 in stage
0.0 (TID 996) in 20 ms on Hadoop.Slave2 (997/1000)
>>> 14/09/19 19:48:00 INFO scheduler.TaskSetManager: Starting task 999.0 in stage
0.0 (TID 999, Hadoop.Slave1, PROCESS_LOCAL, 1225 bytes)
>>> 14/09/19 19:48:00 INFO scheduler.TaskSetManager: Finished task 997.0 in stage
0.0 (TID 997) in 27 ms on Hadoop.Slave1 (998/1000)
>>> 14/09/19 19:48:00 INFO scheduler.TaskSetManager: Finished task 998.0 in stage
0.0 (TID 998) in 31 ms on Hadoop.Slave2 (999/1000)
>>> 14/09/19 19:48:00 INFO scheduler.TaskSetManager: Finished task 999.0 in stage
0.0 (TID 999) in 20 ms on Hadoop.Slave1 (1000/1000)
>>> 14/09/19 19:48:00 INFO scheduler.DAGScheduler: Stage 0 (reduce at SparkPi.scala:35)
finished in 25.109 s
>>> 14/09/19 19:48:00 INFO scheduler.TaskSchedulerImpl: Removed TaskSet 0.0, whose
tasks have all completed, from pool
>>> 14/09/19 19:48:00 INFO spark.SparkContext: Job finished: reduce at SparkPi.scala:35,
took 26.156022565 s
>>> Pi is roughly 3.14156112
>>> ----------------------------------
>>> 
>>> Second, I tested spark-itemsimilarity with the "local" master; it works OK. Shell:
>>>      mahout spark-itemsimilarity -i /test/ss/input/data.txt -o /test/ss/output -os -ma local[2] -sem 512m -f1 purchase -f2 view -ic 2 -fc 1
>>> 
>>> Third, I tested spark-itemsimilarity against the cluster. Shell:
>>>      mahout spark-itemsimilarity -i /test/ss/input/data.txt -o /test/ss/output -os -ma spark://Hadoop.Master:7077 -sem 512m -f1 purchase -f2 view -ic 2 -fc 1
>>> 
>>> It does not work; the full log follows:
>>> ----------------------------------
>>> MAHOUT_LOCAL is not set; adding HADOOP_CONF_DIR to classpath.
>>> SLF4J: Class path contains multiple SLF4J bindings.
>>> SLF4J: Found binding in [jar:file:/usr/mahout-1.0-SNAPSHOT/mrlegacy/target/mahout-mrlegacy-1.0-SNAPSHOT-job.jar!/org/slf4j/impl/StaticLoggerBinder.class]
>>> SLF4J: Found binding in [jar:file:/usr/mahout-1.0-SNAPSHOT/spark/target/mahout-spark_2.10-1.0-SNAPSHOT-job.jar!/org/slf4j/impl/StaticLoggerBinder.class]
>>> SLF4J: Found binding in [jar:file:/usr/spark-1.1.0-bin-2.4.1/lib/spark-assembly-1.1.0-hadoop2.4.1.jar!/org/slf4j/impl/StaticLoggerBinder.class]
>>> SLF4J: See http://www.slf4j.org/codes.html#multiple_bindings for an explanation.
>>> SLF4J: Actual binding is of type [org.slf4j.impl.Log4jLoggerFactory]
>>> 14/09/19 20:31:07 INFO spark.SecurityManager: Changing view acls to: root
>>> 14/09/19 20:31:07 INFO spark.SecurityManager: SecurityManager: authentication
disabled; ui acls disabled; users with view permissions: Set(root)
>>> 14/09/19 20:31:08 INFO slf4j.Slf4jLogger: Slf4jLogger started
>>> 14/09/19 20:31:08 INFO Remoting: Starting remoting
>>> 14/09/19 20:31:08 INFO Remoting: Remoting started; listening on addresses :[akka.tcp://spark@Hadoop.Master:47597]
>>> 14/09/19 20:31:08 INFO Remoting: Remoting now listens on addresses: [akka.tcp://spark@Hadoop.Master:47597]
>>> 14/09/19 20:31:08 INFO spark.SparkEnv: Registering MapOutputTracker
>>> 14/09/19 20:31:08 INFO spark.SparkEnv: Registering BlockManagerMaster
>>> 14/09/19 20:31:08 INFO storage.DiskBlockManager: Created local directory at /tmp/spark-local-20140919203108-e4e3
>>> 14/09/19 20:31:08 INFO storage.MemoryStore: MemoryStore started with capacity
2.3 GB.
>>> 14/09/19 20:31:08 INFO network.ConnectionManager: Bound socket to port 47186
with id = ConnectionManagerId(Hadoop.Master,47186)
>>> 14/09/19 20:31:08 INFO storage.BlockManagerMaster: Trying to register BlockManager
>>> 14/09/19 20:31:08 INFO storage.BlockManagerInfo: Registering block manager Hadoop.Master:47186
with 2.3 GB RAM
>>> 14/09/19 20:31:08 INFO storage.BlockManagerMaster: Registered BlockManager
>>> 14/09/19 20:31:08 INFO spark.HttpServer: Starting HTTP Server
>>> 14/09/19 20:31:08 INFO server.Server: jetty-8.y.z-SNAPSHOT
>>> 14/09/19 20:31:08 INFO server.AbstractConnector: Started SocketConnector@0.0.0.0:41116
>>> 14/09/19 20:31:08 INFO broadcast.HttpBroadcast: Broadcast server started at http://192.168.204.128:41116
>>> 14/09/19 20:31:08 INFO spark.HttpFileServer: HTTP File server directory is /tmp/spark-10744709-bbeb-4d79-8bfe-d64d77799fb3
>>> 14/09/19 20:31:08 INFO spark.HttpServer: Starting HTTP Server
>>> 14/09/19 20:31:08 INFO server.Server: jetty-8.y.z-SNAPSHOT
>>> 14/09/19 20:31:08 INFO server.AbstractConnector: Started SocketConnector@0.0.0.0:59137
>>> 14/09/19 20:31:09 INFO server.Server: jetty-8.y.z-SNAPSHOT
>>> 14/09/19 20:31:09 INFO server.AbstractConnector: Started SelectChannelConnector@0.0.0.0:4040
>>> 14/09/19 20:31:09 INFO ui.SparkUI: Started SparkUI at http://Hadoop.Master:4040
>>> 14/09/19 20:31:10 WARN util.NativeCodeLoader: Unable to load native-hadoop library
for your platform... using builtin-java classes where applicable
>>> 14/09/19 20:31:10 INFO spark.SparkContext: Added JAR /usr/mahout-1.0-SNAPSHOT/math-scala/target/mahout-math-scala_2.10-1.0-SNAPSHOT.jar
at http://192.168.204.128:59137/jars/mahout-math-scala_2.10-1.0-SNAPSHOT.jar with timestamp
1411129870562
>>> 14/09/19 20:31:10 INFO spark.SparkContext: Added JAR /usr/mahout-1.0-SNAPSHOT/mrlegacy/target/mahout-mrlegacy-1.0-SNAPSHOT.jar
at http://192.168.204.128:59137/jars/mahout-mrlegacy-1.0-SNAPSHOT.jar with timestamp 1411129870588
>>> 14/09/19 20:31:10 INFO spark.SparkContext: Added JAR /usr/mahout-1.0-SNAPSHOT/math/target/mahout-math-1.0-SNAPSHOT.jar
at http://192.168.204.128:59137/jars/mahout-math-1.0-SNAPSHOT.jar with timestamp 1411129870612
>>> 14/09/19 20:31:10 INFO spark.SparkContext: Added JAR /usr/mahout-1.0-SNAPSHOT/spark/target/mahout-spark_2.10-1.0-SNAPSHOT.jar
at http://192.168.204.128:59137/jars/mahout-spark_2.10-1.0-SNAPSHOT.jar with timestamp 1411129870618
>>> 14/09/19 20:31:10 INFO spark.SparkContext: Added JAR /usr/mahout-1.0-SNAPSHOT/math-scala/target/mahout-math-scala_2.10-1.0-SNAPSHOT.jar
at http://192.168.204.128:59137/jars/mahout-math-scala_2.10-1.0-SNAPSHOT.jar with timestamp
1411129870620
>>> 14/09/19 20:31:10 INFO spark.SparkContext: Added JAR /usr/mahout-1.0-SNAPSHOT/mrlegacy/target/mahout-mrlegacy-1.0-SNAPSHOT.jar
at http://192.168.204.128:59137/jars/mahout-mrlegacy-1.0-SNAPSHOT.jar with timestamp 1411129870631
>>> 14/09/19 20:31:10 INFO spark.SparkContext: Added JAR /usr/mahout-1.0-SNAPSHOT/math/target/mahout-math-1.0-SNAPSHOT.jar
at http://192.168.204.128:59137/jars/mahout-math-1.0-SNAPSHOT.jar with timestamp 1411129870644
>>> 14/09/19 20:31:10 INFO spark.SparkContext: Added JAR /usr/mahout-1.0-SNAPSHOT/spark/target/mahout-spark_2.10-1.0-SNAPSHOT.jar
at http://192.168.204.128:59137/jars/mahout-spark_2.10-1.0-SNAPSHOT.jar with timestamp 1411129870647
>>> 14/09/19 20:31:10 INFO client.AppClient$ClientActor: Connecting to master spark://Hadoop.Master:7077...
>>> 14/09/19 20:31:13 INFO storage.MemoryStore: ensureFreeSpace(86126) called with
curMem=0, maxMem=2491102003
>>> 14/09/19 20:31:13 INFO storage.MemoryStore: Block broadcast_0 stored as values
to memory (estimated size 84.1 KB, free 2.3 GB)
>>> 14/09/19 20:31:13 INFO mapred.FileInputFormat: Total input paths to process :
1
>>> 14/09/19 20:31:13 INFO spark.SparkContext: Starting job: collect at TextDelimitedReaderWriter.scala:74
>>> 14/09/19 20:31:13 INFO scheduler.DAGScheduler: Registering RDD 7 (distinct at
TextDelimitedReaderWriter.scala:74)
>>> 14/09/19 20:31:13 INFO scheduler.DAGScheduler: Got job 0 (collect at TextDelimitedReaderWriter.scala:74)
with 2 output partitions (allowLocal=false)
>>> 14/09/19 20:31:13 INFO scheduler.DAGScheduler: Final stage: Stage 0(collect at
TextDelimitedReaderWriter.scala:74)
>>> 14/09/19 20:31:13 INFO scheduler.DAGScheduler: Parents of final stage: List(Stage
1)
>>> 14/09/19 20:31:13 INFO scheduler.DAGScheduler: Missing parents: List(Stage 1)
>>> 14/09/19 20:31:14 INFO scheduler.DAGScheduler: Submitting Stage 1 (MapPartitionsRDD[7]
at distinct at TextDelimitedReaderWriter.scala:74), which has no missing parents
>>> 14/09/19 20:31:14 INFO scheduler.DAGScheduler: Submitting 2 missing tasks from
Stage 1 (MapPartitionsRDD[7] at distinct at TextDelimitedReaderWriter.scala:74)
>>> 14/09/19 20:31:14 INFO scheduler.TaskSchedulerImpl: Adding task set 1.0 with
2 tasks
>>> 14/09/19 20:31:29 WARN scheduler.TaskSchedulerImpl: Initial job has not accepted
any resources; check your cluster UI to ensure that workers are registered and have sufficient
memory
>>> 14/09/19 20:31:30 INFO client.AppClient$ClientActor: Connecting to master spark://Hadoop.Master:7077...
>>> 14/09/19 20:31:44 WARN scheduler.TaskSchedulerImpl: Initial job has not accepted
any resources; check your cluster UI to ensure that workers are registered and have sufficient
memory
>>> 14/09/19 20:31:50 INFO client.AppClient$ClientActor: Connecting to master spark://Hadoop.Master:7077...
>>> 14/09/19 20:31:59 WARN scheduler.TaskSchedulerImpl: Initial job has not accepted
any resources; check your cluster UI to ensure that workers are registered and have sufficient
memory
>>> 14/09/19 20:32:10 ERROR cluster.SparkDeploySchedulerBackend: Application has
been killed. Reason: All masters are unresponsive! Giving up.
>>> 14/09/19 20:32:10 INFO scheduler.TaskSchedulerImpl: Removed TaskSet 1.0, whose
tasks have all completed, from pool
>>> 14/09/19 20:32:10 INFO scheduler.TaskSchedulerImpl: Cancelling stage 1
>>> 14/09/19 20:32:10 INFO scheduler.DAGScheduler: Failed to run collect at TextDelimitedReaderWriter.scala:74
>>> Exception in thread "main" org.apache.spark.SparkException: Job aborted due to
stage failure: All masters are unresponsive! Giving up.
>>> at org.apache.spark.scheduler.DAGScheduler.org$apache$spark$scheduler$DAGScheduler$$failJobAndIndependentStages(DAGScheduler.scala:1044)
>>> at org.apache.spark.scheduler.DAGScheduler$$anonfun$abortStage$1.apply(DAGScheduler.scala:1028)
>>> at org.apache.spark.scheduler.DAGScheduler$$anonfun$abortStage$1.apply(DAGScheduler.scala:1026)
>>> at scala.collection.mutable.ResizableArray$class.foreach(ResizableArray.scala:59)
>>> at scala.collection.mutable.ArrayBuffer.foreach(ArrayBuffer.scala:47)
>>> at org.apache.spark.scheduler.DAGScheduler.abortStage(DAGScheduler.scala:1026)
>>> at org.apache.spark.scheduler.DAGScheduler$$anonfun$handleTaskSetFailed$1.apply(DAGScheduler.scala:634)
>>> at org.apache.spark.scheduler.DAGScheduler$$anonfun$handleTaskSetFailed$1.apply(DAGScheduler.scala:634)
>>> at scala.Option.foreach(Option.scala:236)
>>> at org.apache.spark.scheduler.DAGScheduler.handleTaskSetFailed(DAGScheduler.scala:634)
>>> at org.apache.spark.scheduler.DAGSchedulerEventProcessActor$$anonfun$receive$2.applyOrElse(DAGScheduler.scala:1229)
>>> at akka.actor.ActorCell.receiveMessage(ActorCell.scala:498)
>>> at akka.actor.ActorCell.invoke(ActorCell.scala:456)
>>> at akka.dispatch.Mailbox.processMailbox(Mailbox.scala:237)
>>> at akka.dispatch.Mailbox.run(Mailbox.scala:219)
>>> at akka.dispatch.ForkJoinExecutorConfigurator$AkkaForkJoinTask.exec(AbstractDispatcher.scala:386)
>>> at scala.concurrent.forkjoin.ForkJoinTask.doExec(ForkJoinTask.java:260)
>>> at scala.concurrent.forkjoin.ForkJoinPool$WorkQueue.runTask(ForkJoinPool.java:1339)
>>> at scala.concurrent.forkjoin.ForkJoinPool.runWorker(ForkJoinPool.java:1979)
>>> at scala.concurrent.forkjoin.ForkJoinWorkerThread.run(ForkJoinWorkerThread.java:107)
>>> 14/09/19 20:32:10 INFO handler.ContextHandler: stopped o.e.j.s.ServletContextHandler{/metrics/json,null}
>>> 14/09/19 20:32:10 INFO handler.ContextHandler: stopped o.e.j.s.ServletContextHandler{/stages/stage/kill,null}
>>> 14/09/19 20:32:10 INFO handler.ContextHandler: stopped o.e.j.s.ServletContextHandler{/,null}
>>> 14/09/19 20:32:10 INFO handler.ContextHandler: stopped o.e.j.s.ServletContextHandler{/static,null}
>>> 14/09/19 20:32:10 INFO handler.ContextHandler: stopped o.e.j.s.ServletContextHandler{/executors/json,null}
>>> 14/09/19 20:32:10 INFO handler.ContextHandler: stopped o.e.j.s.ServletContextHandler{/executors,null}
>>> 14/09/19 20:32:10 INFO handler.ContextHandler: stopped o.e.j.s.ServletContextHandler{/environment/json,null}
>>> 14/09/19 20:32:10 INFO handler.ContextHandler: stopped o.e.j.s.ServletContextHandler{/environment,null}
>>> 14/09/19 20:32:10 INFO handler.ContextHandler: stopped o.e.j.s.ServletContextHandler{/storage/rdd/json,null}
>>> 14/09/19 20:32:10 INFO handler.ContextHandler: stopped o.e.j.s.ServletContextHandler{/storage/rdd,null}
>>> 14/09/19 20:32:10 INFO handler.ContextHandler: stopped o.e.j.s.ServletContextHandler{/storage/json,null}
>>> 14/09/19 20:32:10 INFO handler.ContextHandler: stopped o.e.j.s.ServletContextHandler{/storage,null}
>>> 14/09/19 20:32:10 INFO handler.ContextHandler: stopped o.e.j.s.ServletContextHandler{/stages/pool/json,null}
>>> 14/09/19 20:32:10 INFO handler.ContextHandler: stopped o.e.j.s.ServletContextHandler{/stages/pool,null}
>>> 14/09/19 20:32:10 INFO handler.ContextHandler: stopped o.e.j.s.ServletContextHandler{/stages/stage/json,null}
>>> 14/09/19 20:32:10 INFO handler.ContextHandler: stopped o.e.j.s.ServletContextHandler{/stages/stage,null}
>>> 14/09/19 20:32:10 INFO handler.ContextHandler: stopped o.e.j.s.ServletContextHandler{/stages/json,null}
>>> 14/09/19 20:32:10 INFO handler.ContextHandler: stopped o.e.j.s.ServletContextHandler{/stages,null}
>>> ----------------------------------
>>> 
>>> Thanks.
>>> 
>>> 
>>> 
>>> On Sep 27, 2014, at 01:05, Pat Ferrel <pat@occamsmachete.com> wrote:
>>> 
>>>> Any luck with this?
>>>> 
>>>> If not, could you send a full stack trace and check the cluster machines for other logs that might help?
>>>> 
>>>> 
>>>> On Sep 25, 2014, at 6:34 AM, Pat Ferrel <pat@occamsmachete.com> wrote:
>>>> 
>>>> Looks like a Spark error as far as I can tell. This error is very generic and indicates that the job was not accepted for execution, so Spark may be configured wrong. This looks like a question for the Spark people.
>>>> 
>>>> My Spark sanity check:
>>>> 
>>>> 1) In the Spark UI at http://Hadoop.Master:8080 does everything look correct?
>>>> 2) Have you tested your spark *cluster* with one of their examples? Have
you run *any non-Mahout* code on the cluster to check that it is configured properly? 
>>>> 3) Are you using exactly the same Spark and Hadoop locally as on the cluster?

>>>> 4) Did you launch both local and cluster jobs from the same cluster machine?
The only difference being the master URL (local[2] vs. spark://Hadoop.Master:7077)?
>>>> 
>>>> 14/09/22 04:12:47 WARN scheduler.TaskSchedulerImpl: Initial job has not accepted
any resources; check your cluster UI to ensure that workers are registered and have sufficient
memory
>>>> 14/09/22 04:12:49 INFO client.AppClient$ClientActor: Connecting to master
spark://Hadoop.Master:7077...
>>>> 
>>>> 
>>>> On Sep 24, 2014, at 8:18 PM, pol <swallow_pulm@163.com> wrote:
>>>> 
>>>> Hi, Pat
>>>> 	The dataset is the same, and the test data is very small. Is this a bug?
>>>> 
>>>> 
>>>> On Sep 25, 2014, at 02:57, Pat Ferrel <pat.ferrel@gmail.com> wrote:
>>>> 
>>>>> Are you using different data sets on the local and cluster?
>>>>> 
>>>>> Try increasing Spark memory with -sem; I use -sem 6g for the epinions data set.
>>>>> 
>>>>> The ID dictionaries are kept in-memory on each cluster machine so a large
number of user or item IDs will need more memory.
>>>>> 
>>>>> 
>>>>> On Sep 24, 2014, at 9:31 AM, pol <swallow_pulm@163.com> wrote:
>>>>> 
>>>>> Hi, All
>>>>> 	
>>>>> 	I'm sure launching the Spark standalone cluster works OK, but spark-itemsimilarity can't run on it.
>>>>> 
>>>>> 	Launching with the 'local' master works OK:
>>>>> mahout spark-itemsimilarity -i /user/root/test/input/data.txt -o /user/root/test/output -os -ma local[2] -f1 purchase -f2 view -ic 2 -fc 1 -sem 1g
>>>>> 
>>>>> 	but launching on the standalone cluster gives an error:
>>>>> mahout spark-itemsimilarity -i /user/root/test/input/data.txt -o /user/root/test/output -os -ma spark://Hadoop.Master:7077 -f1 purchase -f2 view -ic 2 -fc 1 -sem 1g
>>>>> ------------
>>>>> 14/09/22 04:12:47 WARN scheduler.TaskSchedulerImpl: Initial job has not
accepted any resources; check your cluster UI to ensure that workers are registered and have
sufficient memory
>>>>> 14/09/22 04:12:49 INFO client.AppClient$ClientActor: Connecting to master
spark://Hadoop.Master:7077...
>>>>> 14/09/22 04:13:02 WARN scheduler.TaskSchedulerImpl: Initial job has not
accepted any resources; check your cluster UI to ensure that workers are registered and have
sufficient memory
>>>>> 14/09/22 04:13:09 INFO client.AppClient$ClientActor: Connecting to master
spark://Hadoop.Master:7077...
>>>>> 14/09/22 04:13:17 WARN scheduler.TaskSchedulerImpl: Initial job has not
accepted any resources; check your cluster UI to ensure that workers are registered and have
sufficient memory
>>>>> 14/09/22 04:13:29 ERROR cluster.SparkDeploySchedulerBackend: Application
has been killed. Reason: All masters are unresponsive! Giving up.
>>>>> 14/09/22 04:13:29 INFO scheduler.TaskSchedulerImpl: Removed TaskSet 1.0,
whose tasks have all completed, from pool 
>>>>> 14/09/22 04:13:29 INFO scheduler.TaskSchedulerImpl: Cancelling stage
1
>>>>> 14/09/22 04:13:29 INFO scheduler.DAGScheduler: Failed to run collect
at TextDelimitedReaderWriter.scala:74
>>>>> Exception in thread "main" org.apache.spark.SparkException: Job aborted
due to stage failure: All masters are unresponsive! Giving up.
>>>>> 	at org.apache.spark.scheduler.DAGScheduler.org$apache$spark$scheduler$DAGScheduler$$failJobAndIndependentStages(DAGScheduler.scala:1044)
>>>>> 	at org.apache.spark.scheduler.DAGScheduler$$anonfun$abortStage$1.apply(DAGScheduler.scala:1028)
>>>>> 	at org.apache.spark.scheduler.DAGScheduler$$anonfun$abortStage$1.apply(DAGScheduler.scala:1026)
>>>>> 	at scala.collection.mutable.ResizableArray$class.foreach(ResizableArray.scala:59)
>>>>> 	at scala.collection.mutable.ArrayBuffer.foreach(ArrayBuffer.scala:47)
>>>>> 	at org.apache.spark.scheduler.DAGScheduler.abortStage(DAGScheduler.scala:1026)
>>>>> 	at org.apache.spark.scheduler.DAGScheduler$$anonfun$handleTaskSetFailed$1.apply(DAGScheduler.scala:634)
>>>>> 	at org.apache.spark.scheduler.DAGScheduler$$anonfun$handleTaskSetFailed$1.apply(DAGScheduler.scala:634)
>>>>> 	at scala.Option.foreach(Option.scala:236)
>>>>> 	at org.apache.spark.scheduler.DAGScheduler.handleTaskSetFailed(DAGScheduler.scala:634)
>>>>> 	at org.apache.spark.scheduler.DAGSchedulerEventProcessActor$$anonfun$receive$2.applyOrElse(DAGScheduler.scala:1229)
>>>>> 	at akka.actor.ActorCell.receiveMessage(ActorCell.scala:498)
>>>>> 	at akka.actor.ActorCell.invoke(ActorCell.scala:456)
>>>>> 	at akka.dispatch.Mailbox.processMailbox(Mailbox.scala:237)
>>>>> 	at akka.dispatch.Mailbox.run(Mailbox.scala:219)
>>>>> 	at akka.dispatch.ForkJoinExecutorConfigurator$AkkaForkJoinTask.exec(AbstractDispatcher.scala:386)
>>>>> 	at scala.concurrent.forkjoin.ForkJoinTask.doExec(ForkJoinTask.java:260)
>>>>> 	at scala.concurrent.forkjoin.ForkJoinPool$WorkQueue.runTask(ForkJoinPool.java:1339)
>>>>> 	at scala.concurrent.forkjoin.ForkJoinPool.runWorker(ForkJoinPool.java:1979)
>>>>> 	at scala.concurrent.forkjoin.ForkJoinWorkerThread.run(ForkJoinWorkerThread.java:107)
>>>>> ------------
>>>>> 
>>>>> Thanks.
>>>>> 
>>>>> 
>>>> 
>>>> 
>>>> 
>>>> 
>>> 
>>> 
>> 
>> 
> 
> 


