From: "Pat Ferrel (JIRA)"
To: dev@mahout.apache.org
Date: Mon, 14 Apr 2014 17:20:18 +0000 (UTC)
Subject: [jira] [Comment Edited] (MAHOUT-1464) Cooccurrence Analysis on Spark

[ https://issues.apache.org/jira/browse/MAHOUT-1464?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13968537#comment-13968537 ]

Pat Ferrel edited comment on MAHOUT-1464 at 4/14/14 5:18 PM:
-------------------------------------------------------------

OK, I have a cluster set up, but first I tried locally on my laptop. I installed the latest Spark 0.9.1 (not the 0.9.0 called for in the pom; assuming this is OK), which uses Scala 2.10. BTW the object RunCrossCooccurrenceAnalysisOnEpinions has an incorrect usage println in a comment -- it names the wrong object. I never see the printlns, I assume because I'm not launching from the Spark shell???
    println("Usage: RunCooccurrenceAnalysisOnMovielens1M ")

This leads me to believe that you launch from the Spark Scala shell?? Anyway, I tried the method called out in the Spark docs for CLI execution, shown below, and execute RunCrossCooccurrenceAnalysisOnEpinions via a bash script. Not sure where to look for output. The code says:

    RecommendationExamplesHelper.saveIndicatorMatrix(indicatorMatrices(0),
        "/tmp/co-occurrence-on-epinions/indicators-item-item/")
    RecommendationExamplesHelper.saveIndicatorMatrix(indicatorMatrices(1),
        "/tmp/co-occurrence-on-epinions/indicators-trust-item/")

I assume this is in the local fs since the data came from there? I see the Spark pids there but no temp data.

Here's how I ran it. Put the data in the local fs:

    Maclaurin:mahout pat$ ls -al ~/hdfs-mirror/xrsj/
    total 29320
    drwxr-xr-x   4 pat  staff      136 Apr 14 09:01 .
    drwxr-xr-x  10 pat  staff      340 Apr 14 09:00 ..
    -rw-r--r--   1 pat  staff  8650128 Apr 14 09:01 ratings_data.txt
    -rw-r--r--   1 pat  staff  6357397 Apr 14 09:01 trust_data.txt

Start up Spark on localhost; the web UI says all is well. Run the xrsj on local data via the shell script attached. The driver runs and creates a worker, which runs for quite a while, but the log says there was an ERROR.
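On "not sure where to look for output", a hedged sketch of two places worth checking. The /tmp paths are the ones from the code above; the work/driver-*/stdout location is my assumption about where driver printlns land when launching through org.apache.spark.deploy.Client in standalone mode, not something confirmed in this thread:

```python
# Sketch: check the save paths from the code, and the standalone worker's
# work dir, where driver stdout (the printlns) should end up.
import glob
import os


def check_outputs(paths):
    """Return (path, exists) for each expected indicator output dir."""
    return [(p, os.path.isdir(p)) for p in paths]


def driver_stdout_logs(spark_home):
    """Candidate driver stdout files under the worker's work dir (assumption)."""
    return sorted(glob.glob(os.path.join(spark_home, "work", "driver-*", "stdout")))


if __name__ == "__main__":
    for p, ok in check_outputs([
        "/tmp/co-occurrence-on-epinions/indicators-item-item/",
        "/tmp/co-occurrence-on-epinions/indicators-trust-item/",
    ]):
        print("found" if ok else "missing", p)
    for f in driver_stdout_logs("/Users/pat/spark-0.9.1-bin-hadoop1"):
        print("driver log:", f)
```

Since the bare /tmp save paths are not URIs, they would resolve on whatever filesystem the worker's Hadoop config defaults to, so the output could land on either the worker's local fs or HDFS depending on that config.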
    Maclaurin:mahout pat$ cat /Users/pat/spark-0.9.1-bin-hadoop1/sbin/../logs/spark-pat-org.apache.spark.deploy.worker.Worker-1-
    spark-pat-org.apache.spark.deploy.worker.Worker-1-Maclaurin.local.out
    spark-pat-org.apache.spark.deploy.worker.Worker-1-Maclaurin.local.out.2
    spark-pat-org.apache.spark.deploy.worker.Worker-1-Maclaurin.local.out.1
    spark-pat-org.apache.spark.deploy.worker.Worker-1-occam4.out
    Maclaurin:mahout pat$ cat /Users/pat/spark-0.9.1-bin-hadoop1/sbin/../logs/spark-pat-org.apache.spark.deploy.worker.Worker-1-Maclaurin.local.out
    Spark Command: /System/Library/Java/JavaVirtualMachines/1.6.0.jdk/Contents/Home/bin/java -cp :/Users/pat/spark-0.9.1-bin-hadoop1/conf:/Users/pat/spark-0.9.1-bin-hadoop1/assembly/target/scala-2.10/spark-assembly_2.10-0.9.1-hadoop1.0.4.jar -Dspark.akka.logLifecycleEvents=true -Djava.library.path= -Xms512m -Xmx512m org.apache.spark.deploy.worker.Worker spark://Maclaurin.local:7077
    ========================================
    log4j:WARN No appenders could be found for logger (akka.event.slf4j.Slf4jLogger).
    log4j:WARN Please initialize the log4j system properly.
    log4j:WARN See http://logging.apache.org/log4j/1.2/faq.html#noconfig for more info.
    14/04/14 09:26:00 INFO Worker: Using Spark's default log4j profile: org/apache/spark/log4j-defaults.properties
    14/04/14 09:26:00 INFO Worker: Starting Spark worker 192.168.0.2:52068 with 8 cores, 15.0 GB RAM
    14/04/14 09:26:00 INFO Worker: Spark home: /Users/pat/spark-0.9.1-bin-hadoop1
    14/04/14 09:26:00 INFO WorkerWebUI: Started Worker web UI at http://192.168.0.2:8081
    14/04/14 09:26:00 INFO Worker: Connecting to master spark://Maclaurin.local:7077...
    14/04/14 09:26:00 INFO Worker: Successfully registered with master spark://Maclaurin.local:7077
    14/04/14 09:26:19 INFO Worker: Asked to launch driver driver-20140414092619-0000
    2014-04-14 09:26:19.947 java[53509:9407] Unable to load realm info from SCDynamicStore
    14/04/14 09:26:20 INFO DriverRunner: Copying user jar file:/Users/pat/mahout/spark/target/mahout-spark-1.0-SNAPSHOT.jar to /Users/pat/spark-0.9.1-bin-hadoop1/work/driver-20140414092619-0000/mahout-spark-1.0-SNAPSHOT.jar
    14/04/14 09:26:20 INFO DriverRunner: Launch Command: "/System/Library/Java/JavaVirtualMachines/1.6.0.jdk/Contents/Home/bin/java" "-cp" ":/Users/pat/spark-0.9.1-bin-hadoop1/work/driver-20140414092619-0000/mahout-spark-1.0-SNAPSHOT.jar:/Users/pat/spark-0.9.1-bin-hadoop1/conf:/Users/pat/spark-0.9.1-bin-hadoop1/assembly/target/scala-2.10/spark-assembly_2.10-0.9.1-hadoop1.0.4.jar:/usr/local/hadoop/conf" "-Xms512M" "-Xmx512M" "org.apache.spark.deploy.worker.DriverWrapper" "akka.tcp://sparkWorker@192.168.0.2:52068/user/Worker" "RunCrossCooccurrenceAnalysisOnEpinions" "file://Users/pat/hdfs-mirror/xrsj"
    14/04/14 09:26:21 ERROR OneForOneStrategy: FAILED (of class scala.Enumeration$Val)
    scala.MatchError: FAILED (of class scala.Enumeration$Val)
        at org.apache.spark.deploy.worker.Worker$$anonfun$receive$1.applyOrElse(Worker.scala:277)
        at akka.actor.ActorCell.receiveMessage(ActorCell.scala:498)
        at akka.actor.ActorCell.invoke(ActorCell.scala:456)
        at akka.dispatch.Mailbox.processMailbox(Mailbox.scala:237)
        at akka.dispatch.Mailbox.run(Mailbox.scala:219)
        at akka.dispatch.ForkJoinExecutorConfigurator$AkkaForkJoinTask.exec(AbstractDispatcher.scala:386)
        at scala.concurrent.forkjoin.ForkJoinTask.doExec(ForkJoinTask.java:260)
        at scala.concurrent.forkjoin.ForkJoinPool$WorkQueue.runTask(ForkJoinPool.java:1339)
        at scala.concurrent.forkjoin.ForkJoinPool.runWorker(ForkJoinPool.java:1979)
        at scala.concurrent.forkjoin.ForkJoinWorkerThread.run(ForkJoinWorkerThread.java:107)
    14/04/14 09:26:21 INFO Worker: Starting Spark worker 192.168.0.2:52068 with 8 cores, 15.0 GB RAM
    14/04/14 09:26:21 INFO Worker: Spark home: /Users/pat/spark-0.9.1-bin-hadoop1
    14/04/14 09:26:21 INFO WorkerWebUI: Started Worker web UI at http://192.168.0.2:8081
    14/04/14 09:26:21 INFO Worker: Connecting to master spark://Maclaurin.local:7077...
    14/04/14 09:26:21 INFO Worker: Successfully registered with master spark://Maclaurin.local:7077

was (Author: pferrel):
[... identical to the edited comment above, except that this earlier revision pasted the launch script inline:]
Run the xrsj on local data via shell script:

    #!/usr/bin/env bash
    # ./bin/spark-class org.apache.spark.deploy.Client launch
    #    [client-options] \
    #    <cluster-url> <application-jar-url> <main-class> \
    #    [application-options]
    #
    # cluster-url: The URL of the master node.
    # application-jar-url: Path to a bundled jar including your application and all dependencies.
    #    Currently, the URL must be globally visible inside of your cluster, for instance,
    #    an `hdfs://` path or a `file://` path that is present on all nodes.
    # main-class: The entry point for your application.
    #
    # Client Options:
    #    --memory (amount of memory, in MB, allocated for your driver program)
    #    --cores (number of cores allocated for your driver program)
    #    --supervise (whether to automatically restart your driver on application or node failure)
    #    --verbose (prints increased logging output)

    # RunCrossCooccurrenceAnalysisOnEpinions
    # Mahout Spark Jar from 'mvn package'
    /Users/pat/spark-0.9.1-bin-hadoop1/bin/spark-class org.apache.spark.deploy.Client launch \
        spark://Maclaurin.local:7077 \
        file:///Users/pat/mahout/spark/target/mahout-spark-1.0-SNAPSHOT.jar \
        RunCrossCooccurrenceAnalysisOnEpinions \
        file://Users/pat/hdfs-mirror/xrsj

The driver runs and creates a worker, which runs for quite a while, but the log says there was an ERROR.
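One thing worth double-checking in that launch (an observation, not a confirmed cause of the ERROR): the data URL `file://Users/pat/hdfs-mirror/xrsj` has two slashes while the jar URL has three. In `scheme://authority/path` syntax the first segment after `//` is the authority, so the two-slash form names a host called `Users` rather than the path `/Users/...`. The standard library shows the difference:

```python
# Demonstration of file:// URI parsing: with two slashes, the first path
# segment is consumed as the authority (host); with three it stays in the path.
from urllib.parse import urlparse

two_slash = urlparse("file://Users/pat/hdfs-mirror/xrsj")
three_slash = urlparse("file:///Users/pat/hdfs-mirror/xrsj")

print(two_slash.netloc, two_slash.path)      # Users /pat/hdfs-mirror/xrsj
print(three_slash.netloc, three_slash.path)  #       /Users/pat/hdfs-mirror/xrsj
```

Whether Hadoop's Path resolution chokes on the `Users` authority the same way is an assumption on my part, but `file:///Users/pat/hdfs-mirror/xrsj` would be the safer form either way.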
> Cooccurrence Analysis on Spark
> ------------------------------
>
>                 Key: MAHOUT-1464
>                 URL: https://issues.apache.org/jira/browse/MAHOUT-1464
>             Project: Mahout
>          Issue Type: Improvement
>          Components: Collaborative Filtering
>         Environment: hadoop, spark
>            Reporter: Pat Ferrel
>            Assignee: Sebastian Schelter
>             Fix For: 1.0
>
>         Attachments: MAHOUT-1464.patch, MAHOUT-1464.patch, MAHOUT-1464.patch, MAHOUT-1464.patch, MAHOUT-1464.patch, MAHOUT-1464.patch, run-spark-xrsj.sh
>
> Create a version of Cooccurrence Analysis (RowSimilarityJob with LLR) that runs on Spark. This should be compatible with the Mahout Spark DRM DSL so a DRM can be used as input.
> Ideally this would extend to cover MAHOUT-1422. This cross-cooccurrence has several applications including cross-action recommendations.

--
This message was sent by Atlassian JIRA
(v6.2#6252)
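As background on the "RowSimilarityJob with LLR" mentioned in the issue summary, here is a minimal sketch of the log-likelihood ratio score (Dunning's G-squared test on the 2x2 contingency table of two items' cooccurrence counts). This is my own illustration, not code from the patch; it follows the entropy formulation used by Mahout's LogLikelihood class, to the best of my knowledge:

```python
# LLR on a 2x2 table: k11 = both items seen together, k12/k21 = one item
# without the other, k22 = neither item.
import math


def x_log_x(x: float) -> float:
    return x * math.log(x) if x > 0 else 0.0


def entropy(*counts: float) -> float:
    """Unnormalized Shannon entropy of a set of counts."""
    return x_log_x(sum(counts)) - sum(x_log_x(k) for k in counts)


def llr(k11: float, k12: float, k21: float, k22: float) -> float:
    row = entropy(k11 + k12, k21 + k22)
    col = entropy(k11 + k21, k12 + k22)
    mat = entropy(k11, k12, k21, k22)
    return 2.0 * (row + col - mat)


print(llr(10, 10, 10, 10))   # independent counts -> ~0
print(llr(100, 1, 1, 100))   # strong cooccurrence -> large positive score
```

Items whose score clears a threshold (or the top-k per row) become the entries of the indicator matrices the example code saves.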