spark-user mailing list archives

From "Buttler, David" <buttl...@llnl.gov>
Subject RE:
Date Thu, 24 Apr 2014 01:10:38 GMT
This sounds like a configuration issue. Either you have not set the MASTER correctly, or
another process is using up all of the cores.
Dave
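
A minimal sketch of how the configuration could be checked from the master log quoted below. It assumes the log line formats shown in that message, handles only MB and GB memory units, and takes the client's master URL and the executor memory request as inputs. The function names and the 512 MB figure used in the note afterwards are illustrative assumptions, not values stated in this thread.

```python
import re

# Matches master log entries such as
#   "... Master: Registering worker hadoop-pg-5.cluster:7078 with 2 cores, 64.0 MB RAM"
# \s+ also spans the line wraps introduced by the mail client.
WORKER_RE = re.compile(
    r"Registering worker\s+(\S+):(\d+)\s+with\s+(\d+)\s+cores,"
    r"\s+([\d.]+)\s+(MB|GB)\s+RAM"
)
# Matches "... Master: Starting Spark master at spark://host:port"
MASTER_RE = re.compile(r"Starting Spark master\s+at\s+(spark://\S+)")


def parse_workers(log_text):
    """Return one dict per worker registration found in the master log."""
    workers = []
    for host, port, cores, amount, unit in WORKER_RE.findall(log_text):
        mem_mb = float(amount) * (1024 if unit == "GB" else 1)
        workers.append(
            {"host": host, "port": int(port), "cores": int(cores), "mem_mb": mem_mb}
        )
    return workers


def diagnose(log_text, client_master_url, executor_mem_mb):
    """Return a list of findings; an empty list means nothing obvious was found."""
    findings = []

    # 1. Does the URL passed by the client match what the master advertises?
    advertised = MASTER_RE.search(log_text)
    if advertised and advertised.group(1) != client_master_url:
        findings.append(
            f"client uses {client_master_url} but master advertises {advertised.group(1)}"
        )

    # 2. Did any worker register, and can each one host an executor?
    workers = parse_workers(log_text)
    if not workers:
        findings.append("no worker registrations found in the master log")
    for w in workers:
        if w["mem_mb"] < executor_mem_mb:
            findings.append(
                f"worker {w['host']}:{w['port']} offers {w['mem_mb']:.1f} MB, "
                f"less than the {executor_mem_mb} MB requested per executor"
            )
        if w["cores"] < 1:
            findings.append(f"worker {w['host']}:{w['port']} offers no cores")
    return findings
```

Called as `diagnose(master_log, "spark://hadoop-pg-5.cluster:7077", 512)` on the master log quoted below, this sketch would report the worker registered with 64.0 MB RAM as too small for a 512 MB executor. Whether 512 MB is what the application actually requests depends on the local Spark configuration. The sketch does not detect cores held by other applications; the master web UI is the place to check that.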

From: ge ko [mailto:koenig.ulm@gmail.com]
Sent: Sunday, April 13, 2014 12:51 PM
To: user@spark.apache.org
Subject:


Hi,

I'm just getting started with Spark and have installed the parcels on our CDH5 GA cluster.



Master: hadoop-pg-5.cluster, Worker: hadoop-pg-7.cluster

Since I was advised to use FQDNs, the settings above seem reasonable to me.



Both daemons are running, Master-Web-UI shows the connected worker, and the log entries show:

master:

2014-04-13 21:26:40,641 INFO Remoting: Starting remoting
2014-04-13 21:26:40,930 INFO Remoting: Remoting started; listening on addresses :[akka.tcp://sparkMaster@hadoop-pg-5.cluster:7077]
2014-04-13 21:26:41,356 INFO org.apache.spark.deploy.master.Master: Starting Spark master
at spark://hadoop-pg-5.cluster:7077
...

2014-04-13 21:26:41,439 INFO org.eclipse.jetty.server.AbstractConnector: Started SelectChannelConnector@0.0.0.0:18080
2014-04-13 21:26:41,441 INFO org.apache.spark.deploy.master.ui.MasterWebUI: Started Master
web UI at http://hadoop-pg-5.cluster:18080
2014-04-13 21:26:41,476 INFO org.apache.spark.deploy.master.Master: I have been elected leader!
New state: ALIVE

2014-04-13 21:27:40,319 INFO org.apache.spark.deploy.master.Master: Registering worker hadoop-pg-5.cluster:7078
with 2 cores, 64.0 MB RAM



worker:

2014-04-13 21:27:39,037 INFO akka.event.slf4j.Slf4jLogger: Slf4jLogger started
2014-04-13 21:27:39,136 INFO Remoting: Starting remoting
2014-04-13 21:27:39,413 INFO Remoting: Remoting started; listening on addresses :[akka.tcp://sparkWorker@hadoop-pg-7.cluster:7078]
2014-04-13 21:27:39,706 INFO org.apache.spark.deploy.worker.Worker: Starting Spark worker
hadoop-pg-7.cluster:7078 with 2 cores, 64.0 MB RAM
2014-04-13 21:27:39,708 INFO org.apache.spark.deploy.worker.Worker: Spark home: /opt/cloudera/parcels/CDH-5.0.0-1.cdh5.0.0.p0.47/lib/spark
...

2014-04-13 21:27:39,888 INFO org.eclipse.jetty.server.AbstractConnector: Started SelectChannelConnector@0.0.0.0:18081
2014-04-13 21:27:39,889 INFO org.apache.spark.deploy.worker.ui.WorkerWebUI: Started Worker
web UI at http://hadoop-pg-7.cluster:18081
2014-04-13 21:27:39,890 INFO org.apache.spark.deploy.worker.Worker: Connecting to master spark://hadoop-pg-5.cluster:7077...
2014-04-13 21:27:40,360 INFO org.apache.spark.deploy.worker.Worker: Successfully registered
with master spark://hadoop-pg-5.cluster:7077



Looks good so far.



Now I want to run the Python pi example with the following command (on the worker):

cd /opt/cloudera/parcels/CDH/lib/spark && ./bin/pyspark ./python/examples/pi.py spark://hadoop-pg-5.cluster:7077



Here the strange thing happens: the script never runs. It hangs, repeating this
output forever:



14/04/13 21:31:03 WARN TaskSchedulerImpl: Initial job has not accepted any resources; check
your cluster UI to ensure that workers are registered and have sufficient memory
14/04/13 21:31:18 WARN TaskSchedulerImpl: Initial job has not accepted any resources; check
your cluster UI to ensure that workers are registered and have sufficient memory



The whole log is:





14/04/13 21:30:44 INFO Slf4jLogger: Slf4jLogger started
14/04/13 21:30:45 INFO Remoting: Starting remoting
14/04/13 21:30:45 INFO Remoting: Remoting started; listening on addresses :[akka.tcp://spark@hadoop-pg-7.cluster:50601]
14/04/13 21:30:45 INFO Remoting: Remoting now listens on addresses: [akka.tcp://spark@hadoop-pg-7.cluster:50601]
14/04/13 21:30:45 INFO SparkEnv: Registering BlockManagerMaster
14/04/13 21:30:45 INFO DiskBlockManager: Created local directory at /tmp/spark-local-20140413213045-acec
14/04/13 21:30:45 INFO MemoryStore: MemoryStore started with capacity 294.9 MB.
14/04/13 21:30:45 INFO ConnectionManager: Bound socket to port 57506 with id = ConnectionManagerId(hadoop-pg-7.cluster,57506)
14/04/13 21:30:45 INFO BlockManagerMaster: Trying to register BlockManager
14/04/13 21:30:45 INFO BlockManagerMasterActor$BlockManagerInfo: Registering block manager
hadoop-pg-7.cluster:57506 with 294.9 MB RAM
14/04/13 21:30:45 INFO BlockManagerMaster: Registered BlockManager
14/04/13 21:30:45 INFO HttpServer: Starting HTTP Server
14/04/13 21:30:45 INFO HttpBroadcast: Broadcast server started at http://10.147.210.7:51224
14/04/13 21:30:45 INFO SparkEnv: Registering MapOutputTracker
14/04/13 21:30:45 INFO HttpFileServer: HTTP File server directory is /tmp/spark-f9ab98c8-2adf-460a-9099-6dc07c7dc89f
14/04/13 21:30:45 INFO HttpServer: Starting HTTP Server
14/04/13 21:30:46 INFO SparkUI: Started Spark Web UI at http://hadoop-pg-7.cluster:4040
14/04/13 21:30:46 INFO AppClient$ClientActor: Connecting to master spark://hadoop-pg-5.cluster:7077...
14/04/13 21:30:47 INFO SparkDeploySchedulerBackend: Connected to Spark cluster with app ID
app-20140413213046-0000
14/04/13 21:30:48 INFO SparkContext: Starting job: reduce at ./python/examples/pi.py:36
14/04/13 21:30:48 INFO DAGScheduler: Got job 0 (reduce at ./python/examples/pi.py:36) with
2 output partitions (allowLocal=false)
14/04/13 21:30:48 INFO DAGScheduler: Final stage: Stage 0 (reduce at ./python/examples/pi.py:36)
14/04/13 21:30:48 INFO DAGScheduler: Parents of final stage: List()
14/04/13 21:30:48 INFO DAGScheduler: Missing parents: List()
14/04/13 21:30:48 INFO DAGScheduler: Submitting Stage 0 (PythonRDD[1] at reduce at ./python/examples/pi.py:36),
which has no missing parents
14/04/13 21:30:48 INFO DAGScheduler: Submitting 2 missing tasks from Stage 0 (PythonRDD[1]
at reduce at ./python/examples/pi.py:36)
14/04/13 21:30:48 INFO TaskSchedulerImpl: Adding task set 0.0 with 2 tasks
14/04/13 21:31:03 WARN TaskSchedulerImpl: Initial job has not accepted any resources; check
your cluster UI to ensure that workers are registered and have sufficient memory
14/04/13 21:31:18 WARN TaskSchedulerImpl: Initial job has not accepted any resources; check
your cluster UI to ensure that workers are registered and have sufficient memory





So I have to cancel the script. When I do, the following log entries appear on the
master (at the moment the Python pi script is cancelled):



2014-04-13 21:30:46,965 INFO org.apache.spark.deploy.master.Master: Registering app PythonPi
2014-04-13 21:30:46,974 INFO org.apache.spark.deploy.master.Master: Registered app PythonPi
with ID app-20140413213046-0000
2014-04-13 21:31:27,123 INFO org.apache.spark.deploy.master.Master: akka.tcp://spark@hadoop-pg-7.cluster:50601
got disassociated, removing it.
2014-04-13 21:31:27,125 INFO org.apache.spark.deploy.master.Master: Removing app app-20140413213046-0000
2014-04-13 21:31:27,143 INFO org.apache.spark.deploy.master.Master: akka.tcp://spark@hadoop-pg-7.cluster:50601
got disassociated, removing it.
2014-04-13 21:31:27,144 INFO akka.actor.LocalActorRef: Message [akka.remote.transport.ActorTransportAdapter$DisassociateUnderlying]
from Actor[akka://sparkMaster/deadLetters] to Actor[akka://sparkMaster/system/transports/akkaprotocolmanager.tcp0/akkaProtocol-tcp%3A%2F%2FsparkMaster%4010.147.210.7%3A44207-2#-389971336]
was not delivered. [1] dead letters encountered. This logging can be turned off or adjusted
with configuration settings 'akka.log-dead-letters' and 'akka.log-dead-letters-during-shutdown'.
2014-04-13 21:31:27,194 ERROR akka.remote.EndpointWriter: AssociationError [akka.tcp://sparkMaster@hadoop-pg-5.cluster:7077]
-> [akka.tcp://spark@hadoop-pg-7.cluster:50601]: Error [Association failed with [akka.tcp://spark@hadoop-pg-7.cluster:50601]]
[
akka.remote.EndpointAssociationException: Association failed with [akka.tcp://spark@hadoop-pg-7.cluster:50601]
Caused by: akka.remote.transport.netty.NettyTransport$$anonfun$associate$1$$anon$2: Connection
refused: hadoop-pg-7.cluster/10.147.210.7:50601
]
2014-04-13 21:31:27,199 INFO org.apache.spark.deploy.master.Master: akka.tcp://spark@hadoop-pg-7.cluster:50601
got disassociated, removing it.
2014-04-13 21:31:27,215 ERROR akka.remote.EndpointWriter: AssociationError [akka.tcp://sparkMaster@hadoop-pg-5.cluster:7077]
-> [akka.tcp://spark@hadoop-pg-7.cluster:50601]: Error [Association failed with [akka.tcp://spark@hadoop-pg-7.cluster:50601]]
[
akka.remote.EndpointAssociationException: Association failed with [akka.tcp://spark@hadoop-pg-7.cluster:50601]
Caused by: akka.remote.transport.netty.NettyTransport$$anonfun$associate$1$$anon$2: Connection
refused: hadoop-pg-7.cluster/10.147.210.7:50601
]
2014-04-13 21:31:27,222 INFO org.apache.spark.deploy.master.Master: akka.tcp://spark@hadoop-pg-7.cluster:50601
got disassociated, removing it.
2014-04-13 21:31:27,234 ERROR akka.remote.EndpointWriter: AssociationError [akka.tcp://sparkMaster@hadoop-pg-5.cluster:7077]
-> [akka.tcp://spark@hadoop-pg-7.cluster:50601]: Error [Association failed with [akka.tcp://spark@hadoop-pg-7.cluster:50601]]
[
akka.remote.EndpointAssociationException: Association failed with [akka.tcp://spark@hadoop-pg-7.cluster:50601]
Caused by: akka.remote.transport.netty.NettyTransport$$anonfun$associate$1$$anon$2: Connection
refused: hadoop-pg-7.cluster/10.147.210.7:50601
]
2014-04-13 21:31:27,238 INFO org.apache.spark.deploy.master.Master: akka.tcp://spark@hadoop-pg-7.cluster:50601
got disassociated, removing it.





What is going wrong here?



I get the same behaviour if I start the spark-shell on the worker and try to execute e.g.
sc.parallelize(1 to 100,10).count



Any help is highly appreciated, Gerd









