giraph-dev mailing list archives

Site index · List index
Message view « Date » · « Thread »
Top « Date » · « Thread »
From Sam Garrett <...@actionx.com>
Subject Re: Superstep 0 finishes, but Superstep 1 doesn't start
Date Wed, 29 Jan 2014 17:39:42 GMT
>From the hadoop docs it looks like preemption is turned off by default:
https://hadoop.apache.org/docs/r1.2.1/fair_scheduler.html

I have not changed that setting on the EMR clusters I've been testing on.

I was able to get a toy version of this program working on EMR. The version
I was testing with before was running against 720mb of gzipped data.

> This is kind of strange. It appears that you lost your workers after
superstep 0. Is preemption turned on for your cluster?


On Mon, Jan 27, 2014 at 5:18 PM, Sam Garrett <sam@actionx.com> wrote:

> Hello.
>
> I've been working on a modified version of the clustered components
> example code. I've been running Giraph in Amazon EMR successfully with
> other versions of this code, however with this iteration I keep getting
> this error on the master:
>
> 2014-01-27 21:58:45,756 INFO org.apache.giraph.master.MasterThread
> (org.apache.giraph.master.MasterThread): masterThread: Coordination of
> superstep 0 took 76.184 seconds ended with state THIS_SUPERSTEP_DONE and is
> now on superstep 1
> 2014-01-27 21:59:17,811 INFO org.apache.giraph.master.BspServiceMaster
> (org.apache.giraph.master.MasterThread): checkWorkers: Only found 0
> responses of 15 needed to start superstep 1.  Reporting every 30000 msecs,
> 567946 more msecs left before giving up.
> 2014-01-27 21:59:47,817 INFO org.apache.giraph.master.BspServiceMaster
> (org.apache.giraph.master.MasterThread): checkWorkers: Only found 0
> responses of 15 needed to start superstep 1.  Reporting every 30000 msecs,
> 537940 more msecs left before giving up.
> 2014-01-27 22:00:17,822 INFO org.apache.giraph.master.BspServiceMaster
> (org.apache.giraph.master.MasterThread): checkWorkers: Only found 0
> responses of 15 needed to start superstep 1.  Reporting every 30000 msecs,
> 507935 more msecs left before giving up.
> 2014-01-27 22:00:47,828 INFO org.apache.giraph.master.BspServiceMaster
> (org.apache.giraph.master.MasterThread): checkWorkers: Only found 0
> responses of 15 needed to start superstep 1.  Reporting every 30000 msecs,
> 477929 more msecs left before giving up.
> 2014-01-27 22:01:17,834 INFO org.apache.giraph.master.BspServiceMaster
> (org.apache.giraph.master.MasterThread): checkWorkers: Only found 0
> responses of 15 needed to start superstep 1.  Reporting every 30000 msecs,
> 447923 more msecs left before giving up.
> 2014-01-27 22:01:47,840 INFO org.apache.giraph.master.BspServiceMaster
> (org.apache.giraph.master.MasterThread): checkWorkers: Only found 0
> responses of 15 needed to start superstep 1.  Reporting every 30000 msecs,
> 417917 more msecs left before giving up.
> 2014-01-27 22:02:17,848 INFO org.apache.giraph.master.BspServiceMaster
> (org.apache.giraph.master.MasterThread): checkWorkers: Only found 0
> responses of 15 needed to start superstep 1.  Reporting every 30000 msecs,
> 387909 more msecs left before giving up.
> 2014-01-27 22:02:47,853 INFO org.apache.giraph.master.BspServiceMaster
> (org.apache.giraph.master.MasterThread): checkWorkers: Only found 0
> responses of 15 needed to start superstep 1.  Reporting every 30000 msecs,
> 357904 more msecs left before giving up.
> 2014-01-27 22:03:17,858 INFO org.apache.giraph.master.BspServiceMaster
> (org.apache.giraph.master.MasterThread): checkWorkers: Only found 0
> responses of 15 needed to start superstep 1.  Reporting every 30000 msecs,
> 327899 more msecs left before giving up.
> 2014-01-27 22:03:47,863 INFO org.apache.giraph.master.BspServiceMaster
> (org.apache.giraph.master.MasterThread): checkWorkers: Only found 0
> responses of 15 needed to start superstep 1.  Reporting every 30000 msecs,
> 297894 more msecs left before giving up.
> 2014-01-27 22:04:17,870 INFO org.apache.giraph.master.BspServiceMaster
> (org.apache.giraph.master.MasterThread): checkWorkers: Only found 0
> responses of 15 needed to start superstep 1.  Reporting every 30000 msecs,
> 267887 more msecs left before giving up.
> 2014-01-27 22:04:47,876 INFO org.apache.giraph.master.BspServiceMaster
> (org.apache.giraph.master.MasterThread): checkWorkers: Only found 0
> responses of 15 needed to start superstep 1.  Reporting every 30000 msecs,
> 237881 more msecs left before giving up.
> 2014-01-27 22:05:17,881 INFO org.apache.giraph.master.BspServiceMaster
> (org.apache.giraph.master.MasterThread): checkWorkers: Only found 0
> responses of 15 needed to start superstep 1.  Reporting every 30000 msecs,
> 207876 more msecs left before giving up.
> 2014-01-27 22:05:47,887 INFO org.apache.giraph.master.BspServiceMaster
> (org.apache.giraph.master.MasterThread): checkWorkers: Only found 0
> responses of 15 needed to start superstep 1.  Reporting every 30000 msecs,
> 177870 more msecs left before giving up.
> 2014-01-27 22:06:17,892 INFO org.apache.giraph.master.BspServiceMaster
> (org.apache.giraph.master.MasterThread): checkWorkers: Only found 0
> responses of 15 needed to start superstep 1.  Reporting every 30000 msecs,
> 147865 more msecs left before giving up.
> 2014-01-27 22:06:47,897 INFO org.apache.giraph.master.BspServiceMaster
> (org.apache.giraph.master.MasterThread): checkWorkers: Only found 0
> responses of 15 needed to start superstep 1.  Reporting every 30000 msecs,
> 117860 more msecs left before giving up.
> 2014-01-27 22:07:17,930 INFO org.apache.giraph.master.BspServiceMaster
> (org.apache.giraph.master.MasterThread): checkWorkers: Only found 0
> responses of 15 needed to start superstep 1.  Reporting every 30000 msecs,
> 87827 more msecs left before giving up.
> 2014-01-27 22:07:47,936 INFO org.apache.giraph.master.BspServiceMaster
> (org.apache.giraph.master.MasterThread): checkWorkers: Only found 0
> responses of 15 needed to start superstep 1.  Reporting every 30000 msecs,
> 57821 more msecs left before giving up.
> 2014-01-27 22:08:17,941 INFO org.apache.giraph.master.BspServiceMaster
> (org.apache.giraph.master.MasterThread): checkWorkers: Only found 0
> responses of 15 needed to start superstep 1.  Reporting every 30000 msecs,
> 27816 more msecs left before giving up.
> 2014-01-27 22:08:47,946 INFO org.apache.giraph.master.BspServiceMaster
> (org.apache.giraph.master.MasterThread): checkWorkers: Only found 0
> responses of 15 needed to start superstep 1.  Reporting every 30000 msecs,
> -2189 more msecs left before giving up.
> 2014-01-27 22:08:47,946 ERROR org.apache.giraph.master.BspServiceMaster
> (org.apache.giraph.master.MasterThread): checkWorkers: Did not receive
> enough processes in time (only 0 of 15 required) after waiting
> 600000msecs).  This occurs if you do not have enough map tasks available
> simultaneously on your Hadoop instance to fulfill the number of requested
> workers.
> 2014-01-27 22:08:47,948 INFO org.apache.giraph.master.BspServiceMaster
> (org.apache.giraph.master.MasterThread): setJobState:
> {"_stateKey":"FAILED","_applicationAttemptKey":-1,"_superstepKey":-1} on
> superstep 1
> 2014-01-27 22:08:47,954 FATAL org.apache.giraph.master.BspServiceMaster
> (org.apache.giraph.master.MasterThread): failJob: Killing job
> job_201401272153_0001
> 2014-01-27 22:08:47,954 FATAL org.apache.giraph.master.BspServiceMaster
> (org.apache.giraph.master.MasterThread): failJob: exception
> java.lang.IllegalStateException: coordinateSuperstep: Not enough healthy
> workers for superstep 1
> 2014-01-27 22:08:48,022 ERROR org.apache.giraph.master.MasterThread
> (org.apache.giraph.master.MasterThread): masterThread: Master algorithm
> failed with NullPointerException
> java.lang.NullPointerException
>     at
> org.apache.giraph.comm.netty.NettyClient.connectAllAddresses(NettyClient.java:354)
>     at
> org.apache.giraph.comm.netty.NettyMasterClient.openConnections(NettyMasterClient.java:72)
>     at
> org.apache.giraph.master.BspServiceMaster.coordinateSuperstep(BspServiceMaster.java:1563)
>     at org.apache.giraph.master.MasterThread.run(MasterThread.java:115)
> 2014-01-27 22:08:48,023 FATAL org.apache.giraph.graph.GraphMapper
> (org.apache.giraph.master.MasterThread): uncaughtException:
> OverrideExceptionHandler on thread org.apache.giraph.master.MasterThread,
> msg = java.lang.NullPointerException, exiting...
> java.lang.IllegalStateException: java.lang.NullPointerException
>     at org.apache.giraph.master.MasterThread.run(MasterThread.java:185)
> Caused by: java.lang.NullPointerException
>     at
> org.apache.giraph.comm.netty.NettyClient.connectAllAddresses(NettyClient.java:354)
>     at
> org.apache.giraph.comm.netty.NettyMasterClient.openConnections(NettyMasterClient.java:72)
>     at
> org.apache.giraph.master.BspServiceMaster.coordinateSuperstep(BspServiceMaster.java:1563)
>     at org.apache.giraph.master.MasterThread.run(MasterThread.java:115)
> 2014-01-27 22:08:48,025 INFO org.apache.giraph.zk.ZooKeeperManager
> (Thread-16): run: Shutdown hook started.
> 2014-01-27 22:08:48,025 WARN org.apache.giraph.zk.ZooKeeperManager
> (Thread-16): onlineZooKeeperServers: Forced a shutdown hook kill of the
> ZooKeeper process.
>
> All of the worker nodes succesfully complete step 0.
>
> I tried adding more mappers (machines) to the cluster and that doesn't
> seem to prevent the error. I've been using a forked version of the latest
> trunk code on the apache giraph github repo. I'm not sure why the workers
> would be able to communicate properly for step 0 and then fail to do so
> when starting step 1.
>
> My current cluster configuration is 1 x m2.4xlarge for the master node and
> then 15 x m1.large for the core nodes. I'm also using
> giraph.useOutOfCoreMessages=true.
>
> For reference here is a log from one of the workers:
>
> 2014-01-27 21:57:29,532 INFO
> org.apache.giraph.comm.messages.InMemoryMessageStoreFactory (main):
> newStore: Created class
> org.apache.giraph.comm.messages.ByteArrayMessagesPerVertexStore for vertex
> id class org.apache.hadoop.io.Text and message value class
> org.apache.hadoop.io.Text and no combiner
> 2014-01-27 21:57:29,532 INFO
> org.apache.giraph.comm.messages.InMemoryMessageStoreFactory (main):
> newStore: Created class
> org.apache.giraph.comm.messages.ByteArrayMessagesPerVertexStore for vertex
> id class org.apache.hadoop.io.Text and message value class
> org.apache.hadoop.io.Text and no combiner
> 2014-01-27 21:57:29,550 WARN org.apache.giraph.bsp.BspService
> (main-EventThread): process: Unknown and unprocessed event
> (path=/_hadoopBsp/job_201401272153_0001/_applicationAttemptsDir/0/_superstepDir,
> type=NodeChildrenChanged, state=SyncConnected)
> 2014-01-27 21:57:29,654 INFO org.apache.giraph.worker.BspServiceWorker
> (main): registerHealth: Created my health node for attempt=0, superstep=0
> with
> /_hadoopBsp/job_201401272153_0001/_applicationAttemptsDir/0/_superstepDir/0/_workerHealthyDir/ip-10-9-140-146.ec2.internal_1
> and workerInfo= Worker(hostname=ip-10-9-140-146.ec2.internal, MRtaskID=1,
> port=30001)
> 2014-01-27 21:57:29,696 INFO org.apache.giraph.bsp.BspService
> (main-EventThread): process: partitionAssignmentsReadyChanged (partitions
> are assigned)
> 2014-01-27 21:57:29,715 INFO org.apache.giraph.worker.BspServiceWorker
> (main): startSuperstep: Master(hostname=ip-10-238-165-32.ec2.internal,
> MRtaskID=0, port=30000)
> 2014-01-27 21:57:29,715 INFO org.apache.giraph.worker.BspServiceWorker
> (main): startSuperstep: Ready for computation on superstep 0 since worker
> selection and vertex range assignments are done in
> /_hadoopBsp/job_201401272153_0001/_applicationAttemptsDir/0/_superstepDir/0/_addressesAndPartitions
> 2014-01-27 21:57:29,715 INFO org.apache.giraph.comm.netty.NettyClient
> (main): connectAllAddresses: Successfully added 0 connections, (0 total
> connected) 0 failed, 0 failures total.
> 2014-01-27 21:57:29,719 INFO org.apache.giraph.comm.netty.NettyClient
> (main): waitAllRequests: Finished all requests. MBytes/sec sent = 0.0001,
> MBytes/sec received = 0, MBytesSent = 0, MBytesReceived = 0, ave sent req
> MBytes = 0, ave received req MBytes = 0, secs waited = 0.31
> 2014-01-27 21:57:29,726 INFO org.apache.giraph.worker.BspServiceWorker
> (main): sendWorkerPartitions: Done sending all my partitions.
> 2014-01-27 21:57:29,738 INFO org.apache.giraph.worker.BspServiceWorker
> (main): exchangeVertexPartitions: Done with exchange.
> 2014-01-27 21:57:29,745 INFO org.apache.giraph.utils.TaskIdsPermitsBarrier
> (main): waitForRequiredPermits: Waiting for 1 more tasks to send their
> aggregator data, task ids: [4]
> 2014-01-27 21:57:29,747 INFO org.apache.giraph.graph.GraphTaskManager
> (main): execute: 15 partitions to process with 1 compute thread(s),
> originally 1 thread(s) on superstep 0
> 2014-01-27 21:57:31,649 INFO
> org.apache.giraph.comm.netty.handler.RequestDecoder (netty-server-exec-3):
> decode: Server window metrics MBytes/sec sent = 0.0014, MBytes/sec received
> = 1.3229, MBytesSent = 0.042, MBytesReceived = 39.6883, ave sent req MBytes
> = 0, ave received req MBytes = 0.0139, secs waited = 30.0
> 2014-01-27 21:58:04,507 INFO
> org.apache.giraph.comm.netty.handler.RequestDecoder (netty-server-exec-7):
> decode: Server window metrics MBytes/sec sent = 0.0348, MBytes/sec received
> = 7.4575, MBytesSent = 1.1421, MBytesReceived = 245.0474, ave sent req
> MBytes = 0, ave received req MBytes = 0.0137, secs waited = 32.858
> 2014-01-27 21:58:05,725 INFO org.apache.giraph.graph.ComputeCallable
> (compute-0): call: Completed 6 partitions, 9 remaining Memory
> (free/total/max) = 766.17M / 1939.00M / 3641.00M
> 2014-01-27 21:58:32,284 INFO org.apache.giraph.utils.ProgressableUtils
> (main): waitFor: Future result not ready yet
> java.util.concurrent.FutureTask@7c2bc94a
> 2014-01-27 21:58:32,292 INFO org.apache.giraph.utils.ProgressableUtils
> (main): waitFor: Waiting for
> org.apache.giraph.utils.ProgressableUtils$FutureWaitable@3ae3f711
> 2014-01-27 21:58:34,508 INFO
> org.apache.giraph.comm.netty.handler.RequestDecoder (netty-server-exec-1):
> decode: Server window metrics MBytes/sec sent = 0.0325, MBytes/sec received
> = 7.2723, MBytesSent = 0.9743, MBytesReceived = 218.1846, ave sent req
> MBytes = 0, ave received req MBytes = 0.0191, secs waited = 30.001
> 2014-01-27 21:58:36,622 INFO org.apache.giraph.graph.ComputeCallable
> (compute-0): call: Completed 14 partitions, 1 remaining Memory
> (free/total/max) = 872.90M / 2989.00M / 3641.00M
> 2014-01-27 21:58:38,073 INFO org.apache.giraph.graph.ComputeCallable
> (compute-0): call: Computation took 68.32348 secs for 15 partitions on
> superstep 0.  Flushing started
> 2014-01-27 21:58:38,095 INFO org.apache.giraph.worker.BspServiceWorker
> (main): finishSuperstep: Waiting on all requests, superstep 0 Memory
> (free/total/max) = 716.70M / 2989.00M / 3641.00M
> 2014-01-27 21:58:38,095 INFO org.apache.giraph.comm.netty.NettyClient
> (main): logInfoAboutOpenRequests: Waiting interval of 15000 msecs, 5568
> open requests, waiting for it to be <= 0, MBytes/sec sent = 7.6591,
> MBytes/sec received = 0.0349, MBytesSent = 466.4169, MBytesReceived =
> 2.1229, ave sent req MBytes = 0.0026, ave received req MBytes = 0, secs
> waited = 60.896
> 2014-01-27 21:58:38,101 INFO org.apache.giraph.comm.netty.NettyClient
> (main): logInfoAboutOpenRequests: 2808 requests for taskId=8, 2641 requests
> for taskId=11, 104 requests for taskId=3, 1 requests for taskId=5, 1
> requests for taskId=14,
> 2014-01-27 21:58:43,728 INFO org.apache.giraph.comm.netty.NettyClient
> (main): waitAllRequests: Finished all requests. MBytes/sec sent = 7.0106,
> MBytes/sec received = 0.0329, MBytesSent = 466.4169, MBytesReceived =
> 2.1919, ave sent req MBytes = 0.0026, ave received req MBytes = 0, secs
> waited = 66.529
> 2014-01-27 21:58:43,728 INFO
> org.apache.giraph.worker.WorkerAggregatorHandler (main): finishSuperstep:
> Start gathering aggregators, workers will send their aggregated values once
> they are done with superstep computation
> 2014-01-27 21:58:43,768 INFO org.apache.giraph.utils.TaskIdsPermitsBarrier
> (main): waitForRequiredPermits: Waiting for 3 more tasks to send their
> aggregator data, task ids: [11, 13, 14]
> 2014-01-27 21:58:45,673 INFO org.apache.giraph.comm.netty.NettyClient
> (main): waitAllRequests: Finished all requests. MBytes/sec sent = 0.0048,
> MBytes/sec received = 0.0025, MBytesSent = 0, MBytesReceived = 0, ave sent
> req MBytes = 0, ave received req MBytes = 0, secs waited = 0.005
> 2014-01-27 21:58:45,673 INFO org.apache.giraph.worker.BspServiceWorker
> (main): finishSuperstep: Superstep 0, messages = 0 , message bytes = 0 ,
> Memory (free/total/max) = 1093.58M / 2989.00M / 3641.00M
> 2014-01-27 21:58:45,682 INFO org.apache.giraph.worker.BspServiceWorker
> (main): finishSuperstep: (waiting for rest of workers) WORKER_ONLY -
> Attempt=0, Superstep=0
> 2014-01-27 21:58:45,721 INFO org.apache.giraph.bsp.BspService
> (main-EventThread): process: superstepFinished signaled
> 2014-01-27 21:58:45,744 INFO org.apache.giraph.worker.BspServiceWorker
> (main): finishSuperstep: Completed superstep 0 with global stats
> (vtx=364766,finVtx=0,edges=3815107,msgCount=0,msgBytesCount=0,haltComputation=false)
> and classes
> (computation=org.apache.giraph.examples.ReflexiveConnectedStringComponentsComputation,combiner=null)
> 2014-01-27 21:58:45,744 INFO
> org.apache.giraph.comm.messages.InMemoryMessageStoreFactory (main):
> newStore: Created class
> org.apache.giraph.comm.messages.ByteArrayMessagesPerVertexStore for vertex
> id class org.apache.hadoop.io.Text and message value class
> org.apache.hadoop.io.Text and no combiner
> 2014-01-27 21:58:52,752 ERROR org.apache.giraph.worker.BspServiceWorker
> (main): unregisterHealth: Got failure, unregistering health on
> /_hadoopBsp/job_201401272153_0001/_applicationAttemptsDir/0/_superstepDir/0/_workerHealthyDir/ip-10-9-140-146.ec2.internal_1
> on superstep 1
> 2014-01-27 21:58:52,758 INFO org.apache.hadoop.mapred.Task (main):
> Task:attempt_201401272153_0001_m_000001_0 is done. And is in the process of
> commiting
> 2014-01-27 21:58:54,846 INFO org.apache.hadoop.mapred.Task (main): Task
> 'attempt_201401272153_0001_m_000001_0' done.
>
> Thank you for your time and help.
>
> --
> Sam
>



-- 
Sam Garrett
ActionX, NYC, 260 820 0394

Mime
  • Unnamed multipart/alternative (inline, None, 0 bytes)
View raw message