Return-Path: X-Original-To: apmail-giraph-user-archive@www.apache.org Delivered-To: apmail-giraph-user-archive@www.apache.org Received: from mail.apache.org (hermes.apache.org [140.211.11.3]) by minotaur.apache.org (Postfix) with SMTP id 49493109B7 for ; Wed, 4 Dec 2013 09:35:57 +0000 (UTC) Received: (qmail 27544 invoked by uid 500); 4 Dec 2013 09:35:51 -0000 Delivered-To: apmail-giraph-user-archive@giraph.apache.org Received: (qmail 27126 invoked by uid 500); 4 Dec 2013 09:35:43 -0000 Mailing-List: contact user-help@giraph.apache.org; run by ezmlm Precedence: bulk List-Help: List-Unsubscribe: List-Post: List-Id: Reply-To: user@giraph.apache.org Delivered-To: mailing list user@giraph.apache.org Received: (qmail 27118 invoked by uid 99); 4 Dec 2013 09:35:41 -0000 Received: from nike.apache.org (HELO nike.apache.org) (192.87.106.230) by apache.org (qpsmtpd/0.29) with ESMTP; Wed, 04 Dec 2013 09:35:41 +0000 X-ASF-Spam-Status: No, hits=2.2 required=5.0 tests=HTML_MESSAGE,MIME_QP_LONG_LINE,SPF_PASS X-Spam-Check-By: apache.org Received-SPF: pass (nike.apache.org: local policy includes SPF record at spf.trusted-forwarder.org) Received: from [96.31.32.195] (HELO mc.internetmailserver.net) (96.31.32.195) by apache.org (qpsmtpd/0.29) with ESMTP; Wed, 04 Dec 2013 09:35:31 +0000 Received: from smg03.internetmailserver.net (localhost [127.0.0.1]) by mc.internetmailserver.net (Postfix) with ESMTP id D9EE4621D3 for ; Wed, 4 Dec 2013 01:34:46 -0800 (PST) X-Sender-Id: rvesse@dotnetrdf.org Received: from smg03.internetmailserver.net (smg03.internetmailserver.net [216.32.60.7]) by 0.0.0.0:2501 (trex/4.8.90); Wed, 04 Dec 2013 09:34:52 GMT X-MC-Relay: Good Received: from sm06.internetmailserver.net (sm06.dotnetplayground.com [192.168.120.26]) by smg03.internetmailserver.net with SMTP; Wed, 4 Dec 2013 01:34:40 -0800 Received: from [192.168.1.65] (host31-50-86-94.range31-50.btcentralplus.com [31.50.86.94]) by sm06.internetmailserver.net with SMTP; Wed, 4 Dec 2013 01:34:51 -0800 User-Agent: Microsoft-MacOutlook/14.3.9.131030 Date: Wed, 04 Dec 2013 09:34:13 +0000 Subject: Re: Missing chosen worker error on particular superstep From: Rob Vesse To: Message-ID: Thread-Topic: Missing chosen worker error on particular superstep References: In-Reply-To: Mime-version: 1.0 Content-type: multipart/alternative; boundary="B_3468994457_89253" X-Virus-Checked: Checked by ClamAV on apache.org > This message is in MIME format. Since your mail reader does not understand this format, some or all of this message may not be legible. --B_3468994457_89253 Content-type: text/plain; charset="ISO-8859-1" Content-transfer-encoding: quoted-printable Kaushik I'm also running a triangle counting program of my own at the moment and have encountered similar problems. Missing working appears to indicate that one of the workers failed (AFAIK) and you'll need to look at the logs for the other map tasks to determine wh= y this is. You appear to have provided those logs and it looks to be the sam= e problem I had of out of memory. Your algorithm appears to be roughly similar to mine so what you'll be hitting is that even with this relatively small graph you get a massive explosion of messages by the later super step= s which exhausts memory (in my graph the theoretical maximum messages by the last super step was ~3 billion). In order for me to get a successful run on a single node cluster you'll likely have to do some/all of the following: * Enable out of core messages by adding appropriate configuration, see http://giraph.apache.org/ooc.html (I set max messages in memory to 10,000) * Increase the heap size for the map reduce processes, this can be done wit= h the mapred.child.java.opts setting (I used =ADXmx768M =ADXms256M). Ideally you should pick values such that the max memory times the number of workers wil= l fit into available RAM as otherwise you'll start hitting swap files and grind to a halt * Increase the number of workers you use (the =ADw argument you pass to Giraph), you'll need to have also configured your Hadoop cluster to support at least one more map tasks than the number of workers (if not using external Zookeeper) or at least as many map tasks as the number of workers if using external Zookeeper. Bear in mind that using more workers doesn't necessarily help that much on a single node cluster but it does mean you ca= n keep the heap size settings down lower. Even with all this my algorithm took 16 minutes to run on a graph with 10K vertices and 150K edges, if you have better luck I'd be interested to know. Hope this helps, Rob From: Kaushik Patnaik Reply-To: Date: Wednesday, 4 December 2013 04:18 To: Subject: Missing chosen worker error on particular superstep > Hi, >=20 > I am getting a missing chosen worker error upon running a custom program > (triangle counting) on a graph dataset of about 7K nodes and 200K edges. = The > program runs fine on the first two supersteps, but throws a missing chose= n > worker error in the third superstep. In the third superstep each node sen= ds > messages to all its neighbors. >=20 > The program runs fine and outputs results on test datasets of size 100 no= des > and 1000 edges, still I have pasted it below in case I am missing somethi= ng. >=20 > I am running the job on a single cluster machine though, could that be a > problem ?? Will increasing the "mapreduce.task.timeout" help in any manne= r ? > I also see a Java Heap Source error in the second map task of the same ru= n. > Are these errors connected ? >=20 > Regards > Kaushik >=20 > syslogs from attempt_201312031749_0011_m_000000_0 > 2013-12-03 23:10:50,661 INFO org.apache.giraph.master.MasterThread: > masterThread: Coordination of superstep 1 took 100.399 seconds ended with > state THIS_SUPERSTEP_DONE and is now on superstep 2 > 2013-12-03 23:10:50,774 INFO org.apache.giraph.comm.netty.NettyClient: > connectAllAddresses: Successfully added 0 connections, (0 total connected= ) 0 > failed, 0 failures total. > 2013-12-03 23:10:50,774 INFO org.apache.giraph.partition.PartitionBalance= r: > balancePartitionsAcrossWorkers: Using algorithm static > 2013-12-03 23:10:50,775 INFO org.apache.giraph.partition.PartitionUtils: > analyzePartitionStats: Vertices - Mean: 7115, Min: Worker(hostname=3Dlocalh= ost, > MRtaskID=3D1, port=3D30001) - 7115, Max: Worker(hostname=3Dlocalhost, MRtaskID=3D= 1, > port=3D30001) - 7115 > 2013-12-03 23:10:50,775 INFO org.apache.giraph.partition.PartitionUtils: > analyzePartitionStats: Edges - Mean: 201524, Min: Worker(hostname=3Dlocalho= st, > MRtaskID=3D1, port=3D30001) - 201524, Max: Worker(hostname=3Dlocalhost, MRtaskI= D=3D1, > port=3D30001) - 201524 > 2013-12-03 23:10:50,806 INFO org.apache.giraph.master.BspServiceMaster: > barrierOnWorkerList: 0 out of 1 workers finished on superstep 2 on path > /_hadoopBsp/job_201312031749_0011/_applicationAttemptsDir/0/_superstepDir= /2/_w > orkerFinishedDir > 2013-12-03 23:10:50,806 INFO org.apache.giraph.master.BspServiceMaster: > barrierOnWorkerList: Waiting on [localhost_1] > 2013-12-03 23:11:01,756 ERROR org.apache.giraph.master.BspServiceMaster: > superstepChosenWorkerAlive: Missing chosen worker Worker(hostname=3Dlocalho= st, > MRtaskID=3D1, port=3D30001) on superstep 2 > 2013-12-03 23:11:01,757 INFO org.apache.giraph.master.MasterThread: > masterThread: Coordination of superstep 2 took 11.096 seconds ended with = state > WORKER_FAILURE and is now on superstep 2 > 2013-12-03 23:11:01,761 FATAL org.apache.giraph.master.BspServiceMaster: > getLastGoodCheckpoint: No last good checkpoints can be found, killing the= job. > attempt_201312031749_0011_m_000001_0 > java.lang.IllegalStateException: run: Caught an unrecoverable exception > waitFor: ExecutionException occurred while waiting for > org.apache.giraph.utils.ProgressableUtils$FutureWaitable@61db327f > at org.apache.giraph.graph.GraphMapper.run(GraphMapper.java:101) > at org.apache.hadoop.mapred.MapTask.runNewMapper(MapTask.java:763) > at org.apache.hadoop.mapred.MapTask.run(MapTask.java:369) > at org.apache.hadoop.mapred.Child$4.run(Child.java:259) > at java.security.AccessController.doPrivileged(Native Method) > at javax.security.auth.Subject.doAs(Subject.java:415) > at=20 > org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation= .java > :1059) > at org.apache.hadoop.mapred.Child.main(Child.java:253) > Caused by: java.lang.IllegalStateException: waitFor: ExecutionException > occurred while waiting for > org.apache.giraph.utils.ProgressableUtils$FutureWaitable@61db327f > at=20 > org.apache.giraph.utils.ProgressableUtils.waitFor(ProgressableUtils.java:= 181) > at=20 > org.apache.giraph.utils.ProgressableUtils.waitForever(ProgressableUtils.j= ava:1 > 39) > at=20 > org.apache.giraph.utils.ProgressableUtils.waitForever(ProgressableUtils.j= ava:1 > 24) > at=20 > org.apache.giraph.utils.ProgressableUtils.getFutureResult(ProgressableUti= ls.ja > va:87) > at=20 > org.apache.giraph.utils.ProgressableUtils.getResultsWithNCallables(Progre= ssabl > eUtils.java:221) > at=20 > org.apache.giraph.graph.GraphTaskManager.processGraphPartitions(GraphTask= Manag > er.java:736) > at=20 > org.apache.giraph.graph.GraphTaskManager.execute(GraphTaskManager.java:28= 4) > at org.apache.giraph.graph.GraphMapper.run(GraphMapper.java:91) > ... 7 more > Caused by: java.util.concurrent.ExecutionException: > java.lang.OutOfMemoryError: Java heap space > at java.util.concurrent.FutureTask$Sync.innerGet(FutureTask.java:262) > at java.util.concurrent.FutureTask.get(FutureTask.java:119) > at=20 > org.apache.giraph.utils.ProgressableUtils$FutureWaitable.waitFor(Progress= ableU > tils.java:300) > at=20 > org.apache.giraph.utils.ProgressableUtils.waitFor(ProgressableUtils.java:= 173) >=20 > Program -=20 > public class TriangleCounting extends BasicComputation< > DoubleWritable, DoubleWritable, FloatWritable, DoubleWritable> { > =20 > /** Class logger */ > private static final Logger LOG =3D Logger.getLogger(TriangleCounting.cla= ss); > =20 > @Override > public void compute( > Vertex vertex, > Iterable messages) throws IOException { > =20 > /** First Superstep releases messages to vertexIds() whose value is gre= ater > than its value. Both VertexId and Message are double **/ > if (getSuperstep() =3D=3D 0) { > for (Edge edge: vertex.getEdges()) { > if (edge.getTargetVertexId().compareTo(vertex.getId()) =3D=3D 1) { > sendMessage(edge.getTargetVertexId(), vertex.getId()); > if (LOG.isDebugEnabled()) { > LOG.debug("Vertex " + vertex.getId() + " sent message " + > vertex.getId() + " to vertex " + edge.getTargetVertexId()); > } > System.out.println("Vertex " + vertex.getId() + " sent message " = + > vertex.getId() + " to vertex " + edge.getTargetVertexId()); > } > } > } > =20 > /** Second superstep releases messages to message.get() < vertex.getId(= ) < > targetVertexId() **/ > if (getSuperstep() =3D=3D 1) { > for (DoubleWritable message: messages) { > for (Edge edge: vertex.getEdges()) { > if (edge.getTargetVertexId().compareTo(vertex.getId()) + > vertex.getId().compareTo(message) =3D=3D 2) { > sendMessage(edge.getTargetVertexId(), message); > if (LOG.isDebugEnabled()) { > LOG.debug("Vertex " + vertex.getId() + " sent message " + mess= age + > " to vertex " + edge.getTargetVertexId()); > } > System.out.println("Vertex " + vertex.getId() + " sent message "= + > message + " to vertex " + edge.getTargetVertexId()); > } > } > } > } > /** Sends messages to all its neighbours, the messages it receives **/ > if (getSuperstep() =3D=3D 2) { > for (DoubleWritable message: messages) { > sendMessageToAllEdges(vertex, message); > } > } > =20 > if (getSuperstep() =3D=3D 3) { > double Value =3D 0.0; > for (DoubleWritable message: messages) { > if (vertex.getId().equals(message)) { > Value +=3D 1.0; > System.out.println("Vertex " + vertex.getId() + " received > message " + message); > } > } > vertex.setValue(new DoubleWritable(Value)); > }=20 > =20 > vertex.voteToHalt(); > } > } --B_3468994457_89253 Content-type: text/html; charset="ISO-8859-1" Content-transfer-encoding: quoted-printable
Kaushik

=
I'm also running a triangle counting program of my own at the moment an= d have encountered similar problems.

Missing workin= g appears to indicate that one of the workers failed (AFAIK) and you'll need= to look at the logs for the other map tasks to determine why this is.  = ;You appear to have provided those logs and it looks to be the same problem = I had of out of memory.  Your algorithm appears to be roughly similar t= o mine so what you'll be hitting is that even with this relatively small gra= ph you get a massive explosion of messages by the later super steps which ex= hausts memory (in my graph the theoretical maximum messages by the last supe= r step was ~3 billion).

In order for me to get a su= ccessful run on a single node cluster you'll likely have to do some/all of t= he following:
  • Enable out of core messages by adding appropriate= configuration, see http://= giraph.apache.org/ooc.html (I set max messages in memory to 10,000)=
  • Increase the heap size for the map reduce processes, this can be do= ne with the mapred.child.java.opts setting (I used –Xmx768M –Xms= 256M).  Ideally you should pick values such that the max memory times t= he number of workers will fit into available RAM as otherwise you'll start h= itting swap files and grind to a halt
  • Increase the number of workers= you use (the –w argument you pass to Giraph), you'll need to have als= o configured your Hadoop cluster to support at least one more map tasks than= the number of workers (if not using external Zookeeper) or at least as many= map tasks as the number of workers if using external Zookeeper.  Bear = in mind that using more workers doesn't necessarily help that much on a sing= le node cluster but it does mean you can keep the heap size settings down lo= wer.
Even with all this my algorithm took 16 minutes to run on= a graph with 10K vertices and 150K edges, if you have better luck I'd be in= terested to know.

Hope this helps,

Rob

From: K= aushik Patnaik <kaushikpatnaik@= gmail.com>
Reply-To: <user@giraph.apache.org>
Date: Wednesday, 4 December 2013 04:18To: <user@giraph.apache.org>
Subject: Missing chosen worker error on particular superstep

Hi,

I am getting a missing chosen worker e= rror upon running a custom program (triangle counting) on a graph dataset of= about 7K nodes and 200K edges. The program runs fine on the first two super= steps, but throws a missing chosen worker error in the third superstep. In t= he third superstep each node sends messages to all its neighbors.
=
The program runs fine and outputs results on test datasets of= size 100 nodes and 1000 edges, still I have pasted it below in case I am mi= ssing something.

I am running the job on a single c= luster machine though, could that be a problem ?? Will increasing the "mapreduce.task.timeout" help = in any manner ?
I also see a Java Heap Source error in the second map task of the s= ame run. Are these errors connected ?

Regards
Kaushik

syslogs from attempt_2013120317= 49_0011_m_000000_0
2013-12-03 23:10=
:50,661 INFO org.apache.giraph.master.MasterThread: masterThread: Coordinati=
on of superstep 1 took 100.399 seconds ended with state THIS_SUPERSTEP_DONE =
and is now on superstep 2
2013-12-03 23:10:50,774 INFO org.apache.giraph.comm.netty.NettyClient: conn=
ectAllAddresses: Successfully added 0 connections, (0 total connected) 0 fai=
led, 0 failures total.
2013-12-03 23:10:50,774 INFO org.apache.giraph.partition.PartitionBalancer:=
 balancePartitionsAcrossWorkers: Using algorithm static
2013-12-03 23:10:50,775 INFO org.apache.giraph.partition.PartitionUtils: an=
alyzePartitionStats: Vertices - Mean: 7115, Min: Worker(hostname=3Dlocalhost, =
MRtaskID=3D1, port=3D30001) - 7115, Max: Worker(hostname=3Dlocalhost, MRtaskID=3D1, =
port=3D30001) - 7115
2013-12-03 23:10:50,775 INFO org.apache.giraph.partition.PartitionUtils: an=
alyzePartitionStats: Edges - Mean: 201524, Min: Worker(hostname=3Dlocalhost, M=
RtaskID=3D1, port=3D30001) - 201524, Max: Worker(hostname=3Dlocalhost, MRtaskID=3D1,=
 port=3D30001) - 201524
2013-12-03 23:10:50,806 INFO org.apache.giraph.master.BspServiceMaster: bar=
rierOnWorkerList: 0 out of 1 workers finished on superstep 2 on path /_hadoo=
pBsp/job_201312031749_0011/_applicationAttemptsDir/0/_superstepDir/2/_worker=
FinishedDir
2013-12-03 23:10:50,806 INFO org.apache.giraph.master.BspServiceMaster: bar=
rierOnWorkerList: Waiting on [localhost_1]
2013-12-03 23:11:01,756 ERROR org.apache.giraph.master.BspServiceMaster: su=
perstepChosenWorkerAlive: Missing chosen worker Worker(hostname=3Dlocalhost, M=
RtaskID=3D1, port=3D30001) on superstep 2
2013-12-03 23:11:01,757 INFO org.apache.giraph.master.MasterThread: masterT=
hread: Coordination of superstep 2 took 11.096 seconds ended with state WORK=
ER_FAILURE and is now on superstep 2
2013-12-03 23:11:01,761 FATAL org.apache.giraph.master.BspServiceMaster: ge=
tLastGoodCheckpoint: No last good checkpoints can be found, killing the job.=
attempt_201312031749_0011_m_000001_0
java.lang.IllegalStateException: run: Caught an un=
recoverable exception waitFor: ExecutionException occurred while waiting for=
 org.apache.giraph.utils.ProgressableUtils$FutureWaitable@61db327f
	at org.apache.giraph.graph.GraphMapper.run(GraphMapper.java:101)
	at org.apache.hadoop.mapred.MapTask.runNewMapper(MapTask.java:763)
	at org.apache.hadoop.mapred.MapTask.run(MapTask.java:369)
	at org.apache.hadoop.mapred.Child$4.run(Child.java:259)
	at java.security.AccessController.doPrivileged(Native Method)
	at javax.security.auth.Subject.doAs(Subject.java:415)
	at org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformati=
on.java:1059)
	at org.apache.hadoop.mapred.Child.main(Child.java:253)
Caused by: java.lang.IllegalStateException: waitFor: ExecutionException occ=
urred while waiting for org.apache.giraph.utils.ProgressableUtils$FutureWait=
able@61db327f
	at org.apache.giraph.utils.ProgressableUtils.waitFor(ProgressableUtils.jav=
a:181)
	at org.apache.giraph.utils.ProgressableUtils.waitForever(ProgressableUtils=
.java:139)
	at org.apache.giraph.utils.ProgressableUtils.waitForever(ProgressableUtils=
.java:124)
	at org.apache.giraph.utils.ProgressableUtils.getFutureResult(ProgressableU=
tils.java:87)
	at org.apache.giraph.utils.ProgressableUtils.getResultsWithNCallables(Prog=
ressableUtils.java:221)
	at org.apache.giraph.graph.GraphTaskManager.processGraphPartitions(GraphTa=
skManager.java:736)
	at org.apache.giraph.graph.GraphTaskManager.execute(GraphTaskManager.java:=
284)
	at org.apache.giraph.graph.GraphMapper.run(GraphMapper.java:91)
	... 7 more
Caused by: java.util.concurrent.ExecutionException: java.lang.OutOfMemoryEr=
ror: Java heap space
	at java.util.concurrent.FutureTask$Sync.innerGet(FutureTask.java:262)
	at java.util.concurrent.FutureTask.get(FutureTask.java:119)
	at org.apache.giraph.utils.ProgressableUtils$FutureWaitable.waitFor(Progre=
ssableUtils.java:300)
	at org.apache.giraph.utils.ProgressableUtils.waitFor(ProgressableUtils.jav=
a:173)

Program - 
public class Triangle=
Counting extends BasicComputation<
    DoubleWritable, DoubleWritable, FloatWritable, DoubleWritable> {
  
  /** Class logger */
  private static final Logger LOG =3D Logger.getLogger(TriangleCounting.class=
);
    
  @Override
  public void compute(
      Vertex<DoubleWritable, DoubleWritable, FloatWritable> vertex,
      Iterable<DoubleWritable> messages) throws IOException {
		  
  /** First Superstep releases messages to vertexIds() whose value is great=
er than its value. Both VertexId and Message are double **/		  
  if (getSuperstep() =3D=3D 0) {
    for (Edge<DoubleWritable, FloatWritable> edge: vertex.getEdges())=
 {
	  if (edge.getTargetVertexId().compareTo(vertex.getId()) =3D=3D 1) {
		sendMessage(edge.getTargetVertexId(), vertex.getId());
        if (LOG.isDebugEnabled()) {
          LOG.debug("Vertex " + vertex.getId() + " sent message " + vertex.=
getId() + " to vertex " + edge.getTargetVertexId());
        }
        System.out.println("Vertex " + vertex.getId() + " sent message " + =
vertex.getId() + " to vertex " + edge.getTargetVertexId());
	  }
    }
  }
    
  /** Second superstep releases messages to message.get() < vertex.getId=
() < targetVertexId() **/  
  if (getSuperstep() =3D=3D 1) {
	for (DoubleWritable message: messages) {
	    for (Edge<DoubleWritable, FloatWritable> edge: vertex.getEdges()=
) {
		  if (edge.getTargetVertexId().compareTo(vertex.getId()) + vertex.getId()=
.compareTo(message) =3D=3D 2) {
		    sendMessage(edge.getTargetVertexId(), message);
	        if (LOG.isDebugEnabled()) {
	          LOG.debug("Vertex " + vertex.getId() + " sent message " + messag=
e + " to vertex " + edge.getTargetVertexId());
	        }
	        System.out.println("Vertex " + vertex.getId() + " sent message " +=
 message + " to vertex " + edge.getTargetVertexId());
		  }
	    }
    }
  }
  /** Sends messages to all its neighbours, the messages it receives **/
  if (getSuperstep() =3D=3D 2) {
    for (DoubleWritable message: messages) {
		sendMessageToAllEdges(vertex, message);
	}
  }
  
  if (getSuperstep() =3D=3D 3) {
      double Value =3D 0.0;
	  for (DoubleWritable message: messages) {
		  if (vertex.getId().equals(message)) {
              Value +=3D 1.0;
              System.out.println("Vertex " + vertex.getId() + " received me=
ssage " + message);
	      }
      }
      vertex.setValue(new DoubleWritable(Value));
  } 
    
  vertex.voteToHalt();
  }
}
--B_3468994457_89253--