giraph-dev mailing list archives

From "Alexandre Fonseca (JIRA)" <>
Subject [jira] [Updated] (GIRAPH-788) Giraph job suspends with exceptions when out-of-core options are set
Date Mon, 03 Feb 2014 17:30:08 GMT


Alexandre Fonseca updated GIRAPH-788:

    Attachment: GIRAPH-788.patch

Stumbled across this same issue today with exactly the same stacktrace.

Function getOrCreatePartition(id) in DiskBackedPartitionStore:228-238 is the parent frame
of the NPE in the stack trace. Looking at the code:

      Partition<I, V, E> partition =
          pool.submit(new GetPartition(id)).get();
      if (partition == null) {
        Partition<I, V, E> newPartition =
            conf.createPartition(id, context);
        pool.submit(new AddPartition(id, newPartition)).get();
        return newPartition;
      } else {
        return partition;
      }
it is obvious that the intent is to get a partition if it exists or to add a new one if it
doesn't. However, in the call() method of GetPartition (DiskBackedPartitionStore:695), the
states HashMap is accessed directly without checking whether the ID is present. Since state
entries are only created for existing partitions, this lookup returns null for a new partition,
and switching on that null causes the NPE:

      while (partition == null) {
        try {
          State pState = states.get(id);  // null when no state exists for id
          switch (pState) {               // switching on null throws the NPE

By adding a check for the existence of the partition before this direct access, I was able
to avoid the NPE and obtain the expected result. This patch implements that check and
was tested against a 5-node Hadoop 2.2.0 cluster using both MR2 and YARN executions.
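To illustrate the failure mode outside of Giraph: a Java switch on an enum dereferences the selector value, so a null pulled straight from the map throws a NullPointerException before any case is reached. The sketch below (hypothetical class, method, and state names, not the actual Giraph code or patch) shows the unguarded lookup next to the kind of existence check described above:

```java
import java.util.HashMap;
import java.util.Map;

public class PartitionStateDemo {
    enum State { IN_MEMORY, ON_DISK }

    static final Map<Integer, State> states = new HashMap<>();

    // Unguarded: mirrors the direct access in GetPartition.call().
    // A switch on an enum dereferences the selector, so a null from
    // states.get(id) throws a NullPointerException.
    static String describeUnguarded(int id) {
        State pState = states.get(id);
        switch (pState) {                 // NPE when states has no entry for id
            case IN_MEMORY: return "in memory";
            default:        return "on disk";
        }
    }

    // Guarded: check for a missing entry before switching, so the caller
    // can fall through to creating the partition instead of crashing.
    static String describeGuarded(int id) {
        State pState = states.get(id);
        if (pState == null) {
            return "no such partition";
        }
        switch (pState) {
            case IN_MEMORY: return "in memory";
            default:        return "on disk";
        }
    }

    public static void main(String[] args) {
        states.put(1, State.IN_MEMORY);
        System.out.println(describeGuarded(1));   // in memory
        System.out.println(describeGuarded(2));   // no such partition
        try {
            describeUnguarded(2);
        } catch (NullPointerException e) {
            System.out.println("NPE for unknown id");
        }
    }
}
```

The guarded variant returns a sentinel here only for brevity; in getOrCreatePartition the natural follow-up is creating the missing partition, as the surrounding code already does for the partition == null case.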

> Giraph job suspends with exceptions when out-of-core options are set
> --------------------------------------------------------------------
>                 Key: GIRAPH-788
>                 URL:
>             Project: Giraph
>          Issue Type: Bug
>          Components: mapreduce
>    Affects Versions: 1.0.0
>         Environment: uses hadoop with 32 cluster nodes
> Giraph release-1.0 pulled Oct. 29. 2013. 
>            Reporter: Byungnam Lim
>         Attachments: GIRAPH-788.patch
> When I run my code with the out-of-core graph/message options OFF, it's fine. But when
the out-of-core graph/message options are ON, some workers give me exception messages like
the ones below and the whole job suspends.
> {noformat}
> java.lang.IllegalStateException: run: Caught an unrecoverable exception waitFor: ExecutionException
occurred while waiting for org.apache.giraph.utils.ProgressableUtils$FutureWaitable@3c7659ab
> 	at
> 	at org.apache.hadoop.mapred.MapTask.runNewMapper(
> 	at
> 	at org.apache.hadoop.mapred.Child$
> 	at Method)
> 	at
> 	at
> 	at org.apache.hadoop.mapred.Child.main(
> Caused by: java.lang.IllegalStateException: waitFor: ExecutionException occurred while
waiting for org.apache.giraph.utils.ProgressableUtils$FutureWaitable@3c7659ab
> 	at org.apache.giraph.utils.ProgressableUtils.waitFor(
> 	at org.apache.giraph.utils.ProgressableUtils.waitForever(
> 	at org.apache.giraph.utils.ProgressableUtils.waitForever(
> 	at org.apache.giraph.utils.ProgressableUtils.getFutureResult(
> 	at org.apache.giraph.utils.ProgressableUtils.getResultsWithNCallables(
> 	at org.apache.giraph.worker.BspServiceWorker.loadInputSplits(
> 	at org.apache.giraph.worker.BspServiceWorker.loadVertices(
> 	at org.apache.giraph.worker.BspServiceWorker.setup(
> 	at org.apache.giraph.graph.GraphTaskManager.execute(
> 	at
> 	... 7 more
> Caused by: java.util.concurrent.ExecutionException: java.lang.IllegalStateException:
getOrCreatePartition: cannot retrieve partition 6
> 	at java.util.concurrent.FutureTask$Sync.innerGet(
> 	at java.util.concurrent.FutureTask.get(
> 	at org.apache.giraph.utils.ProgressableUtils$FutureWaitable.waitFor(
> 	at org.apache.giraph.utils.ProgressableUtils.waitFor(
> 	... 16 more
> Caused by: java.lang.IllegalStateException: getOrCreatePartition: cannot retrieve partition
> 	at org.apache.giraph.partition.DiskBackedPartitionStore.getOrCreatePartition(
> 	at org.apache.giraph.comm.requests.SendWorkerVerticesRequest.doRequest(
> 	at org.apache.giraph.comm.netty.NettyWorkerClientRequestProcessor.doRequest(
> 	at org.apache.giraph.comm.netty.NettyWorkerClientRequestProcessor.sendVertexRequest(
> 	at org.apache.giraph.worker.VertexInputSplitsCallable.readInputSplit(
> 	at org.apache.giraph.worker.InputSplitsCallable.loadInputSplit(
> 	at
> 	at
> 	at
> 	at java.util.concurrent.FutureTask$Sync.innerRun(
> 	at
> 	at java.util.concurrent.ThreadPoolExecutor.runWorker(
> 	at java.util.concurrent.ThreadPoolExecutor$
> 	at
> Caused by: java.util.concurrent.ExecutionException: java.lang.NullPointerException
> 	at java.util.concurrent.FutureTask$Sync.innerGet(
> 	at java.util.concurrent.FutureTask.get(
> 	at org.apache.giraph.partition.DiskBackedPartitionStore.getOrCreatePartition(
> 	... 13 more
> Caused by: java.lang.NullPointerException
> 	at org.apache.giraph.partition.DiskBackedPartitionStore$
> 	at org.apache.giraph.partition.DiskBackedPartitionStore$
> 	at java.util.concurrent.FutureTask$Sync.innerRun(
> 	at
> 	at org.apache.giraph.partition.DiskBackedPartitionStore$DirectExecutorService.execute(
> 	at java.util.concurrent.AbstractExecutorService.submit(
> 	... 14 more
> {noformat}
> This exception occurs when superstep = -1.
> Strangely, i) when I run the job with 10 or fewer workers, or ii) when I run one of the
example codes in giraph-examples - in particular, SimpleShortestPath with 32 workers - the
job finishes fine. The exceptions only occur when I run my own code with more than 10
workers; then the job goes off the rails.
> I found that a similar - as far as I can tell, the very same - problem was reported before
in GIRAPH-462, but that issue is marked as 'Resolved' and 'Fixed'. Is that issue really fixed,
and am I just doing something wrong?
> My input size was 75 MBytes with about 1 million nodes, but I tested and found that this
problem does not depend on the input size.

This message was sent by Atlassian JIRA
