giraph-dev mailing list archives

Site index · List index
Message view « Date » · « Thread »
Top « Date » · « Thread »
From "Alexandre Fonseca (JIRA)" <j...@apache.org>
Subject [jira] [Commented] (GIRAPH-806) Application goes into infinite loop after success, waits for: Waiting for org.apache.giraph.utils.ProgressableUtils
Date Mon, 09 Dec 2013 22:44:07 GMT

    [ https://issues.apache.org/jira/browse/GIRAPH-806?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13843640#comment-13843640
] 

Alexandre Fonseca commented on GIRAPH-806:
------------------------------------------

Yes GIRAPH-811 covers a different problem entirely. This problem appears to be related with
Netty. In particular the application seems to get stuck waiting for an opened channel to close
in line 360 of NettyServer.java:

{code}ProgressableUtils.awaitChannelGroupFuture(accepted.close(), progressable);{code}

Unfortunately, I'm not familiar with Netty nor can I replicate the problem under a pure Yarn
compilation. Were there no indications of other errors/client crashes that might have kept
a channel indefinitely open?

Another thing that seems strange to me is, in ProgressableUtils.java's waitFor(...) function,
the following flow:

{code}
    int currentWaitMsecs;
    while (true) {
      progressable.progress();
      currentWaitMsecs = Math.min(msecs, msecsPeriod);
      try {
        waitable.waitFor(currentWaitMsecs);
        if (waitable.isFinished()) {
          return waitable.getResult();
        }
      } 
      // (...)
      msecs = Math.max(0, msecs - currentWaitMsecs);
    }
{code}

Given that the function waitForever in line 136 of the same file is the one that calls waitFor()
and supplies msecs = msecsPeriod = DEFUALT_MSEC_PERIOD = 60 * 1000, then we'll have the following:

* First iteration
** currentWaitMsecs = msecs = 60 * 1000
** Wait for 60 seconds 
** msecs = Math.max(0, msecs - currentWaitMsecs) = Math.max(0, msecs - msecs) = 0
* Second (and following) iterations
** currentWaitMsecs = msecs = 0
** Wait for 0 seconds
** msecs = 0

So after 1 minute, if the Waitable we're waiting for still hasn't finished, we start effectively
hammering it thousands of times per second? Not only is this a waste of CPU cycles, it might
incur on some starvation issues which might keep the Waitable from finishing somehow?

> Application goes into infinite loop after success, waits for: Waiting for org.apache.giraph.utils.ProgressableUtils
> -------------------------------------------------------------------------------------------------------------------
>
>                 Key: GIRAPH-806
>                 URL: https://issues.apache.org/jira/browse/GIRAPH-806
>             Project: Giraph
>          Issue Type: Bug
>          Components: examples, mapreduce
>    Affects Versions: 1.1.0
>            Reporter: Rafal Wojdyla
>
> Run example:
> {noformat}
> hadoop jar giraph-examples-1.1.0-SNAPSHOT-for-hadoop-2.2.0.2.0.6.0-76-jar-with-dependencies.jar
org.apache.giraph.GiraphRunner -D giraph.zkList="zkNode.net:2181" -D yarn.app.mapreduce.am.log.level=DEBUG
-D mapreduce.map.log.level=DEBUG -D mapreduce.reduce.log.level=DEBUG org.apache.giraph.examples.SimplePageRankComputation
-w 1 -yh 1024 -yj ./giraph-examples-1.1.0-SNAPSHOT-for-hadoop-2.2.0.2.0.6.0-76-jar-with-dependencies.jar
-vif org.apache.giraph.io.formats.JsonLongDoubleFloatDoubleVertexInputFormat -vof org.apache.giraph.io.formats.IdWithValueTextOutputFormat
-mc org.apache.giraph.examples.SimplePageRankComputation\$SimplePageRankMasterCompute -vip
/user/rav/giraph/input/giraph.in -op /user/rav/giraph/output
> {noformat}
> After application successfully computer result it goes to infinite loop of:
> {noformat}
> 2013-11-28 20:09:51,764 INFO [org.apache.giraph.master.MasterThread] org.apache.giraph.master.BspServiceMaster:
setJobState: {"_stateKey":"FINISHED","_applicationAttemptKey":-1,"_superstepKey":-1} on superstep
31
> 2013-11-28 20:09:51,821 INFO [org.apache.giraph.master.MasterThread] org.apache.giraph.master.BspServiceMaster:
cleanup: Notifying master its okay to cleanup with /_hadoopBsp/job_1385571160481_11413/_cleanedUpDir/0_master
> 2013-11-28 20:09:51,829 INFO [org.apache.giraph.master.MasterThread] org.apache.giraph.master.BspServiceMaster:
cleanUpZooKeeper: Node /_hadoopBsp/job_1385571160481_11413/_cleanedUpDir already exists, no
need to create.
> 2013-11-28 20:09:51,831 INFO [org.apache.giraph.master.MasterThread] org.apache.giraph.master.BspServiceMaster:
cleanUpZooKeeper: Got 1 of 2 desired children from /_hadoopBsp/job_1385571160481_11413/_cleanedUpDir
> 2013-11-28 20:09:51,831 INFO [org.apache.giraph.master.MasterThread] org.apache.giraph.master.BspServiceMaster:
cleanedUpZooKeeper: Waiting for the children of /_hadoopBsp/job_1385571160481_11413/_cleanedUpDir
to change since only got 1 nodes.
> 2013-11-28 20:09:51,909 INFO [main-EventThread] org.apache.giraph.bsp.BspService: process:
cleanedUpChildrenChanged signaled
> 2013-11-28 20:09:51,910 INFO [org.apache.giraph.master.MasterThread] org.apache.giraph.master.BspServiceMaster:
cleanUpZooKeeper: Got 2 of 2 desired children from /_hadoopBsp/job_1385571160481_11413/_cleanedUpDir
> 2013-11-28 20:09:51,911 INFO [org.apache.giraph.master.MasterThread] org.apache.giraph.master.BspServiceMaster:
cleanupZooKeeper: Removing the following path and all children - /_hadoopBsp/job_1385571160481_11413
from ZooKeeper list zkNode.net:2181
> 2013-11-28 20:09:52,329 INFO [main-EventThread] org.apache.giraph.bsp.BspService: process:
masterElectionChildrenChanged signaled
> 2013-11-28 20:09:52,379 INFO [main-EventThread] org.apache.giraph.bsp.BspService: process:
cleanedUpChildrenChanged signaled
> 2013-11-28 20:09:52,447 INFO [org.apache.giraph.master.MasterThread] org.apache.giraph.master.BspServiceMaster:
cleanup: Removed HDFS checkpoint directory (_bsp/_checkpoints//job_1385571160481_11413) with
return = false since the job Giraph: org.apache.giraph.examples.SimplePageRankComputation
succeeded 
> 2013-11-28 20:09:52,449 INFO [org.apache.giraph.master.MasterThread] org.apache.giraph.comm.netty.NettyClient:
stop: reached wait threshold, 1 connections closed, releasing NettyClient.bootstrap resources
now.
> 2013-11-28 20:09:52,454 INFO [org.apache.giraph.master.MasterThread] org.apache.giraph.comm.netty.NettyServer:
stop: Halting netty server
> 2013-11-28 20:10:52,463 INFO [org.apache.giraph.master.MasterThread] org.apache.giraph.utils.ProgressableUtils:
waitFor: Waiting for org.apache.giraph.utils.ProgressableUtils$ExecutorServiceWaitable@2f368c5d
> 2013-11-28 20:11:52,463 INFO [org.apache.giraph.master.MasterThread] org.apache.giraph.utils.ProgressableUtils:
waitFor: Waiting for org.apache.giraph.utils.ProgressableUtils$ExecutorServiceWaitable@2f368c5d
> 2013-11-28 20:12:52,463 INFO [org.apache.giraph.master.MasterThread] org.apache.giraph.utils.ProgressableUtils:
waitFor: Waiting for org.apache.giraph.utils.ProgressableUtils$ExecutorServiceWaitable@2f368c5d
> 2013-11-28 20:13:52,464 INFO [org.apache.giraph.master.MasterThread] org.apache.giraph.utils.ProgressableUtils:
waitFor: Waiting for org.apache.giraph.utils.ProgressableUtils$ExecutorServiceWaitable@2f368c5d
> 2013-11-28 20:14:52,464 INFO [org.apache.giraph.master.MasterThread] org.apache.giraph.utils.ProgressableUtils:
waitFor: Waiting for org.apache.giraph.utils.ProgressableUtils$ExecutorServiceWaitable@2f368c5d
> 2013-11-28 20:15:52,464 INFO [org.apache.giraph.master.MasterThread] org.apache.giraph.utils.ProgressableUtils:
waitFor: Waiting for org.apache.giraph.utils.ProgressableUtils$ExecutorServiceWaitable@2f368c5d
> 2013-11-28 20:16:52,465 INFO [org.apache.giraph.master.MasterThread] org.apache.giraph.utils.ProgressableUtils:
waitFor: Waiting for org.apache.giraph.utils.ProgressableUtils$ExecutorServiceWaitable@2f368c5d
> 2013-11-28 20:17:52,465 INFO [org.apache.giraph.master.MasterThread] org.apache.giraph.utils.ProgressableUtils:
waitFor: Waiting for org.apache.giraph.utils.ProgressableUtils$ExecutorServiceWaitable@2f368c5d
> 2013-11-28 20:18:52,465 INFO [org.apache.giraph.master.MasterThread] org.apache.giraph.utils.ProgressableUtils:
waitFor: Waiting for org.apache.giraph.utils.ProgressableUtils$ExecutorServiceWaitable@2f368c5d
> 2013-11-28 20:19:52,466 INFO [org.apache.giraph.master.MasterThread] org.apache.giraph.utils.ProgressableUtils:
waitFor: Waiting for org.apache.giraph.utils.ProgressableUtils$ExecutorServiceWaitable@2f368c5d
> 2013-11-28 20:20:52,466 INFO [org.apache.giraph.master.MasterThread] org.apache.giraph.utils.ProgressableUtils:
waitFor: Waiting for org.apache.giraph.utils.ProgressableUtils$ExecutorServiceWaitable@2f368c5d
> 2013-11-28 20:21:52,466 INFO [org.apache.giraph.master.MasterThread] org.apache.giraph.utils.ProgressableUtils:
waitFor: Waiting for org.apache.giraph.utils.ProgressableUtils$ExecutorServiceWaitable@2f368c5d
> 2013-11-28 20:22:52,467 INFO [org.apache.giraph.master.MasterThread] org.apache.giraph.utils.ProgressableUtils:
waitFor: Waiting for org.apache.giraph.utils.ProgressableUtils$ExecutorServiceWaitable@2f368c5d
> {noformat}



--
This message was sent by Atlassian JIRA
(v6.1.4#6159)

Mime
View raw message