giraph-dev mailing list archives

From Eric Kimbrel <lekimb...@gmail.com>
Subject BspServiceMaster: cleanedUpZooKeeper infinite loop
Date Thu, 30 Jan 2014 17:49:33 GMT
Hello, I am not currently a contributor to this project, but I have noticed an issue that I
wanted to report here rather than on the users mailing list.

I am using 1.1.0-SNAPSHOT built for PURE YARN and cdh5.0.0.

I have an intermittent problem that, when it occurs, causes the job to stall after completion
(but before the vertices write their output). Looking at the logs (posted below), I see the
count go from 7 of 8 workers reporting completion to 9 of 8. The code at BspServiceMaster:1740
uses cleanedUpChildrenList.size() == maxTasks as the exit condition inside a while (true) loop,
so once the count skips past 8 the condition can never be satisfied and the job stays stuck
there forever.

I plan on changing this locally to >= for my own use to work around the problem, but I don't
know how 9 of 8 is being reported or what is really causing it.
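
To illustrate the difference between the two comparisons, here is a minimal standalone sketch.
It is not the actual BspServiceMaster code: the class name CleanupWaitSketch and the method
firstExit are hypothetical, and the child counts are simply the ones from the log below. A
return value of -1 stands in for "the loop would keep waiting".

```java
public class CleanupWaitSketch {
    /**
     * Walks through a sequence of observed child counts and returns the
     * 0-based index of the first observation that would end the wait,
     * or -1 if none of them does (a stand-in for waiting forever).
     * Hypothetical helper for illustration only; not part of Giraph.
     */
    public static int firstExit(int[] observedCounts, int maxTasks, boolean exactMatch) {
        for (int i = 0; i < observedCounts.length; i++) {
            int seen = observedCounts[i];
            // exactMatch mirrors the "== maxTasks" check; otherwise use ">= maxTasks".
            boolean done = exactMatch ? (seen == maxTasks) : (seen >= maxTasks);
            if (done) {
                return i;
            }
        }
        return -1;
    }

    public static void main(String[] args) {
        // Counts taken from the "Got N of 8 desired children" log lines.
        int[] counts = {1, 2, 5, 6, 9};
        int maxTasks = 8;
        System.out.println("== exits at observation: " + firstExit(counts, maxTasks, true));
        System.out.println(">= exits at observation: " + firstExit(counts, maxTasks, false));
    }
}
```

With these counts the == check never matches, because the count jumps from 6 straight to 9,
while the >= check matches on the last observation. This only shows why >= avoids the stall; it
does not explain where the extra child nodes come from.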

Thanks for any ideas,
Eric


14/01/30 09:35:37 INFO master.BspServiceMaster: cleanUpZooKeeper: Got 1 of 8 desired children
from /_hadoopBsp/giraph_yarn_application_1390861968364_0050/_cleanedUpDir
14/01/30 09:35:37 INFO master.BspServiceMaster: cleanedUpZooKeeper: Waiting for the children
of /_hadoopBsp/giraph_yarn_application_1390861968364_0050/_cleanedUpDir to change since only
got 1 nodes.
14/01/30 09:35:38 INFO bsp.BspService: process: cleanedUpChildrenChanged signaled
14/01/30 09:35:38 INFO master.BspServiceMaster: cleanUpZooKeeper: Got 2 of 8 desired children
from /_hadoopBsp/giraph_yarn_application_1390861968364_0050/_cleanedUpDir
14/01/30 09:35:38 INFO master.BspServiceMaster: cleanedUpZooKeeper: Waiting for the children
of /_hadoopBsp/giraph_yarn_application_1390861968364_0050/_cleanedUpDir to change since only
got 2 nodes.
14/01/30 09:35:38 INFO bsp.BspService: process: cleanedUpChildrenChanged signaled
14/01/30 09:35:38 INFO master.BspServiceMaster: cleanUpZooKeeper: Got 5 of 8 desired children
from /_hadoopBsp/giraph_yarn_application_1390861968364_0050/_cleanedUpDir
14/01/30 09:35:38 INFO master.BspServiceMaster: cleanedUpZooKeeper: Waiting for the children
of /_hadoopBsp/giraph_yarn_application_1390861968364_0050/_cleanedUpDir to change since only
got 5 nodes.
14/01/30 09:35:38 INFO bsp.BspService: process: cleanedUpChildrenChanged signaled
14/01/30 09:35:38 INFO master.BspServiceMaster: cleanUpZooKeeper: Got 6 of 8 desired children
from /_hadoopBsp/giraph_yarn_application_1390861968364_0050/_cleanedUpDir
14/01/30 09:35:38 INFO master.BspServiceMaster: cleanedUpZooKeeper: Waiting for the children
of /_hadoopBsp/giraph_yarn_application_1390861968364_0050/_cleanedUpDir to change since only
got 6 nodes.
14/01/30 09:35:38 INFO bsp.BspService: process: cleanedUpChildrenChanged signaled
14/01/30 09:35:38 INFO master.BspServiceMaster: cleanUpZooKeeper: Got 9 of 8 desired children
from /_hadoopBsp/giraph_yarn_application_1390861968364_0050/_cleanedUpDir
14/01/30 09:35:38 INFO master.BspServiceMaster: cleanedUpZooKeeper: Waiting for the children
of /_hadoopBsp/giraph_yarn_application_1390861968364_0050/_cleanedUpDir to change since only
got 9 nodes.