giraph-dev mailing list archives

From: Wei Yan <ywsk...@gmail.com>
Subject: Have a question regarding restart from last checkpoint
Date: Wed, 30 Apr 2014 03:32:33 GMT
Hi, guys,

I have a question about how Giraph restarts from the last checkpoint after a
worker failure.

I ran an example with 5 workers and 1 master. Two workers were preempted while
the job was running, but I found that the other 3 workers also quit. I checked
the code and found the following in
BspServiceWorker.processEvent(WatchedEvent event):

// Master job state says START_SUPERSTEP under an application attempt that
// differs from this worker's current attempt: log FATAL and exit the worker JVM.
if ((ApplicationState.valueOf(jsonObj.getString(JSONOBJ_STATE_KEY)) ==
    ApplicationState.START_SUPERSTEP) &&
    jsonObj.getLong(JSONOBJ_APPLICATION_ATTEMPT_KEY) !=
    getApplicationAttempt()) {
        LOG.fatal("processEvent: Worker will restart " +
            "from command - " + jsonObj.toString());
        System.exit(-1);
}

Does this mean that all the "good" workers also need to quit and the job has to
request resources again? BTW, I am using the pure YARN mode with
Giraph-1.1.0-SNAPSHOT.
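
To double-check my reading of that condition, here is a minimal standalone
sketch (not Giraph code; it just uses plain org.json, and myApplicationAttempt
is a made-up stand-in for getApplicationAttempt()) that plugs in the job-state
JSON from the log below:

import org.json.JSONObject;

public class RestartCheckSketch {
  public static void main(String[] args) throws Exception {
    // Master job state, copied from the _masterJobState contents in the log below.
    JSONObject jsonObj = new JSONObject(
        "{\"_stateKey\":\"START_SUPERSTEP\","
            + "\"_applicationAttemptKey\":1,\"_superstepKey\":24}");

    // Assumed value: the application attempt this still-healthy worker was started under.
    long myApplicationAttempt = 0;

    boolean willRestart =
        "START_SUPERSTEP".equals(jsonObj.getString("_stateKey"))
            && jsonObj.getLong("_applicationAttemptKey") != myApplicationAttempt;

    // 1 != 0, so the check is true for every live worker, not only the failed ones.
    System.out.println("willRestart = " + willRestart);
  }
}

If my reading is right, once the master bumps the attempt to 1 while the healthy
workers are presumably still on attempt 0, this check fires on all of them,
which would explain why my 3 remaining workers exited.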

The following is the log from one "good" worker:

2014-04-29 21:56:55,284 INFO  [main-EventThread] worker.BspServiceWorker
(BspServiceWorker.java:processEvent(1604)) - processEvent: Job state
changed, checking to see if it needs to restart
2014-04-29 21:56:55,285 INFO  [main-EventThread] bsp.BspService
(BspService.java:getJobState(695)) - getJobState: Job state already exists
(/_hadoopBsp/giraph_yarn_application_1398826558049_0001/_masterJobState)
2014-04-29 21:56:55,287 FATAL [main-EventThread] worker.BspServiceWorker
(BspServiceWorker.java:processEvent(1619)) - processEvent: Worker will
restart from command -
{"_stateKey":"START_SUPERSTEP","_applicationAttemptKey":1,"_superstepKey":24}

Thanks for any help!
