giraph-user mailing list archives

From Denis Dudinski <denis.dudin...@gmail.com>
Subject Automatic restart from checkpoint after worker failure on YARN
Date Wed, 28 Nov 2018 12:27:21 GMT
Hi!

I am running Giraph on YARN with checkpointing enabled. When a worker fails, the master
node outputs:

18/11/28 12:52:31 INFO master.MasterThread: masterThread: Coordination of superstep 3 took 0.094 seconds ended with state WORKER_FAILURE and is now on superstep 3
18/11/28 12:52:31 INFO master.BspServiceMaster: setJobState: {"_applicationAttemptKey":1,"_stateKey":"START_SUPERSTEP","_superstepKey":2} on superstep 2
18/11/28 12:52:31 INFO master.BspServiceMaster: setJobState: {"_applicationAttemptKey":1,"_stateKey":"START_SUPERSTEP","_superstepKey":2}
18/11/28 12:52:31 INFO yarn.GiraphYarnTask: [STATUS: task-0] MASTER_ONLY checkWorkers: Only found 0 responses of 2 needed to start superstep 2

After a while the job fails, because the timeout expires and no workers are present.
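
The excerpt above can also be checked mechanically. Below is a minimal sketch in Python that assumes only the line formats shown in this message. The helper names job_states and worker_responses are hypothetical and not part of Giraph. It pulls out the job state written by the master and the worker response counts:

```python
import json
import re

# Log lines copied from the master output in this message (wrapped lines rejoined).
LOG_LINES = [
    "18/11/28 12:52:31 INFO master.MasterThread: masterThread: Coordination of "
    "superstep 3 took 0.094 seconds ended with state WORKER_FAILURE and is now "
    "on superstep 3",
    '18/11/28 12:52:31 INFO master.BspServiceMaster: setJobState: '
    '{"_applicationAttemptKey":1,"_stateKey":"START_SUPERSTEP","_superstepKey":2} '
    "on superstep 2",
    '18/11/28 12:52:31 INFO master.BspServiceMaster: setJobState: '
    '{"_applicationAttemptKey":1,"_stateKey":"START_SUPERSTEP","_superstepKey":2}',
    "18/11/28 12:52:31 INFO yarn.GiraphYarnTask: [STATUS: task-0] MASTER_ONLY "
    "checkWorkers: Only found 0 responses of 2 needed to start superstep 2",
]

# Assumed formats, taken only from the lines above.
JOB_STATE = re.compile(r"setJobState: (\{.*?\})")
CHECK_WORKERS = re.compile(
    r"found (\d+) responses of (\d+) needed to start superstep (\d+)"
)


def job_states(lines):
    """Return the decoded JSON payload of every setJobState line."""
    states = []
    for line in lines:
        match = JOB_STATE.search(line)
        if match:
            states.append(json.loads(match.group(1)))
    return states


def worker_responses(lines):
    """Return (found, needed, superstep) for every checkWorkers line."""
    results = []
    for line in lines:
        match = CHECK_WORKERS.search(line)
        if match:
            results.append(tuple(int(group) for group in match.groups()))
    return results


if __name__ == "__main__":
    for state in job_states(LOG_LINES):
        print(state["_stateKey"], "superstep", state["_superstepKey"])
    for found, needed, superstep in worker_responses(LOG_LINES):
        print(f"{found}/{needed} workers responded for superstep {superstep}")
```

On this excerpt both job-state writes name superstep 2 after the failure in superstep 3, and the only worker check finds 0 of the 2 required responses.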

Is it possible to resume automatically from a checkpoint on YARN, without falling back to the MapReduce driver?

Best Regards,
Denis Dudinski