giraph-user mailing list archives

From José Luis Larroque <larroques...@gmail.com>
Subject Re: RELATION BETWEEN THE NUMBER OF GIRAPH WORKERS AND THE PROBLEM SIZE
Date Sat, 25 Feb 2017 13:45:27 GMT
Hi Ganesh,

For some reason, some of your workers are dying. When that happens, Giraph
automatically detects that the number of workers is below what is needed (in
"barrierOnWorkerList") and searches for a checkpoint (a checkpoint is a
backup of the state of a Giraph application). Apparently you don't have
checkpointing enabled, so the entire job is being killed. I recommend that
you look in your container logs and try to find out why one or more workers
are dying when you use bigger files.
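As a side note, checkpointing can be turned on through Giraph's custom
arguments. A minimal sketch, assuming a GiraphRunner launch similar to
yours (the jar, computation class, input/output formats, and paths below
are placeholders; `giraph.checkpointFrequency` and
`giraph.checkpointDirectory` are the standard option names):

```shell
# Sketch only: enable checkpointing every 2 supersteps so Giraph can
# roll back to the last good checkpoint instead of killing the job
# when a worker is lost. Jar, class names, and HDFS paths are
# placeholders; adapt them to your setup.
hadoop jar giraph-examples.jar org.apache.giraph.GiraphRunner \
  org.apache.giraph.examples.SimpleShortestPathsComputation \
  -vif org.apache.giraph.io.formats.JsonLongDoubleFloatDoubleVertexInputFormat \
  -vip /user/you/input/graph.txt \
  -vof org.apache.giraph.io.formats.IdWithValueTextOutputFormat \
  -op /user/you/output \
  -w 2 \
  -ca giraph.checkpointFrequency=2 \
  -ca giraph.checkpointDirectory=/user/you/_checkpoints
```

With a frequency of 2, a checkpoint is written every two supersteps, so
`getLastGoodCheckpoint` would have something to restart from. Note that
this only masks the failure; finding why the workers die is still the
real fix.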

Bye!



-- 
*José Luis Larroque*
University Programmer Analyst - Facultad de Informática - UNLP
Java and .NET Developer at LIFIA

2017-02-25 3:24 GMT-03:00 Sai Ganesh Muthuraman <saiganeshpsn@gmail.com>:

> Hi,
>
> I used one worker per node, and that worked for smaller files. When the
> file size was more than 25 MB, I got this strange exception. I tried using
> 2 nodes and 3 nodes; the result was the same.
>
> ERROR [org.apache.giraph.master.MasterThread] master.BspServiceMaster
> (BspServiceMaster.java:barrierOnWorkerList(1415)) - barrierOnWorkerList:
> Missing chosen workers [Worker(hostname=comet-10-68.sdsc.edu,
> MRtaskID=2, port=30002)] on superstep 2
> FATAL [org.apache.giraph.master.MasterThread] master.BspServiceMaster
> (BspServiceMaster.java:getLastGoodCheckpoint(1291)) -
> getLastGoodCheckpoint: No last good checkpoints can be found, killing the
> job.
> java.io.FileNotFoundException: File hdfs://comet-10-33.ibnet:54310/user/saiganes/_bsp/_checkpoints/giraph_yarn_application_1488002378889_0001
> does not exist.
>         at org.apache.hadoop.hdfs.DistributedFileSystem.listStatusInternal(DistributedFileSystem.java:697)
>         at org.apache.hadoop.hdfs.DistributedFileSystem.access$600(DistributedFileSystem.java:105)
>         at org.apache.hadoop.hdfs.DistributedFileSystem$15.doCall(DistributedFileSystem.java:755)
>         at org.apache.hadoop.hdfs.DistributedFileSystem$15.doCall(DistributedFileSystem.java:751)
>         at org.apache.hadoop.fs.FileSystemLinkResolver.resolve(FileSystemLinkResolver.java:81)
>         at org.apache.hadoop.hdfs.DistributedFileSystem.listStatus(DistributedFileSystem.java:751)
>         at org.apache.hadoop.fs.FileSystem.listStatus(FileSystem.java:1485)
>         at org.apache.hadoop.fs.FileSystem.listStatus(FileSystem.java:1525)
>         at org.apache.giraph.utils.CheckpointingUtils.getLastCheckpointedSuperstep(CheckpointingUtils.java:107)
>         at org.apache.giraph.bsp.BspService.getLastCheckpointedSuperstep(BspService.java:1196)
>         at org.apache.giraph.master.BspServiceMaster.getLastGoodCheckpoint(BspServiceMaster.java:1289)
>         at org.apache.giraph.master.MasterThread.run(MasterThread.java:149)
>
>
> - Sai Ganesh
>
>
