giraph-dev mailing list archives

From "ASF GitHub Bot (JIRA)" <j...@apache.org>
Subject [jira] [Commented] (GIRAPH-1139) Resuming from checkpoint doesn't work
Date Mon, 17 Apr 2017 16:33:41 GMT

    [ https://issues.apache.org/jira/browse/GIRAPH-1139?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15971306#comment-15971306 ]

ASF GitHub Bot commented on GIRAPH-1139:
----------------------------------------

Github user edunov commented on a diff in the pull request:

    https://github.com/apache/giraph/pull/30#discussion_r111767933
  
    --- Diff: giraph-core/src/main/java/org/apache/giraph/master/BspServiceMaster.java ---
    @@ -1734,7 +1735,7 @@ private CheckpointStatus getCheckpointStatus(long superstep) {
         if (checkpointFrequency == 0) {
           return CheckpointStatus.NONE;
         }
    -    long firstCheckpoint = INPUT_SUPERSTEP + 1 + checkpointFrequency;
    +    long firstCheckpoint = INPUT_SUPERSTEP + 1;
    --- End diff --
    
    What is the reason for changing this? Do you want it to always checkpoint after the first superstep?
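    To make the question concrete, here is a minimal sketch of how the two values of firstCheckpoint would shift the checkpoint schedule. It is a simplified model, not the actual BspServiceMaster logic: it assumes INPUT_SUPERSTEP is -1 and that a checkpoint is taken whenever (superstep - firstCheckpoint) is a non-negative multiple of checkpointFrequency. The class name and the shouldCheckpoint and schedule helpers are hypothetical.

```java
import java.util.ArrayList;
import java.util.List;

final class CheckpointScheduleSketch {
  // Assumption: the input superstep is numbered -1, so superstep 0 is the
  // first compute superstep.
  static final long INPUT_SUPERSTEP = -1;

  /** Hypothetical model of the schedule rule, not the real Giraph method. */
  static boolean shouldCheckpoint(long superstep, long firstCheckpoint,
      int checkpointFrequency) {
    if (checkpointFrequency == 0) {
      return false; // checkpointing disabled
    }
    if (superstep < firstCheckpoint) {
      return false; // too early for the first checkpoint
    }
    return (superstep - firstCheckpoint) % checkpointFrequency == 0;
  }

  /** Lists every superstep in [0, lastSuperstep] that would checkpoint. */
  static List<Long> schedule(long firstCheckpoint, int checkpointFrequency,
      long lastSuperstep) {
    List<Long> result = new ArrayList<>();
    for (long s = 0; s <= lastSuperstep; s++) {
      if (shouldCheckpoint(s, firstCheckpoint, checkpointFrequency)) {
        result.add(s);
      }
    }
    return result;
  }

  public static void main(String[] args) {
    int checkpointFrequency = 3;
    long before = INPUT_SUPERSTEP + 1 + checkpointFrequency; // 3
    long after = INPUT_SUPERSTEP + 1;                        // 0
    System.out.println("before: " + schedule(before, checkpointFrequency, 9));
    System.out.println("after:  " + schedule(after, checkpointFrequency, 9));
    // prints "before: [3, 6, 9]" and "after:  [0, 3, 6, 9]"
  }
}
```

    Under these assumptions the changed line adds one extra checkpoint at superstep 0 and leaves the later ones in place. Whether that holds in the real method depends on the code surrounding the changed line, which the diff does not show.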


> Resuming from checkpoint doesn't work
> -------------------------------------
>
>                 Key: GIRAPH-1139
>                 URL: https://issues.apache.org/jira/browse/GIRAPH-1139
>             Project: Giraph
>          Issue Type: Bug
>          Components: bsp
>    Affects Versions: 1.2.0
>            Reporter: Nic Eggert
>
> I ran into a couple of issues when trying to get Giraph to resume from checkpoints (using
> mapreduce.max.attempts rather than GiraphJobRetryChecker).
> * If we just wrote a checkpoint, the master expects the workers to checkpoint again,
>   while the workers (correctly) clear the checkpointing flag.
> * When workers restart, they take their task id from the partition number, which stays
>   the same across multiple attempts. This gets transferred to the Netty clientId, and the server
>   starts ignoring messages from restarted workers because it thinks it processed them already.
> I believe I've fixed these issues. I'll send a GitHub PR shortly.
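
The second issue in the description above can be illustrated with a small sketch. It assumes the server filters duplicates by remembering each (clientId, requestId) pair it has seen. The Server class and the clientIdFor helper are hypothetical and are not Giraph's classes. Folding the attempt number into the id is only one conceivable remedy and is not necessarily what the pull request does.

```java
import java.util.HashMap;
import java.util.HashSet;
import java.util.Map;
import java.util.Set;

final class DuplicateRequestSketch {
  /** Hypothetical server-side filter keyed by (clientId, requestId). */
  static final class Server {
    private final Map<Integer, Set<Long>> seen = new HashMap<>();

    /** Returns true if the request is processed, false if dropped as a duplicate. */
    boolean accept(int clientId, long requestId) {
      return seen.computeIfAbsent(clientId, k -> new HashSet<>()).add(requestId);
    }
  }

  /** Hypothetical id scheme that makes the client id unique per attempt. */
  static int clientIdFor(int partition, int attempt, int maxPartitions) {
    return attempt * maxPartitions + partition;
  }

  public static void main(String[] args) {
    Server server = new Server();
    int partition = 7;

    // First attempt: client id taken directly from the partition number.
    System.out.println(server.accept(partition, 0L)); // true

    // Restarted worker reuses the same id and starts request ids from 0 again,
    // so the server treats its first request as already processed.
    System.out.println(server.accept(partition, 0L)); // false

    // With the attempt number folded in, the restarted worker looks new.
    int retryId = clientIdFor(partition, 1, 1000);
    System.out.println(server.accept(retryId, 0L)); // true
  }
}
```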



--
This message was sent by Atlassian JIRA
(v6.3.15#6346)
