Date: Mon, 28 Aug 2017 17:22:00 +0000 (UTC)
From: "ASF GitHub Bot (JIRA)"
To: giraph-dev@incubator.apache.org
Reply-To: dev@giraph.apache.org
Subject: [jira] [Commented] (GIRAPH-1139) Resuming from checkpoint doesn't work

    [ https://issues.apache.org/jira/browse/GIRAPH-1139?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16144084#comment-16144084 ]

ASF GitHub Bot commented on GIRAPH-1139:
----------------------------------------

Github user edunov commented on the issue:

    https://github.com/apache/giraph/pull/30

    I've committed this one. Let me know if there are any issues.

> Resuming from checkpoint doesn't work
> -------------------------------------
>
>                 Key: GIRAPH-1139
>                 URL: https://issues.apache.org/jira/browse/GIRAPH-1139
>             Project: Giraph
>          Issue Type: Bug
>          Components: bsp
>    Affects Versions: 1.2.0
>            Reporter: Nic Eggert
>
> I ran into a couple of issues when trying to get Giraph to resume from checkpoints (using mapreduce.max.attempts rather than GiraphJobRetryChecker).
> * If we just wrote a checkpoint, the master expects the workers to checkpoint again, while the workers (correctly) clear the checkpointing flag.
> * When workers restart, they take their task id from the partition number, which stays the same across multiple attempts. This gets transferred to the Netty clientId, and the server starts ignoring messages from restarted workers because it thinks it processed them already.
>
> I believe I've fixed these issues. I'll send a GitHub PR shortly.

--
This message was sent by Atlassian JIRA
(v6.4.14#64029)
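
The second point in the quoted report (a restarted worker reusing its old client id, so the server drops its messages as duplicates) can be illustrated with a small, self-contained sketch. This is not Giraph's code and not necessarily how the committed patch works: `DuplicateRequestFilter`, `ClientIds`, `MAX_ATTEMPTS`, and the id formula are all hypothetical names and choices made up for illustration.

```java
import java.util.HashMap;
import java.util.HashSet;
import java.util.Map;
import java.util.Set;

/** Hypothetical server-side filter that remembers (clientId, requestId) pairs it has handled. */
final class DuplicateRequestFilter {
    private final Map<Integer, Set<Long>> seen = new HashMap<>();

    /** Returns true if the request is new and should be processed, false if it looks like a duplicate. */
    boolean accept(int clientId, long requestId) {
        return seen.computeIfAbsent(clientId, k -> new HashSet<>()).add(requestId);
    }
}

/** Two hypothetical ways of deriving a client id for a worker task. */
final class ClientIds {
    /** Assumed upper bound on task attempts; an illustrative constant, not a Giraph setting. */
    static final int MAX_ATTEMPTS = 1000;

    /** The problematic scheme from the report: the id depends only on the partition number. */
    static int partitionOnly(int partition) {
        return partition;
    }

    /** One possible remedy: fold the attempt number into the id so each restart gets a fresh one. */
    static int partitionAndAttempt(int partition, int attempt) {
        if (attempt < 0 || attempt >= MAX_ATTEMPTS) {
            throw new IllegalArgumentException("attempt out of range: " + attempt);
        }
        return partition * MAX_ATTEMPTS + attempt;
    }
}

final class ResumeDemo {
    public static void main(String[] args) {
        DuplicateRequestFilter server = new DuplicateRequestFilter();

        // Attempt 0 of partition 3 sends request 0, then the worker restarts and
        // attempt 1 sends request 0 again under the same partition-only id.
        boolean first = server.accept(ClientIds.partitionOnly(3), 0L);
        boolean afterRestart = server.accept(ClientIds.partitionOnly(3), 0L);
        System.out.println("partition-only id: first=" + first + ", afterRestart=" + afterRestart);

        // With the attempt folded in, the restarted worker is treated as a new client.
        DuplicateRequestFilter server2 = new DuplicateRequestFilter();
        boolean first2 = server2.accept(ClientIds.partitionAndAttempt(3, 0), 0L);
        boolean afterRestart2 = server2.accept(ClientIds.partitionAndAttempt(3, 1), 0L);
        System.out.println("attempt-aware id: first=" + first2 + ", afterRestart=" + afterRestart2);
    }
}
```

The sketch only shows why an id that is stable across attempts collides with duplicate detection. A real fix also has to keep ids unique across partitions and within whatever integer range the transport expects.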