From: Josh
Date: Tue, 11 Oct 2016 11:40:02 +0100
Subject: Exception when restoring state from RocksDB - how to recover?
To: user@flink.apache.org

Hi all,

I just have a couple of questions about checkpointing and restoring state from RocksDB.

1) In some cases, I find that it is impossible to restore a job from a checkpoint, due to an exception such as the one pasted below[*]. In this case, it appears that the last checkpoint is somehow corrupt. Does anyone know why this might happen?

2) When the above happens, I have no choice but to cancel the job, as it repeatedly attempts to restart and keeps getting the same exception. Given that no savepoint was taken recently, is it possible for me to restore the job from an older checkpoint (e.g. the second-last checkpoint)?

The version of Flink I'm using is Flink-1.1-SNAPSHOT, from mid-June.

Thanks,
Josh

[*] The exception when restoring state:

java.lang.Exception: Could not restore checkpointed state to operators and functions
	at org.apache.flink.streaming.runtime.tasks.StreamTask.restoreState(StreamTask.java:480)
	at org.apache.flink.streaming.runtime.tasks.StreamTask.invoke(StreamTask.java:219)
	at org.apache.flink.runtime.taskmanager.Task.run(Task.java:588)
	at java.lang.Thread.run(Thread.java:745)
Caused by: java.lang.RuntimeException: Error while restoring RocksDB state from /mnt/yarn/usercache/hadoop/appcache/application_1476181294189_0001/flink-io-09ad1cb1-8dff-4f9a-9f61-6cae27ee6f1d/d236820a793043bd63360df6f175cae9/StreamFlatMap_9_8/dummy_state/dc5beab1-68fb-48b3-b3d6-272497d15a09/chk-1
	at org.apache.flink.contrib.streaming.state.RocksDBStateBackend.restoreFromSemiAsyncSnapshot(RocksDBStateBackend.java:537)
	at org.apache.flink.contrib.streaming.state.RocksDBStateBackend.injectKeyValueStateSnapshots(RocksDBStateBackend.java:489)
	at org.apache.flink.streaming.api.operators.AbstractStreamOperator.restoreState(AbstractStreamOperator.java:204)
	at org.apache.flink.streaming.api.operators.AbstractUdfStreamOperator.restoreState(AbstractUdfStreamOperator.java:154)
	at org.apache.flink.streaming.runtime.tasks.StreamTask.restoreState(StreamTask.java:472)
	... 3 more
Caused by: org.rocksdb.RocksDBException: NotFound: Backup not found
	at org.rocksdb.BackupEngine.restoreDbFromLatestBackup(Native Method)
	at org.rocksdb.BackupEngine.restoreDbFromLatestBackup(BackupEngine.java:177)
	at org.apache.flink.contrib.streaming.state.RocksDBStateBackend.restoreFromSemiAsyncSnapshot(RocksDBStateBackend.java:535)
	... 7 more