Subject: Re: Missing state in RocksDB checkpoints
From: Ning Shi
Date: Wed, 24 Apr 2019 08:09:30 -0400
To: Till Rohrmann
Cc: Congxian Qiu, user, Stefan Richter
Message-Id: <45C1DC63-5AC0-4AE7-BC8D-73830642BAA9@gmail.com>

Till,

Thank you for escalating this to blocker. I agree that data loss is always a serious issue.

For reference, the workaround is to unchain the stateful operators. To let the new job recover from the previous checkpoint, we also had to change the UID of the operator that was missing state and restore with the allow-non-restored-state argument. Otherwise, it would fail with RocksDB errors on restore.

--
Ning

> On Apr 24, 2019, at 5:02 AM, Till Rohrmann wrote:
>
> Thanks for reporting this issue, Ning. I think this is actually a blocker for the next release and should be fixed right away. For future reference, here is the issue [1].
>
> I've also pulled in Stefan, who knows these components very well.
>
> [1] https://issues.apache.org/jira/browse/FLINK-12296
>
> Cheers,
> Till
>
>> On Tue, Apr 23, 2019 at 5:24 PM Ning Shi wrote:
>> On Tue, 23 Apr 2019 10:53:52 -0400,
>> Congxian Qiu wrote:
>> > Sorry for the confusion. In my previous email, I only meant to say that the problem is not caused by the UUID generation; it is caused by different operators sharing the same directory (because Flink currently uses the JobVertex as the directory).
>>
>> Ah, thank you for the clarification, Congxian. That makes sense.
>>
>> Ning

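
The cause described in the thread (chained operators sharing one directory because the directory is derived from the JobVertex) and the unchaining workaround can be illustrated with a toy model. This is only a sketch and not Flink code: the function names, the vertex ids, and the operator names "dedup" and "aggregate" are all hypothetical, and the "last writer wins" behaviour is a simplifying assumption about how the collision shows up.

```python
# Toy model only: this is NOT Flink code. It mimics the behaviour described in
# the thread, where the checkpoint directory is derived from the job vertex
# rather than from the individual operator.

def checkpoint_dirs(vertices):
    """Map each directory name to the operator whose state ends up there.

    `vertices` maps a (hypothetical) job-vertex id to the list of stateful
    operators chained into that vertex.
    """
    dirs = {}
    for vertex_id, operators in vertices.items():
        for op in operators:
            # Assumption: operators in the same vertex write to the same
            # directory, so a later operator overwrites an earlier one.
            dirs[vertex_id] = op
    return dirs


def lost_operators(vertices):
    """Return the operators whose state did not survive in any directory."""
    survivors = set(checkpoint_dirs(vertices).values())
    all_ops = [op for ops in vertices.values() for op in ops]
    return sorted(op for op in all_ops if op not in survivors)


# Two stateful operators chained into one vertex: they collide.
chained = {"vertex-1": ["dedup", "aggregate"]}

# Workaround from the thread: unchain them so each gets its own vertex.
unchained = {"vertex-1": ["dedup"], "vertex-2": ["aggregate"]}

print(lost_operators(chained))
print(lost_operators(unchained))
```

In an actual DataStream job, unchaining is usually done with `disableChaining()` or `startNewChain()` on the operator, the operator ID is set with `uid(...)`, and the restore flag is `--allowNonRestoredState` on `flink run`; how these apply to a particular job is not covered by the thread.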