From: Márton Balassi
Date: Wed, 10 Jun 2015 10:08:09 +0200
Subject: Re: Force enabling checkpoints for iterative streaming jobs
To: dev@flink.apache.org

I agree that, for the sake of the above-mentioned use cases, it is
reasonable to add this to the release with the right documentation. For
machine learning, potentially losing one round of feedback data should not
matter. Let us not block prominent users on this until the next release.

On Wed, Jun 10, 2015 at 8:09 AM, Gyula Fóra wrote:

> As for people currently suffering from it:
>
> An application King is developing requires iterations, and they need
> checkpoints. Practically all SAMOA programs would need this.
>
> It is very likely that the state interfaces will be changed after the
> release, so this is not something that we can just add later. I don't see
> a reason why we should not add it, as it is clearly documented. In this
> actual case, not having guarantees at all means people will never use it
> in any production system. Having limited guarantees means that it will
> depend on the application.
>
> On Wed, Jun 10, 2015 at 12:53 AM, Ufuk Celebi wrote:
>
> > Hey Gyula,
> >
> > I understand your reasoning, but I don't think it's worth rushing this
> > into the release.
> >
> > As you've said, we cannot give precise guarantees. But this is arguably
> > one of the key requirements for any fault tolerance mechanism. Therefore
> > I disagree that this is better than not having anything at all. I think
> > it will already go a long way to have the non-iterative case working
> > reliably.
> >
> > And as far as I know there are no users really suffering from this at
> > the moment (in the sense that someone has complained on the mailing
> > list).
> >
> > Hence, I vote to postpone this.
> >
> > – Ufuk
> >
> > On 10 Jun 2015, at 00:19, Gyula Fóra wrote:
> >
> > > Hey all,
> > >
> > > It is currently impossible to enable state checkpointing for
> > > iterative jobs, because an exception is thrown when creating the
> > > jobgraph. This behaviour is motivated by the lack of precise
> > > guarantees that we can give with the current fault-tolerance
> > > implementations for cyclic graphs.
> > >
> > > This PR adds an optional flag to force checkpoints even in case of
> > > iterations. The algorithm will take checkpoints periodically as
> > > before, but records in transit inside the loop will be lost.
> > >
> > > However, even this guarantee is enough for most applications (Machine
> > > Learning for instance) and certainly much better than not having
> > > anything at all.
> > >
> > > I suggest we add this to the 0.9 release, as currently many
> > > applications suffer from this limitation (SAMOA, ML pipelines, graph
> > > streaming etc.)
> > >
> > > Cheers,
> > >
> > > Gyula
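
The limited guarantee discussed in this thread (operator state is checkpointed, but records in transit inside the loop are lost) can be illustrated with a toy model. This is only a sketch under stated assumptions, not Flink code: the class `ToyLoopJob`, its methods, the halving feedback rule, and the helper `run` are all hypothetical names invented for the illustration.

```python
from collections import deque


class ToyLoopJob:
    """Hypothetical toy model of an iterative job (not the Flink API)."""

    def __init__(self):
        self.state = 0            # operator state, included in checkpoints
        self.in_flight = deque()  # records on the feedback edge, NOT checkpointed
        self._snapshot = 0

    def process(self, record):
        # Fold the record into the state and feed half of it back into the loop.
        self.state += record
        feedback = record // 2
        if feedback > 0:
            self.in_flight.append(feedback)

    def drain(self):
        # Keep iterating until the feedback edge is empty.
        while self.in_flight:
            self.process(self.in_flight.popleft())

    def checkpoint(self):
        # Assumption of this sketch: only operator state is snapshotted.
        self._snapshot = self.state

    def restore(self):
        # Recovery rolls the state back and drops whatever was inside the loop.
        lost = list(self.in_flight)
        self.in_flight.clear()
        self.state = self._snapshot
        return lost


def run(fail_after_checkpoint):
    job = ToyLoopJob()
    job.process(8)       # one input record enters the loop
    job.checkpoint()     # snapshot taken while feedback is still in transit
    lost = job.restore() if fail_after_checkpoint else []
    job.drain()
    return job.state, lost


if __name__ == "__main__":
    print(run(False))  # failure-free run
    print(run(True))   # failure right after the checkpoint
```

In the failure-free run the feedback records 4, 2 and 1 are all folded into the state. In the run with a failure, the state survives at its checkpointed value but the feedback record that was in transit is gone. Whether that loss is acceptable depends on the application, which is the trade-off the thread is debating.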