From: Flavio Pompermaier
Date: Thu, 26 Oct 2017 11:19:13 +0200
Subject: Re: State snapshotting when source is finite
To: Till Rohrmann
Cc: Fabian Hueske, user@flink.apache.org, Aljoscha Krettek

Done: https://issues.apache.org/jira/browse/FLINK-7930

Best,
Flavio

On Thu, Oct 26, 2017 at 10:52 AM, Till Rohrmann wrote:

> Hi Flavio,
>
> this kind of feature is indeed useful and currently not supported by
> Flink. I think, however, that this feature is a bit trickier to implement,
> because Tasks cannot currently initiate checkpoints/savepoints on their
> own. This would entail some changes to the lifecycle of a Task and an
> extra communication step with the JobManager. However, nothing impossible
> to do.
>
> Please open a JIRA issue with a description of the problem where we can
> continue the discussion.
>
> Cheers,
> Till
>
> On Thu, Oct 26, 2017 at 9:58 AM, Fabian Hueske wrote:
>
>> Hi Flavio,
>>
>> Thanks for bringing up this topic.
>> I think running periodic jobs with state that gets restored and
>> persisted in a savepoint is a very valid use case and would fit the
>> "stream is a superset of batch" story quite well.
>> I'm not sure if this behavior is already supported, but I think it would
>> be a desirable feature.
>>
>> I'm looping in Till and Aljoscha, who might have some thoughts on this
>> as well.
>> Depending on the discussion we should open a JIRA for this feature.
>>
>> Cheers, Fabian
>>
>> 2017-10-25 10:31 GMT+02:00 Flavio Pompermaier:
>>
>>> Hi to all,
>>> in my current use case I'd like to improve one step of our batch
>>> pipeline.
>>> There's one specific job that ingests a tabular dataset (of Rows) and
>>> explodes it into a set of RDF statements (as Tuples). The objects we
>>> output are containers of those Tuples (grouped by a field).
>>> Flink stateful streaming could be a perfect fit here because we
>>> incrementally grow the state of those containers without having to
>>> spend a lot of time performing GET operations against an external
>>> key-value store.
>>> The big problem is that the sources are finite and the state of the job
>>> gets lost once the job ends, while I was expecting Flink to snapshot
>>> the state of its operators before exiting.
>>>
>>> This idea was inspired by
>>> https://data-artisans.com/blog/queryable-state-use-case-demo#no-external-store,
>>> with the difference that one can resume the state of the stateful
>>> application only when required.
>>> Do you think it would be possible to support such a use case (which we
>>> can summarize as "periodic batch jobs that pick up where they left
>>> off")?
>>>
>>> Best,
>>> Flavio
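For context, here is a minimal sketch (against the Flink 1.3-era DataStream API) of the kind of job described above: a finite source feeding a keyed operator whose per-key state grows incrementally. The class names, the toy in-memory source, and the String-based statement encoding are illustrative only and are not taken from the thread. The point is that once the finite source is exhausted the job finishes and the MapState below is simply discarded, unless a savepoint is triggered externally beforehand (e.g. bin/flink savepoint <jobId>, later resumed with bin/flink run -s <savepointPath> ...); the feature discussed in the thread, now tracked in FLINK-7930, would snapshot this state automatically before the job exits.

import org.apache.flink.api.common.functions.RichFlatMapFunction;
import org.apache.flink.api.common.state.MapState;
import org.apache.flink.api.common.state.MapStateDescriptor;
import org.apache.flink.api.java.functions.KeySelector;
import org.apache.flink.api.java.tuple.Tuple3;
import org.apache.flink.configuration.Configuration;
import org.apache.flink.streaming.api.environment.StreamExecutionEnvironment;
import org.apache.flink.util.Collector;

public class RdfAccumulatorSketch {

    // Accumulates per-subject RDF statements in keyed state instead of
    // issuing GETs against an external key-value store.
    public static class StatementAccumulator
            extends RichFlatMapFunction<Tuple3<String, String, String>, String> {

        private transient MapState<String, String> statements;

        @Override
        public void open(Configuration parameters) {
            statements = getRuntimeContext().getMapState(
                    new MapStateDescriptor<>("statements", String.class, String.class));
        }

        @Override
        public void flatMap(Tuple3<String, String, String> stmt,
                            Collector<String> out) throws Exception {
            // Incrementally grow the container for this subject (the key).
            statements.put(stmt.f1, stmt.f2);
        }
    }

    public static void main(String[] args) throws Exception {
        StreamExecutionEnvironment env = StreamExecutionEnvironment.getExecutionEnvironment();
        env.enableCheckpointing(60_000);

        // Finite source: when it is exhausted the job finishes and, as
        // discussed in the thread, the keyed state above is lost today.
        env.fromElements(
                Tuple3.of("s1", "rdf:type", "foaf:Person"),
                Tuple3.of("s1", "foaf:name", "Flavio"))
           .keyBy(new KeySelector<Tuple3<String, String, String>, String>() {
               @Override
               public String getKey(Tuple3<String, String, String> t) {
                   return t.f0;
               }
           })
           .flatMap(new StatementAccumulator())
           .print();

        env.execute("periodic RDF accumulation sketch");
    }
}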