flink-dev mailing list archives

From Gyula Fóra <gyula.f...@gmail.com>
Subject Re: Force enabling checkpoints for iterative streaming jobs
Date Wed, 10 Jun 2015 09:27:37 GMT
The other tests verify that the checkpointing algorithm runs properly. That
also ensures that it runs for iterations because a loop is just an extra
source and sink in the jobgraph (so it is the same for the algorithm).

Fabian Hueske <fhueske@gmail.com> wrote (on Wed, 10 Jun 2015, 11:19):

> Without going into the details, how well tested is this feature? The PR
> only extends one test by a few lines.
>
> Is that really enough to ensure that
> 1) the change does not cause trouble
> 2) is working as expected
>
> If this feature should go into the release, it must be thoroughly checked
> and we must take the time for that.
> Including code and hoping for the best because time is scarce is not an
> option IMO.
>
> Fabian
>
>
> 2015-06-10 11:05 GMT+02:00 Gyula Fóra <gyula.fora@gmail.com>:
>
> > And also I would like to remind everyone that any fault tolerance we
> > provide is only as good as the fault tolerance of the master node, which
> > is nonexistent at the moment.
> >
> > So I don't see a reason why a user should not be able to choose whether
> > he wants state checkpoints for iterations as well.
> >
> > In any case this will be used by King for instance, so making it part of
> > the release would save a lot of work for everyone.
> >
> > Paris Carbone <parisc@kth.se> wrote (on Wed, 10 Jun 2015, 10:29):
> >
> > >
> > > To continue Gyula's point, for consistent snapshots we need to persist
> > > the records in transit within the loop and also slightly change the
> > > current protocol, since it works only for DAGs. Before going in that
> > > direction, though, I would propose we first see whether there is a nice
> > > way to make iterations more structured.
> > >
> > > Paris
> > > ________________________________________
> > > From: Gyula Fóra <gyula.fora@gmail.com>
> > > Sent: Wednesday, June 10, 2015 10:19 AM
> > > To: dev@flink.apache.org
> > > Subject: Re: Force enabling checkpoints for iterative streaming jobs
> > >
> > > I disagree. Not having checkpointed operators inside the iteration
> > > still breaks the guarantees.
> > >
> > > It is not about the states; it is about the loop itself.
> > > On Wed, Jun 10, 2015 at 10:12 AM Aljoscha Krettek <aljoscha@apache.org>
> > > wrote:
> > >
> > > > This is the answer I gave on the PR (we should have one place for
> > > > discussing this, though):
> > > >
> > > > > I would be against merging this in the current form. What I propose
> > > > > is to analyse the topology to verify that there are no checkpointed
> > > > > operators inside iterations. Operators before and after iterations
> > > > > can be checkpointed and we can safely allow the user to enable
> > > > > checkpointing.
> > > >
> > > > If we have the code to analyse which operators are inside iterations
> > > > we could also disallow windows inside iterations. I think windows
> > > > inside iterations don't make sense since elements in different
> > > > "iterations" would end up in the same window. Maybe I'm wrong here
> > > > though, then please correct me.
> > > >
> > > > On Wed, Jun 10, 2015 at 10:08 AM, Márton Balassi
> > > > <balassi.marton@gmail.com> wrote:
> > > > > > I agree that for the sake of the above-mentioned use cases it is
> > > > > > reasonable to add this to the release with the right documentation;
> > > > > > for machine learning, potentially losing one round of feedback data
> > > > > > should not matter.
> > > > >
> > > > > Let us not block prominent users until the next release on this.
> > > > >
> > > > > > On Wed, Jun 10, 2015 at 8:09 AM, Gyula Fóra <gyula.fora@gmail.com>
> > > > > > wrote:
> > > > >
> > > > >> As for people currently suffering from it:
> > > > >>
> > > > > >> An application King is developing requires iterations, and they
> > > > > >> need checkpoints. Practically all SAMOA programs would need this.
> > > > >>
> > > > > >> It is very likely that the state interfaces will be changed after
> > > > > >> the release, so this is not something that we can just add later.
> > > > > >> I don't see a reason why we should not add it, as it is clearly
> > > > > >> documented. In this actual case not having guarantees at all means
> > > > > >> people will never use it in any production system. Having limited
> > > > > >> guarantees means that it will depend on the application.
> > > > >>
> > > > > >> On Wed, Jun 10, 2015 at 12:53 AM, Ufuk Celebi <uce@apache.org>
> > > > > >> wrote:
> > > > >>
> > > > >> > Hey Gyula,
> > > > >> >
> > > > > >> > I understand your reasoning, but I don't think it's worth
> > > > > >> > rushing this into the release.
> > > > >> >
> > > > > >> > As you've said, we cannot give precise guarantees. But this is
> > > > > >> > arguably one of the key requirements for any fault tolerance
> > > > > >> > mechanism. Therefore I disagree that this is better than not
> > > > > >> > having anything at all. I think it will already go a long way to
> > > > > >> > have the non-iterative case working reliably.
> > > > >> >
> > > > > >> > And as far as I know there are no users really suffering from
> > > > > >> > this at the moment (in the sense that someone has complained on
> > > > > >> > the mailing list).
> > > > >> >
> > > > >> > Hence, I vote to postpone this.
> > > > >> >
> > > > >> > – Ufuk
> > > > >> >
> > > > > >> > On 10 Jun 2015, at 00:19, Gyula Fóra <gyfora@apache.org> wrote:
> > > > >> >
> > > > >> > > Hey all,
> > > > >> > >
> > > > > >> > > It is currently impossible to enable state checkpointing for
> > > > > >> > > iterative jobs, because an exception is thrown when creating
> > > > > >> > > the jobgraph. This behaviour is motivated by the lack of
> > > > > >> > > precise guarantees that we can give with the current
> > > > > >> > > fault-tolerance implementations for cyclic graphs.
> > > > >> > >
> > > > > >> > > This PR <https://github.com/apache/flink/pull/812> adds an
> > > > > >> > > optional flag to force checkpoints even in case of iterations.
> > > > > >> > > The algorithm will take checkpoints periodically as before,
> > > > > >> > > but records in transit inside the loop will be lost.
> > > > >> > >
> > > > > >> > > However, even this guarantee is enough for most applications
> > > > > >> > > (Machine Learning for instance) and certainly much better than
> > > > > >> > > not having anything at all.
> > > > >> > >
> > > > >> > >
> > > > > >> > > I suggest we add this to the 0.9 release as currently many
> > > > > >> > > applications suffer from this limitation (SAMOA, ML pipelines,
> > > > > >> > > graph streaming etc.)
> > > > >> > >
> > > > >> > >
> > > > >> > > Cheers,
> > > > >> > >
> > > > >> > > Gyula
> > > > >> >
> > > > >> >
> > > > >>
> > > >
> > >
> >
>
