flink-dev mailing list archives

Site index · List index
Message view « Date » · « Thread »
Top « Date » · « Thread »
From Gyula Fóra <gyula.f...@gmail.com>
Subject Re: Force enabling checkpoints for iterative streaming jobs
Date Wed, 10 Jun 2015 12:29:50 GMT
Max suggested that I add this feature slightly hidden to the execution
config instance.

The problem then is that I either make a public field in the config or once
again add a method.

Any ideas?

Aljoscha Krettek <aljoscha@apache.org> ezt írta (időpont: 2015. jún. 10.,
Sze, 14:07):

> Thanks :D, now I see. It makes sense because we don't have another way
> of keeping the cluster state synced/distributed across parallel
> instances of the operators.
>
> On Wed, Jun 10, 2015 at 12:52 PM, Gyula Fóra <gyula.fora@gmail.com> wrote:
> > Here is an example for you:
> >
> > Parallel streaming kmeans, the state we keep is the current cluster
> > centers, and we use iterations to sync the centers across parallel
> > instances.
> > We can afford lost model updated in the loop but we need the checkpoint
> the
> > models.
> >
> >
> https://github.com/gyfora/stream-clustering/blob/master/src/main/scala/stream/clustering/StreamClustering.scala
> >
> > (checkpointing is not turned on but you will get the point)
> >
> >
> >
> > Gyula Fóra <gyula.fora@gmail.com> ezt írta (időpont: 2015. jún. 10.,
> Sze,
> > 12:47):
> >
> >> You are right, to have consistent results we would need to persist the
> >> records.
> >>
> >> But since we cannot do that right now, we can still checkpoint all
> >> operator states and understand that inflight records in the loop are
> lost
> >> on failure.
> >>
> >> This is acceptable for most the use-cases that we have developed so far
> >> for iterations (machine learning, graph updates, etc.) What is not
> >> acceptable is to not have checkpointing at all.
> >>
> >> Aljoscha Krettek <aljoscha@apache.org> ezt írta (időpont: 2015. jún.
> 10.,
> >> Sze, 12:43):
> >>
> >>> The elements that are in-flight in an iteration are also state of the
> >>> job. I'm wondering whether the state inside iterations still makes
> >>> sense without these in-flight elements. But I also don't know the King
> >>> use-case, that's why I though an example could be helpful.
> >>>
> >>> On Wed, Jun 10, 2015 at 12:37 PM, Gyula Fóra <gyula.fora@gmail.com>
> >>> wrote:
> >>> > I don't understand the question, I vote for checkpointing all state
> in
> >>> the
> >>> > job, even inside iterations (its more of a loop).
> >>> >
> >>> > Aljoscha Krettek <aljoscha@apache.org> ezt írta (időpont: 2015.
jún.
> >>> 10.,
> >>> > Sze, 12:34):
> >>> >
> >>> >> I don't understand why having the state inside an iteration but
not
> >>> >> the elements that correspond to this state or created this state
is
> >>> >> desirable. Maybe an example could help understand this better?
> >>> >>
> >>> >> On Wed, Jun 10, 2015 at 11:27 AM, Gyula Fóra <gyula.fora@gmail.com>
> >>> wrote:
> >>> >> > The other tests verify that the checkpointing algorithm runs
> >>> properly.
> >>> >> That
> >>> >> > also ensures that it runs for iterations because a loop is
just an
> >>> extra
> >>> >> > source and sink in the jobgraph (so it is the same for the
> >>> algorithm).
> >>> >> >
> >>> >> > Fabian Hueske <fhueske@gmail.com> ezt írta (időpont:
2015. jún.
> 10.,
> >>> >> Sze,
> >>> >> > 11:19):
> >>> >> >
> >>> >> >> Without going into the details, how well tested is this
feature?
> >>> The PR
> >>> >> >> only extends one test by a few lines.
> >>> >> >>
> >>> >> >> Is that really enough to ensure that
> >>> >> >> 1) the change does not cause trouble
> >>> >> >> 2) is working as expected
> >>> >> >>
> >>> >> >> If this feature should go into the release, it must be
thoroughly
> >>> >> checked
> >>> >> >> and we must take the time for that.
> >>> >> >> Including code and hoping for the best because time is
scarce is
> >>> not an
> >>> >> >> option IMO.
> >>> >> >>
> >>> >> >> Fabian
> >>> >> >>
> >>> >> >>
> >>> >> >> 2015-06-10 11:05 GMT+02:00 Gyula Fóra <gyula.fora@gmail.com>:
> >>> >> >>
> >>> >> >> > And also I would like to remind everyone that any
fault
> tolerance
> >>> we
> >>> >> >> > provide is only as good as the fault tolerance of
the master
> node.
> >>> >> Which
> >>> >> >> is
> >>> >> >> > non existent at the moment.
> >>> >> >> >
> >>> >> >> > So I don't see a reason why a user should not be
able to choose
> >>> >> whether
> >>> >> >> he
> >>> >> >> > wants state checkpoints for iterations as well.
> >>> >> >> >
> >>> >> >> > In any case this will be used by King for instance,
so making
> it
> >>> part
> >>> >> of
> >>> >> >> > the release would save a lot of work for everyone.
> >>> >> >> >
> >>> >> >> > Paris Carbone <parisc@kth.se> ezt írta (időpont:
2015. jún.
> 10.,
> >>> Sze,
> >>> >> >> > 10:29):
> >>> >> >> >
> >>> >> >> > >
> >>> >> >> > > To continue Gyula's point, for consistent snapshots
we need
> to
> >>> >> persist
> >>> >> >> > the
> >>> >> >> > > records in transit within the loop  and also
slightly change
> the
> >>> >> >> current
> >>> >> >> > > protocol since it works only for DAGs. Before
going into that
> >>> >> direction
> >>> >> >> > > though I would propose we first see whether
there is a nice
> way
> >>> to
> >>> >> make
> >>> >> >> > > iterations more structured.
> >>> >> >> > >
> >>> >> >> > > Paris
> >>> >> >> > > ________________________________________
> >>> >> >> > > From: Gyula Fóra <gyula.fora@gmail.com>
> >>> >> >> > > Sent: Wednesday, June 10, 2015 10:19 AM
> >>> >> >> > > To: dev@flink.apache.org
> >>> >> >> > > Subject: Re: Force enabling checkpoints for
iterative
> streaming
> >>> jobs
> >>> >> >> > >
> >>> >> >> > > I disagree. Not having checkpointed operators
inside the
> >>> iteration
> >>> >> >> still
> >>> >> >> > > breaks the guarantees.
> >>> >> >> > >
> >>> >> >> > > It is not about the states it is about the loop
itself.
> >>> >> >> > > On Wed, Jun 10, 2015 at 10:12 AM Aljoscha Krettek
<
> >>> >> aljoscha@apache.org
> >>> >> >> >
> >>> >> >> > > wrote:
> >>> >> >> > >
> >>> >> >> > > > This is the answer I gave on the PR (we
should have one
> place
> >>> for
> >>> >> >> > > > discussing this, though):
> >>> >> >> > > >
> >>> >> >> > > > I would be against merging this in the
current form. What I
> >>> >> propose
> >>> >> >> is
> >>> >> >> > > > to analyse the topology to verify that
there are no
> >>> checkpointed
> >>> >> >> > > > operators inside iterations. Operators
before and after
> >>> iterations
> >>> >> >> can
> >>> >> >> > > > be checkpointed and we can safely allow
the user to enable
> >>> >> >> > > > checkpointing.
> >>> >> >> > > >
> >>> >> >> > > > If we have the code to analyse which operators
are inside
> >>> >> iterations
> >>> >> >> > > > we could also disallow windows inside iterations.
I think
> >>> windows
> >>> >> >> > > > inside iterations don't make sense since
elements in
> different
> >>> >> >> > > > "iterations" would end up in the same window.
Maybe I'm
> wrong
> >>> here
> >>> >> >> > > > though, then please correct me.
> >>> >> >> > > >
> >>> >> >> > > > On Wed, Jun 10, 2015 at 10:08 AM, Márton
Balassi
> >>> >> >> > > > <balassi.marton@gmail.com> wrote:
> >>> >> >> > > > > I agree that for the sake of the above
mentioned use
> cases
> >>> it is
> >>> >> >> > > > reasonable
> >>> >> >> > > > > to add this to the release with the
right documentation,
> for
> >>> >> >> machine
> >>> >> >> > > > > learning potentially loosing one round
of feedback data
> >>> should
> >>> >> not
> >>> >> >> > > > matter.
> >>> >> >> > > > >
> >>> >> >> > > > > Let us not block prominent users until
the next release
> on
> >>> this.
> >>> >> >> > > > >
> >>> >> >> > > > > On Wed, Jun 10, 2015 at 8:09 AM, Gyula
Fóra <
> >>> >> gyula.fora@gmail.com>
> >>> >> >> > > > wrote:
> >>> >> >> > > > >
> >>> >> >> > > > >> As for people currently suffering
from it:
> >>> >> >> > > > >>
> >>> >> >> > > > >> An application King is developing
requires iterations,
> and
> >>> they
> >>> >> >> need
> >>> >> >> > > > >> checkpoints. Practically all SAMOA
programs would need
> >>> this.
> >>> >> >> > > > >>
> >>> >> >> > > > >> It is very likely that the state
interfaces will be
> changed
> >>> >> after
> >>> >> >> > the
> >>> >> >> > > > >> release, so this is not something
that we can just add
> >>> later. I
> >>> >> >> > don't
> >>> >> >> > > > see a
> >>> >> >> > > > >> reason why we should not add it,
as it is clearly
> >>> documented.
> >>> >> In
> >>> >> >> > this
> >>> >> >> > > > >> actual case not having guarantees
at all means people
> will
> >>> >> never
> >>> >> >> use
> >>> >> >> > > it
> >>> >> >> > > > in
> >>> >> >> > > > >> any production system. Having
limited guarantees means
> >>> that it
> >>> >> >> will
> >>> >> >> > > > depend
> >>> >> >> > > > >> on the application.
> >>> >> >> > > > >>
> >>> >> >> > > > >> On Wed, Jun 10, 2015 at 12:53
AM, Ufuk Celebi <
> >>> uce@apache.org>
> >>> >> >> > wrote:
> >>> >> >> > > > >>
> >>> >> >> > > > >> > Hey Gyula,
> >>> >> >> > > > >> >
> >>> >> >> > > > >> > I understand your reasoning,
but I don't think its
> worth
> >>> to
> >>> >> rush
> >>> >> >> > > this
> >>> >> >> > > > >> into
> >>> >> >> > > > >> > the release.
> >>> >> >> > > > >> >
> >>> >> >> > > > >> > As you've said, we cannot
give precise guarantees. But
> >>> this
> >>> >> is
> >>> >> >> > > > arguably
> >>> >> >> > > > >> > one of the key requirements
for any fault tolerance
> >>> >> mechanism.
> >>> >> >> > > > Therefore
> >>> >> >> > > > >> I
> >>> >> >> > > > >> > disagree that this is better
than not having anything
> at
> >>> >> all. I
> >>> >> >> > > think
> >>> >> >> > > > it
> >>> >> >> > > > >> > will already go a long way
to have the non-iterative
> case
> >>> >> >> working
> >>> >> >> > > > >> reliably.
> >>> >> >> > > > >> >
> >>> >> >> > > > >> > And as far as I know there
are no users really
> suffering
> >>> from
> >>> >> >> this
> >>> >> >> > > at
> >>> >> >> > > > the
> >>> >> >> > > > >> > moment (in the sense that
someone has complained on
> the
> >>> >> mailing
> >>> >> >> > > list).
> >>> >> >> > > > >> >
> >>> >> >> > > > >> > Hence, I vote to postpone
this.
> >>> >> >> > > > >> >
> >>> >> >> > > > >> > – Ufuk
> >>> >> >> > > > >> >
> >>> >> >> > > > >> > On 10 Jun 2015, at 00:19,
Gyula Fóra <
> gyfora@apache.org>
> >>> >> wrote:
> >>> >> >> > > > >> >
> >>> >> >> > > > >> > > Hey all,
> >>> >> >> > > > >> > >
> >>> >> >> > > > >> > > It is currently impossible
to enable state
> >>> checkpointing
> >>> >> for
> >>> >> >> > > > iterative
> >>> >> >> > > > >> > > jobs, because en exception
is thrown when creating
> the
> >>> >> >> jobgraph.
> >>> >> >> > > > This
> >>> >> >> > > > >> > > behaviour is motivated
by the lack of precise
> >>> guarantees
> >>> >> that
> >>> >> >> we
> >>> >> >> > > can
> >>> >> >> > > > >> give
> >>> >> >> > > > >> > > with the current fault-tolerance
implementations for
> >>> cyclic
> >>> >> >> > > graphs.
> >>> >> >> > > > >> > >
> >>> >> >> > > > >> > > This PR <https://github.com/apache/flink/pull/812>
> >>> adds an
> >>> >> >> > > optional
> >>> >> >> > > > >> > flag to
> >>> >> >> > > > >> > > force checkpoints even
in case of iterations. The
> >>> algorithm
> >>> >> >> will
> >>> >> >> > > > take
> >>> >> >> > > > >> > > checkpoints periodically
as before, but records in
> >>> transit
> >>> >> >> > inside
> >>> >> >> > > > the
> >>> >> >> > > > >> > loop
> >>> >> >> > > > >> > > will be lost.
> >>> >> >> > > > >> > >
> >>> >> >> > > > >> > > However even this guarantee
is enough for most
> >>> applications
> >>> >> >> > > (Machine
> >>> >> >> > > > >> > > Learning for instance)
and certainly much better
> than
> >>> not
> >>> >> >> having
> >>> >> >> > > > >> anything
> >>> >> >> > > > >> > > at all.
> >>> >> >> > > > >> > >
> >>> >> >> > > > >> > >
> >>> >> >> > > > >> > > I suggest we add this
to the 0.9 release as
> currently
> >>> many
> >>> >> >> > > > applications
> >>> >> >> > > > >> > > suffer from this limitation
(SAMOA, ML pipelines,
> graph
> >>> >> >> > streaming
> >>> >> >> > > > etc.)
> >>> >> >> > > > >> > >
> >>> >> >> > > > >> > >
> >>> >> >> > > > >> > > Cheers,
> >>> >> >> > > > >> > >
> >>> >> >> > > > >> > > Gyula
> >>> >> >> > > > >> >
> >>> >> >> > > > >> >
> >>> >> >> > > > >>
> >>> >> >> > > >
> >>> >> >> > >
> >>> >> >> >
> >>> >> >>
> >>> >>
> >>>
> >>
>

Mime
  • Unnamed multipart/alternative (inline, None, 0 bytes)
View raw message