flink-dev mailing list archives

From Stephan Ewen <se...@apache.org>
Subject Re: Coarse-grained FT implementation
Date Mon, 10 Nov 2014 15:56:19 GMT
Hey everyone!

Sorry for the late reply to this question.

The short answer is: Our fault tolerance is very comparable to Spark's RDD
lineage. We internally build the computation graph of the operators (we
call it JobGraph / ExecutionGraph) which we use both for execution and
re-execution in case of failures. The subgraph rooted at each operator can
be thought of as the lineage of the result computed by that operator.
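
To make the lineage idea concrete, here is a minimal Python sketch. It is a
toy model, not Flink code: the operator names and the dict layout are made up
for illustration. It computes the subgraph rooted at an operator, i.e. the
operator itself plus everything it transitively consumes.

```python
# Toy model only: operator names and the dict layout are hypothetical,
# not Flink's JobGraph API.

def lineage(inputs, operator):
    """Return the operator plus every operator it transitively depends on."""
    seen = set()
    stack = [operator]
    while stack:
        op = stack.pop()
        if op in seen:
            continue
        seen.add(op)
        # Walk upstream to the operators whose results this one consumes.
        stack.extend(inputs.get(op, []))
    return seen


# Each operator maps to the operators whose results it consumes.
inputs = {
    "source_a": [],
    "source_b": [],
    "map": ["source_a"],
    "join": ["map", "source_b"],
    "reduce": ["join"],
    "other_map": ["source_b"],
}

# Everything that would have to be re-executed to rebuild the join result
# from the base data.
join_lineage = lineage(inputs, "join")
```

In this sketch, rebuilding the result of "join" from scratch touches "join",
"map" and both sources, but not "reduce" or "other_map".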


The longer answer (with a few more details):

 - An operator is a data source, a function (map/join/reduce/...) or a
built-in operation (aggregate, iteration controller, ...).

 - The JobGraph is a compact version of the program that describes which
operators produce which intermediate results and which ones consume them.

 - The ExecutionGraph is the parallelized version of that graph; it
contains an ExecutionVertex for each parallel instance of an operator. The
ExecutionVertex tracks the state of that parallel instance of the operator.

 - The ExecutionVertex can have multiple ExecutionAttempts. If everything
works fine, there is only one attempt, but attempts can be canceled and new
attempts can be deployed. An execution attempt may trigger other
ExecutionAttempts, if predecessors need to be recomputed.
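
The relationship between vertices and attempts can be sketched as follows.
This is a simplified Python model under stated assumptions, not the actual
Flink classes: the state names, the result_available flag and the
parallelize/run/fail helpers are hypothetical, and "a predecessor needs to be
recomputed" is modeled simply as "its result is no longer available".

```python
from dataclasses import dataclass, field


@dataclass
class ExecutionAttempt:
    number: int
    state: str = "RUNNING"  # hypothetical state names


@dataclass
class ExecutionVertex:
    operator: str
    subtask: int
    predecessors: list = field(default_factory=list)
    attempts: list = field(default_factory=list)
    result_available: bool = False  # assumption: tracks if output still exists

    def run(self):
        """Start a new attempt, first re-running predecessors if needed."""
        for pred in self.predecessors:
            if not pred.result_available:
                pred.run()  # one attempt triggers attempts upstream
        attempt = ExecutionAttempt(number=len(self.attempts) + 1)
        self.attempts.append(attempt)
        attempt.state = "FINISHED"
        self.result_available = True
        return attempt

    def fail(self):
        """Mark the latest attempt as failed and drop its result."""
        self.attempts[-1].state = "FAILED"
        self.result_available = False


def parallelize(operator, parallelism):
    """One ExecutionVertex per parallel instance of an operator."""
    return [ExecutionVertex(operator, i) for i in range(parallelism)]


# Usage: a source feeding a map, both with parallelism 1.
source = parallelize("source", 1)[0]
mapper = ExecutionVertex("map", 0, predecessors=[source])
mapper.run()                      # normal case: one attempt per vertex
mapper.fail()                     # the map attempt fails ...
source.result_available = False   # ... and the source output is gone too
mapper.run()                      # the new attempt also re-runs the source
```

After the second run, both vertices in this sketch carry two attempts: the
new map attempt had to trigger a new source attempt because the source result
was no longer there.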

Stephan




On Fri, Nov 7, 2014 at 7:35 PM, Henry Saputra <henry.saputra@gmail.com>
wrote:

> Hi Kostas,
>
> Thanks for the reply. Yep, you were right, it is in the current master already.
> But as Marton has mentioned before, I believe there was no
> documentation on how it is supposed to work, and the git commit comment
> does not have much detail on the implementation.
>
> There were some questions at the meetup on how faults are handled during
> workflow execution, mostly comparing it to Spark's RDD lineage
> recomputation.
>
> - Henry
>
> On Fri, Nov 7, 2014 at 10:15 AM, Kostas Tzoumas <ktzoumas@apache.org>
> wrote:
> > Hi Henry,
> >
> > Afaik this is already in the current master, see
> > ExecutionGraph.restart().
> >
> > The goal is now to make fault tolerance more fine grained by restarting
> > from checkpointed intermediate data sets, not from the base data.
> >
> > Kostas
> >
> > On Fri, Nov 7, 2014 at 6:49 PM, Henry Saputra <henry.saputra@gmail.com>
> > wrote:
> >
> >> Stephan,
> >>
> >> Could you share your thoughts and design/plan for implementing this new
> >> coarse-grained fault tolerance?
> >>
> >> At the last talk in Palo Alto there seemed to be some interest in it.
> >>
> >> - Henry
> >>
> >> On Tue, Nov 4, 2014 at 12:14 AM, Márton Balassi
> >> <balassi.marton@gmail.com> wrote:
> >> > Stephan,
> >> >
> >> > Could you please summarize how the new coarse grained FT works? [1]
> >> >
> >> > I'm sure that we'll be facing this question a lot. :)
> >> >
> >> > Thanks,
> >> >
> >> > Marton
> >> >
> >> > [1]
> >> > https://git-wip-us.apache.org/repos/asf?p=incubator-flink.git;a=commit;h=dd687bc6729d9539e05db9761e22a2aadc707341
>
