flink-dev mailing list archives

Site index · List index
Message view « Date » · « Thread »
Top « Date » · « Thread »
From Vasiliki Kalavri <vasilikikala...@gmail.com>
Subject Re: [DISCUSS] Gelly iteration abstractions
Date Mon, 23 Feb 2015 17:05:59 GMT
I see, thanks a lot for the answers!
To rephrase my original question, would it make sense to define a
closed-loop iteration where the state is the whole graph?

If you want to take a look at the current implementation of DMST using
delta iteration, Andra has made a PR [1].
On a high-level, this algorithm does (more or less) the following:

while there are more than one vertices in the graph
   for every vertex
      select the min-weight edge
      add all selected edges to the MST
   collapse vertices with the same root into one vertex (iterative
connected components-like step)

What we are thinking is that we could express this with -maybe- a bulk-like
iteration on the Graph, inside which we can use Gelly methods.
Would it make sense or shall we go for a for-loop implementation instead?

Thanks!
-V.

[1]: http://github.com/apache/flink/pull/434

On 23 February 2015 at 16:18, Stephan Ewen <sewen@apache.org> wrote:

> Closed-loop iterations are much more efficient right now. Long for loops
> suffer from memory fragmentation (an issue that is in the list to fix).
>
> Also, closed loops can be stateful (delta iterations) and do not require
> task re-deployment.
>
> On Mon, Feb 23, 2015 at 4:15 PM, Vasiliki Kalavri <
> vasilikikalavri@gmail.com
> > wrote:
>
> > I see that's cool :-)
> > So, what is the advantage of closed-loop versus for-loop iterations?
> > Custom convergence criteria / aggregators and more efficient execution
> > plans?
> >
> > On 23 February 2015 at 15:01, Stephan Ewen <sewen@apache.org> wrote:
> >
> > > For loops are basically rolled out - they yield long execution plans.
> > >
> > > On Mon, Feb 23, 2015 at 2:44 PM, Vasiliki Kalavri <
> > > vasilikikalavri@gmail.com
> > > > wrote:
> > >
> > > > for-loop iterations could cover some cases, I guess, when the number
> of
> > > > iterations is known beforehand.
> > > > Are there currently any restrictions on what can be used inside a
> > > for-loop?
> > > > How are they translated into execution plans?
> > > >
> > > > On 23 February 2015 at 13:08, Stephan Ewen <sewen@apache.org> wrote:
> > > >
> > > > > Some things may not work well as "closed-loop" iterations.
> > > > >
> > > > > Is it possible to express those as for-loop iterations?
> > > > >
> > > > > On Mon, Feb 23, 2015 at 1:03 PM, Vasiliki Kalavri <
> > > > > vasilikikalavri@gmail.com
> > > > > > wrote:
> > > > >
> > > > > > Hi Stephan,
> > > > > >
> > > > > > yes, this would work for the cases where an algorithm only
> updates
> > > the
> > > > > > vertex values or only updates the edge values.
> > > > > >
> > > > > > What we would like to also support is
> > > > > > (a) algorithms where both vertices and edges are updated in
one
> > > > iteration
> > > > > > (b) algorithms where the graph structure changes from one
> iteration
> > > to
> > > > > the
> > > > > > next and
> > > > > > (c) branching inside an iteration, i.e. executing a different
> > > > "iteration
> > > > > > body" based on some condition.
> > > > > >
> > > > > > We can still currently implement those with regular Flink
> iteration
> > > > > > operators, but the resulting code is not that nice or efficient.
> > > > > > For example, if we want to update both edges and vertex values,
> we
> > > can
> > > > > > still create a solution set where the vertex values are attached
> to
> > > > each
> > > > > > edge.
> > > > > > Regarding branching inside an iteration, we can use an aggregator
> > > that
> > > > > > tracks the iteration phase, but then we need to somehow make
the
> > > > > different
> > > > > > phases to consist of the same operators and also check the
> > branching
> > > > > > condition inside each UDF.
> > > > > >
> > > > > > Cheers,
> > > > > > V.
> > > > > >
> > > > > >
> > > > > > On 23 February 2015 at 11:05, Stephan Ewen <sewen@apache.org>
> > wrote:
> > > > > >
> > > > > > > As a workaround, it should always work to get the Edge
and
> Vertex
> > > > data
> > > > > > set
> > > > > > > from the graph and use the regular Fink iteration operators?
> > > > > > >
> > > > > > >
> > > > > > > On Sun, Feb 22, 2015 at 4:53 PM, Vasiliki Kalavri <
> > > > > > > vasilikikalavri@gmail.com
> > > > > > > > wrote:
> > > > > > >
> > > > > > > > Hi,
> > > > > > > >
> > > > > > > > yes, I was referring to the parallel Boruvka algorithm.
There
> > are
> > > > > > several
> > > > > > > > ways to implement this one in Flink and I believe
that the
> one
> > > > > > described
> > > > > > > in
> > > > > > > > the paper (vertex-centric) is not the most elegant
one :)
> > > > > > > >
> > > > > > > > Andra is now working on an idea that uses the delta
iteration
> > > > > > abstraction
> > > > > > > > and we believe that it will be both more efficient
and easier
> > to
> > > > > > > > understand. It has the edges in the solution set and
the
> > vertices
> > > > in
> > > > > > the
> > > > > > > > workset, so it follows the pattern I describe in (2)
in my
> > > previous
> > > > > > > e-mail.
> > > > > > > > As a next step, we would like to see how having an
iteration
> > > > operator
> > > > > > > that
> > > > > > > > could update the whole graph -what I describe as (3)-
would
> > make
> > > > this
> > > > > > > even
> > > > > > > > nicer.
> > > > > > > >
> > > > > > > > Any ideas are highly welcome!
> > > > > > > >
> > > > > > > > Cheers,
> > > > > > > > V.
> > > > > > > >
> > > > > > > > On 22 February 2015 at 16:32, Andra Lungu <
> > lungu.andra@gmail.com
> > > >
> > > > > > wrote:
> > > > > > > >
> > > > > > > > > Hi Alex,
> > > > > > > > >
> > > > > > > > > Vasia is talking about the second version(presented
Friday)
> > of
> > > > > > Parallel
> > > > > > > > > Boruvka, which can be found here:
> > > > > > > > > https://github.com/TU-Berlin-DIMA/IMPRO-3.WS14/pull/59
> > > > > > > > >
> > > > > > > > > I will propose the third, non-Pregel like approach
directly
> > to
> > > > > Gelly
> > > > > > > > soon.
> > > > > > > > >
> > > > > > > > > If you have additional questions, I will be happy
to answer
> > > them.
> > > > > > > > >
> > > > > > > > > Andra
> > > > > > > > >
> > > > > > > > > On Sun, Feb 22, 2015 at 4:23 PM, Alexander Alexandrov
<
> > > > > > > > > alexander.s.alexandrov@gmail.com> wrote:
> > > > > > > > >
> > > > > > > > > > Hi Vasia,
> > > > > > > > > >
> > > > > > > > > > I am trying to look at the problem in more
detail. Which
> > > > version
> > > > > of
> > > > > > > the
> > > > > > > > > MST
> > > > > > > > > > are you talking about?
> > > > > > > > > >
> > > > > > > > > > Right now in the Gelly repository I can
only find the
> SSSP
> > > > > example
> > > > > > > > > > (parallel Bellman-Ford) from Section 4.2
in [1].
> > > > > > > > > >
> > > > > > > > > > However, it seems that the issues encountered
by Andra
> are
> > > > > related
> > > > > > to
> > > > > > > > the
> > > > > > > > > > implementation of Parallel Boruvka (Section
3.2 in [2]).
> Is
> > > > that
> > > > > > > > correct?
> > > > > > > > > >
> > > > > > > > > > Regards,
> > > > > > > > > > A.
> > > > > > > > > >
> > > > > > > > > > [1] http://www.vldb.org/pvldb/vol7/p1047-han.pdf
> > > > > > > > > > [2] http://www.vldb.org/pvldb/vol7/p577-salihoglu.pdf
> > > > > > > > > >
> > > > > > > > > > 2015-02-19 21:03 GMT+01:00 Vasiliki Kalavri
<
> > > > > > > vasilikikalavri@gmail.com
> > > > > > > > >:
> > > > > > > > > >
> > > > > > > > > > > Hello beautiful Flink people,
> > > > > > > > > > >
> > > > > > > > > > > during the past few days, Andra and
I have been
> > discussing
> > > > > about
> > > > > > > how
> > > > > > > > to
> > > > > > > > > > > extend Gelly's iteration methods.
> > > > > > > > > > >
> > > > > > > > > > > Alexander's course (and his awesome
students) has made
> it
> > > > > obvious
> > > > > > > > that
> > > > > > > > > > > vertex-centric iterations are not the
best fit for
> > > algorithms
> > > > > > which
> > > > > > > > > don't
> > > > > > > > > > > follow the common "propagate-update"
pattern. For
> > example,
> > > > > Andra
> > > > > > is
> > > > > > > > > > working
> > > > > > > > > > > on an implementation of Minimum Spanning
Tree, which
> > > requires
> > > > > > > > branching
> > > > > > > > > > > inside an iteration and also requires
a convergence
> check
> > > of
> > > > an
> > > > > > > > > internal
> > > > > > > > > > > iteration. Others also reported similar
issues [1, 2].
> > > Trying
> > > > > to
> > > > > > > fit
> > > > > > > > > such
> > > > > > > > > > > algorithms to the vertex-centric model
leads to long
> and
> > > ugly
> > > > > > code,
> > > > > > > > > e.g.
> > > > > > > > > > > aggregators to keep track of algorithm
phases,
> > duplicating
> > > > > data,
> > > > > > > etc.
> > > > > > > > > > >
> > > > > > > > > > > One limitation of the vertex-centric
and the upcoming
> GAS
> > > > model
> > > > > > is
> > > > > > > > that
> > > > > > > > > > > they both only allow the vertex values
to be updated in
> > > each
> > > > > > > > iteration.
> > > > > > > > > > > However, for some algorithms we need
to update the edge
> > > > values
> > > > > > and
> > > > > > > in
> > > > > > > > > > > others we need to update both. In even
more complex
> > > > situations
> > > > > > > (like
> > > > > > > > > > > Andra's MST) in some iterations we
need to update the
> > > vertex
> > > > > > values
> > > > > > > > and
> > > > > > > > > > in
> > > > > > > > > > > some iterations we need to update the
edge values.
> > > > > > > > > > > Another problem is that we currently
don't have a way
> to
> > > > allow
> > > > > > > > > different
> > > > > > > > > > > computational phases inside an iteration.
This is
> > something
> > > > > that
> > > > > > > > Giraph
> > > > > > > > > > > solves with master compute, a function
that is executed
> > > once
> > > > > > before
> > > > > > > > > each
> > > > > > > > > > > superstep and sets the computation
function.
> > > > > > > > > > >
> > > > > > > > > > > All that said, I believe that we can
solve most of
> these
> > > > issues
> > > > > > if
> > > > > > > we
> > > > > > > > > > > nicely expose Flink's iteration operators
in Gelly. I
> can
> > > see
> > > > > the
> > > > > > > > > > following
> > > > > > > > > > > cases:
> > > > > > > > > > >
> > > > > > > > > > > 1. Bulk & delta iterations where
the solution set is
> the
> > > > vertex
> > > > > > > > > dataset:
> > > > > > > > > > > this will be similar to vertex-centric
and GAS, but
> will
> > > > allow
> > > > > > more
> > > > > > > > > > > flexible dataflows inside the iteration.
> > > > > > > > > > > 2. Bulk & delta iterations where
the solution set is
> the
> > > edge
> > > > > > > > dataset:
> > > > > > > > > > for
> > > > > > > > > > > the cases where we need to update edge
values.
> > > > > > > > > > > 3. Bulk & delta iterations where
the solution set is
> the
> > > > Graph:
> > > > > > > this
> > > > > > > > > will
> > > > > > > > > > > cover more complex cases, where the
algorithm updates
> > both
> > > > > > vertices
> > > > > > > > and
> > > > > > > > > > > edges or even adds/removes vertices/edges,
i.e. updates
> > the
> > > > > whole
> > > > > > > > > Graph.
> > > > > > > > > > >
> > > > > > > > > > > What do you think? I can see 1 &
2 being very easy to
> > > > > implement,
> > > > > > > but
> > > > > > > > I
> > > > > > > > > > > suspect 3 won't be that easy (but so
awesome to have
> ^^).
> > > > > > > > > > > Would it work the way a Graph is represented
now, i.e.
> > > with 2
> > > > > > > > DataSets?
> > > > > > > > > > >
> > > > > > > > > > > Any comment, idea, pointer would be
much appreciated!
> > Thank
> > > > you
> > > > > > ^^
> > > > > > > > > > >
> > > > > > > > > > > Cheers,
> > > > > > > > > > > -V.
> > > > > > > > > > >
> > > > > > > > > > > [1]:
> > > > > > > > > > >
> > > > > > > > > > >
> > > > > > > > > >
> > > > > > > > >
> > > > > > > >
> > > > > > >
> > > > > >
> > > > >
> > > >
> > >
> >
> http://apache-flink-incubator-user-mailing-list-archive.2336050.n4.nabble.com/Can-a-master-class-control-the-superstep-in-Flink-Spargel-td733.html
> > > > > > > > > > > [2]:
> > > > > > > > > > >
> > > > > > > > > > >
> > > > > > > > > >
> > > > > > > > >
> > > > > > > >
> > > > > > >
> > > > > >
> > > > >
> > > >
> > >
> >
> http://issues.apache.org/jira/browse/FLINK-1552?focusedCommentId=14325769&page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel#comment-14325769
> > > > > > > > > > >
> > > > > > > > > >
> > > > > > > > >
> > > > > > > >
> > > > > > >
> > > > > >
> > > > >
> > > >
> > >
> >
>

Mime
  • Unnamed multipart/alternative (inline, None, 0 bytes)
View raw message