flink-dev mailing list archives

Site index · List index
Message view « Date » · « Thread »
Top « Date » · « Thread »
From "Gabor Gevay (JIRA)" <j...@apache.org>
Subject [jira] [Created] (FLINK-2548) VertexCentricIteration should avoid doing a coGroup with the edges and the solution set
Date Wed, 19 Aug 2015 16:40:46 GMT
Gabor Gevay created FLINK-2548:

             Summary: VertexCentricIteration should avoid doing a coGroup with the edges and
the solution set
                 Key: FLINK-2548
                 URL: https://issues.apache.org/jira/browse/FLINK-2548
             Project: Flink
          Issue Type: Improvement
          Components: Gelly
    Affects Versions: 0.9, 0.10
            Reporter: Gabor Gevay
            Assignee: Gabor Gevay

Currently, the performance of vertex centric iteration is suboptimal in those iterations where
the workset is small, because the complexity of one iteration contains the number of edges
and vertices of the graph because of coGroups:

VertexCentricIteration.buildMessagingFunction does a coGroup between the edges and the workset,
to get the neighbors to the messaging UDF. This is problematic from a performance point of
view, because the coGroup UDF gets called on all the edge groups, including those that are
not getting any messages.

An analogous problem is present in VertexCentricIteration.createResultSimpleVertex at the
creation of the updates: a coGroup happens between the messages and the solution set, which
has the number of vertices of the graph included in its complexity.

Both of these coGroups could be avoided by doing a join instead (with the same keys that the
coGroup uses), and then a groupBy. The complexity of these operations would be dominated by
the size of the workset, as opposed to the number of edges or vertices of the graph. The joins
should have the edges and the solution set at the build side to achieve this complexity. (They
will not be rebuilt at every iteration.)

I made some experiments with this, and the initial results seem promising. On some workloads,
this achieves a 2 times speedup, because later iterations often have quite small worksets,
and these get a huge speedup from this.

This message was sent by Atlassian JIRA

View raw message