flink-issues mailing list archives

Site index · List index
Message view « Date » · « Thread »
Top « Date » · « Thread »
From "Vasia Kalavri (JIRA)" <j...@apache.org>
Subject [jira] [Commented] (FLINK-2548) In a VertexCentricIteration, the run time of one iteration should be proportional to the size of the workset
Date Sat, 22 Aug 2015 08:34:45 GMT

    [ https://issues.apache.org/jira/browse/FLINK-2548?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14707946#comment-14707946

Vasia Kalavri commented on FLINK-2548:

[~ggevay], an approach similar to what you are describing is implemented by the GatherSumApplyIteration,
i.e. join + reduce + join.
I have also found out in my experiments that this implementation is faster than the vertex-centric
in many cases, but also might be slower, depending on the dataset and algorithm. That's why
we support both models and let the user decide which one to use based on their needs.
I don't think changing the vertex-centric implementation to switch between coGroup and join
is a good idea.
I think that the two supported iteration models in Gelly cover the majority of common use-cases.
For more custom solutions one can always write their own delta iteration.

> In a VertexCentricIteration, the run time of one iteration should be proportional to
the size of the workset
> ------------------------------------------------------------------------------------------------------------
>                 Key: FLINK-2548
>                 URL: https://issues.apache.org/jira/browse/FLINK-2548
>             Project: Flink
>          Issue Type: Improvement
>          Components: Gelly
>    Affects Versions: 0.9, 0.10
>            Reporter: Gabor Gevay
>            Assignee: Gabor Gevay
> Currently, the performance of vertex centric iteration is suboptimal in those iterations
where the workset is small, because the complexity of one iteration contains the number of
edges and vertices of the graph because of coGroups:
> VertexCentricIteration.buildMessagingFunction does a coGroup between the edges and the
workset, to get the neighbors to the messaging UDF. This is problematic from a performance
point of view, because the coGroup UDF gets called on all the edge groups, including those
that are not getting any messages.
> An analogous problem is present in VertexCentricIteration.createResultSimpleVertex at
the creation of the updates: a coGroup happens between the messages and the solution set,
which has the number of vertices of the graph included in its complexity.
> Both of these coGroups could be avoided by doing a join instead (with the same keys that
the coGroup uses), and then a groupBy. The complexity of these operations would be dominated
by the size of the workset, as opposed to the number of edges or vertices of the graph. The
joins should have the edges and the solution set at the build side to achieve this complexity.
(They will not be rebuilt at every iteration.)
> I made some experiments with this, and the initial results seem promising. On some workloads,
this achieves a 2 times speedup, because later iterations often have quite small worksets,
and these get a huge speedup from this.

This message was sent by Atlassian JIRA

View raw message