flink-dev mailing list archives

Site index · List index
Message view « Date » · « Thread »
Top « Date » · « Thread »
From Andra Lungu <lungu.an...@gmail.com>
Subject Re: [Proposal] Addition to Gelly
Date Tue, 11 Aug 2015 13:41:38 GMT
Hi Paris,

Nice to virtually meet you too :)

Maybe it makes sense to share my freshest chart:
https://drive.google.com/file/d/0BwnaKJcSLc43Qm9fZV9RUE5zT1E/view?usp=sharing

This is for the Community Detection algorithm [1] in which you basically
find communities by continuously rescoring vertices. In a vertex-centric
scenario (for a skewed graph, in this case, the Twitter follower network),
since all vertices need to sync after an iteration superstep, they will
remain idle until a high degree vertex terminates (computing an aggregate
over my followers takes significantly less time than computing one over
Obama's followers).

What you have on the right is the time elapsed in the regular vertex
centric Community Detection (more than two hours).
And since we split the node in aggregation trees, in the left there is a
chart with the various levels. Level 3 produces the best result.

It's a bit cumbersome to put months of work in a mail, I am still polishing
the documentation, but feel free to ask questions if this is still unclear.


All in all, the technique is best for linear algorithms that work just in
Pregel, for a skewed graph. On a high level, it's a generalisation of
PowerGraph's mirroring, that can work on any system, and that does not
restrict itself to level = 1 (like PowerGraph does) which in the case of
out chart there did not bring us too far.

Hope that clarified things a bit more!
Andra

[1] http://arxiv.org/pdf/0808.2633.pdf


On Tue, Aug 11, 2015 at 3:20 PM, Paris Carbone <parisc@kth.se> wrote:

> Hi Andra and nice to meet you btw :)
>
> It sounds like very fancy way to deal with skew, I like the idea even
> though I am not a graph analytics expert.
> Have you ran any experiments or benchmarks to see when this preferable ?
> Users should be aware when they will get benefits by using it since node
> splitting doesn’t come with no cost I guess.
> I am really eager to see how this will evolve, I think it’s good effort.
>
> cheers
> Paris
>
>
> > On 11 Aug 2015, at 14:58, Andra Lungu <lungu.andra@gmail.com> wrote:
> >
> > Hi Vasia,
> >
> > I shall polish the functions a bit, but this is more or less what I had
> in
> > mind:
> > GSA Jaccard [what we have in Gelly right now]:
> >
> https://github.com/andralungu/gelly-partitioning/blob/master/src/main/java/example/GSAJaccardSimilarityMeasure.java
> > The same version with node split:
> >
> https://github.com/andralungu/gelly-partitioning/blob/master/src/main/java/example/NodeSplittingGSAJaccard.java
> >
> > Yes, sure I can also share some charts with the results. I could say for
> > which data sets (power law) and types of algorithms this can be used. If
> > it's used appropriately, the overhead is 0.
> >
> > I'm open to suggestions!
> > Thanks!
> > Andra
> >
> >
> > On Tue, Aug 11, 2015 at 2:45 PM, Vasiliki Kalavri <
> vasilikikalavri@gmail.com
> >> wrote:
> >
> >> Hi Andra,
> >>
> >> thanks for offering to add this work to Gelly and for starting the
> >> discussion!
> >>
> >> How do you think this would look like from an API point of view? Is it
> easy
> >> to make it transparent to the application? Could you give us a simple
> >> example of what you have in mind?
> >>
> >> Apart from usability, we should also consider documenting when to use
> this
> >> feature, i.e. for which algorithms, what kind degree distribution and
> what
> >> overhead to expect (if any).
> >>
> >> Cheers,
> >> Vasia.
> >> On Aug 10, 2015 10:47 AM, "Andra Lungu" <lungu.andra@gmail.com> wrote:
> >>
> >>> Hey,
> >>>
> >>> Before actually opening a PR, I wanted to hear your opinion. So, here
> >> goes
> >>> nothing :).
> >>>
> >>> I'd like to add the core of my master thesis to Gelly. That is, a
> series
> >> of
> >>> operators that take a skewed graph, split its high degree vertices into
> >>> subvertices and redistribute the edges accordingly (thus producing the
> >>> result faster due to the increased parallelism); then merge the
> >> subvertices
> >>> into the initial vertex.
> >>>
> >>> Here's part of my -unrevised- abstract:
> >>> "The Twitter follower graph, the citation network and general purpose
> >>> social networks follow a power law degree distribution. These highly
> >> skewed
> >>> graphs raise challenges to current computation models which uniformly
> >>> process vertices, leading to communication overhead or large execution
> >> time
> >>> stalls.
> >>>
> >>> In spite of efforts made to scale up computation on natural graphs
> >>> (PowerGraph ’s Gather - Apply - Scatter model), many performance
> problems
> >>> raise from a subset of vertices that have high degrees. In this thesis,
> >> we
> >>> outline the main processing issues that arise in current graph systems.
> >> We
> >>> then propose a novel node splitting technique meant to differentiate
> the
> >>> computation as well as partitioning strategies for low and high degree
> >>> nodes.
> >>>
> >>> The “Node Splitting” abstraction is implemented as a separate set of
> >>> operators that can be seamlessly combined with current degree-oblivious
> >>> models such as Pregel and PowerGraph, but also with degree-aware
> engines
> >>> such as Pregel+ and PowerLyra. Our solution introduces minimal overhead
> >> in
> >>> the absence of skew and improves the naive implementation of a subset
> of
> >>> computational intensive algorithms by a factor of two.
> >>>
> >>> These results have been proven on a theoretical and practical level on
> >> both
> >>> real world and synthetic graphs."
> >>>
> >>> What I would add:
> >>> 1) the operators per se in a separate package
> >>> 2) a slightly modified example (e.g. take Jaccard Similarity and apply
> >>> these operators on it)
> >>> 3). tests
> >>> 4). a separate section in the gelly docs that explains how to modify
> your
> >>> code to use this solution.
> >>>
> >>> Needless to say I will maintain the code and, if, as we get more users,
> >>> they deem this sub-API useless, we can always deprecate and delete it
> :).
> >>>
> >>> So, what do you think?
> >>> Andra
> >>>
> >>
>
>

Mime
  • Unnamed multipart/alternative (inline, None, 0 bytes)
View raw message