flink-dev mailing list archives

Site index · List index
Message view « Date » · « Thread »
Top « Date » · « Thread »
From Andra Lungu <lungu.an...@gmail.com>
Subject Re: [Proposal] Addition to Gelly
Date Tue, 11 Aug 2015 16:33:10 GMT
Hi Samia,

A good method to statistically determine skewed vertices was beyond the
purpose of my thesis. Unfortunately, the statistical methods that fit a
power law distribution don't do a good job. So what I do is that I plot the
degree distribution and then visually determine the threshold. That is also
how other state of the art systems, such as PowerLyra, fix the issue.

A way to automate this would be a nice future addition :).


On Tue, Aug 11, 2015 at 6:23 PM, Samia Khalid <samk.3211@gmail.com> wrote:

> Dear Andra,
>
> The idea seems pretty nice.
>
> I wonder how you decide the threshold to separate the high degree vertices
> from the low degree vertices.
>
> Regards,
> Samia
>
> On Tue, Aug 11, 2015 at 3:41 PM, Andra Lungu <lungu.andra@gmail.com>
> wrote:
>
> > Hi Paris,
> >
> > Nice to virtually meet you too :)
> >
> > Maybe it makes sense to share my freshest chart:
> >
> >
> https://drive.google.com/file/d/0BwnaKJcSLc43Qm9fZV9RUE5zT1E/view?usp=sharing
> >
> > This is for the Community Detection algorithm [1] in which you basically
> > find communities by continuously rescoring vertices. In a vertex-centric
> > scenario (for a skewed graph, in this case, the Twitter follower
> network),
> > since all vertices need to sync after an iteration superstep, they will
> > remain idle until a high degree vertex terminates (computing an aggregate
> > over my followers takes significantly less time than computing one over
> > Obama's followers).
> >
> > What you have on the right is the time elapsed in the regular vertex
> > centric Community Detection (more than two hours).
> > And since we split the node in aggregation trees, in the left there is a
> > chart with the various levels. Level 3 produces the best result.
> >
> > It's a bit cumbersome to put months of work in a mail, I am still
> polishing
> > the documentation, but feel free to ask questions if this is still
> unclear.
> >
> >
> > All in all, the technique is best for linear algorithms that work just in
> > Pregel, for a skewed graph. On a high level, it's a generalisation of
> > PowerGraph's mirroring, that can work on any system, and that does not
> > restrict itself to level = 1 (like PowerGraph does) which in the case of
> > out chart there did not bring us too far.
> >
> > Hope that clarified things a bit more!
> > Andra
> >
> > [1] http://arxiv.org/pdf/0808.2633.pdf
> >
> >
> > On Tue, Aug 11, 2015 at 3:20 PM, Paris Carbone <parisc@kth.se> wrote:
> >
> > > Hi Andra and nice to meet you btw :)
> > >
> > > It sounds like very fancy way to deal with skew, I like the idea even
> > > though I am not a graph analytics expert.
> > > Have you ran any experiments or benchmarks to see when this preferable
> ?
> > > Users should be aware when they will get benefits by using it since
> node
> > > splitting doesn’t come with no cost I guess.
> > > I am really eager to see how this will evolve, I think it’s good
> effort.
> > >
> > > cheers
> > > Paris
> > >
> > >
> > > > On 11 Aug 2015, at 14:58, Andra Lungu <lungu.andra@gmail.com> wrote:
> > > >
> > > > Hi Vasia,
> > > >
> > > > I shall polish the functions a bit, but this is more or less what I
> had
> > > in
> > > > mind:
> > > > GSA Jaccard [what we have in Gelly right now]:
> > > >
> > >
> >
> https://github.com/andralungu/gelly-partitioning/blob/master/src/main/java/example/GSAJaccardSimilarityMeasure.java
> > > > The same version with node split:
> > > >
> > >
> >
> https://github.com/andralungu/gelly-partitioning/blob/master/src/main/java/example/NodeSplittingGSAJaccard.java
> > > >
> > > > Yes, sure I can also share some charts with the results. I could say
> > for
> > > > which data sets (power law) and types of algorithms this can be used.
> > If
> > > > it's used appropriately, the overhead is 0.
> > > >
> > > > I'm open to suggestions!
> > > > Thanks!
> > > > Andra
> > > >
> > > >
> > > > On Tue, Aug 11, 2015 at 2:45 PM, Vasiliki Kalavri <
> > > vasilikikalavri@gmail.com
> > > >> wrote:
> > > >
> > > >> Hi Andra,
> > > >>
> > > >> thanks for offering to add this work to Gelly and for starting the
> > > >> discussion!
> > > >>
> > > >> How do you think this would look like from an API point of view? Is
> it
> > > easy
> > > >> to make it transparent to the application? Could you give us a
> simple
> > > >> example of what you have in mind?
> > > >>
> > > >> Apart from usability, we should also consider documenting when to
> use
> > > this
> > > >> feature, i.e. for which algorithms, what kind degree distribution
> and
> > > what
> > > >> overhead to expect (if any).
> > > >>
> > > >> Cheers,
> > > >> Vasia.
> > > >> On Aug 10, 2015 10:47 AM, "Andra Lungu" <lungu.andra@gmail.com>
> > wrote:
> > > >>
> > > >>> Hey,
> > > >>>
> > > >>> Before actually opening a PR, I wanted to hear your opinion. So,
> here
> > > >> goes
> > > >>> nothing :).
> > > >>>
> > > >>> I'd like to add the core of my master thesis to Gelly. That is,
a
> > > series
> > > >> of
> > > >>> operators that take a skewed graph, split its high degree vertices
> > into
> > > >>> subvertices and redistribute the edges accordingly (thus producing
> > the
> > > >>> result faster due to the increased parallelism); then merge the
> > > >> subvertices
> > > >>> into the initial vertex.
> > > >>>
> > > >>> Here's part of my -unrevised- abstract:
> > > >>> "The Twitter follower graph, the citation network and general
> purpose
> > > >>> social networks follow a power law degree distribution. These
> highly
> > > >> skewed
> > > >>> graphs raise challenges to current computation models which
> uniformly
> > > >>> process vertices, leading to communication overhead or large
> > execution
> > > >> time
> > > >>> stalls.
> > > >>>
> > > >>> In spite of efforts made to scale up computation on natural graphs
> > > >>> (PowerGraph ’s Gather - Apply - Scatter model), many performance
> > > problems
> > > >>> raise from a subset of vertices that have high degrees. In this
> > thesis,
> > > >> we
> > > >>> outline the main processing issues that arise in current graph
> > systems.
> > > >> We
> > > >>> then propose a novel node splitting technique meant to
> differentiate
> > > the
> > > >>> computation as well as partitioning strategies for low and high
> > degree
> > > >>> nodes.
> > > >>>
> > > >>> The “Node Splitting” abstraction is implemented as a separate
set
> of
> > > >>> operators that can be seamlessly combined with current
> > degree-oblivious
> > > >>> models such as Pregel and PowerGraph, but also with degree-aware
> > > engines
> > > >>> such as Pregel+ and PowerLyra. Our solution introduces minimal
> > overhead
> > > >> in
> > > >>> the absence of skew and improves the naive implementation of a
> subset
> > > of
> > > >>> computational intensive algorithms by a factor of two.
> > > >>>
> > > >>> These results have been proven on a theoretical and practical
level
> > on
> > > >> both
> > > >>> real world and synthetic graphs."
> > > >>>
> > > >>> What I would add:
> > > >>> 1) the operators per se in a separate package
> > > >>> 2) a slightly modified example (e.g. take Jaccard Similarity and
> > apply
> > > >>> these operators on it)
> > > >>> 3). tests
> > > >>> 4). a separate section in the gelly docs that explains how to
> modify
> > > your
> > > >>> code to use this solution.
> > > >>>
> > > >>> Needless to say I will maintain the code and, if, as we get more
> > users,
> > > >>> they deem this sub-API useless, we can always deprecate and delete
> it
> > > :).
> > > >>>
> > > >>> So, what do you think?
> > > >>> Andra
> > > >>>
> > > >>
> > >
> > >
> >
>

Mime
  • Unnamed multipart/alternative (inline, None, 0 bytes)
View raw message