I would love to get some feedback from the guys at data Artisans about this
one.
So far, the comments originated and spread in the Stockholm area :)
On Tue, Aug 11, 2015 at 6:33 PM, Andra Lungu <lungu.andra@gmail.com> wrote:
> Hi Samia,
>
> A good method to statistically determine skewed vertices was beyond the
> purpose of my thesis. Unfortunately, the statistical methods that fit a
> power law distribution don't do a good job. So what I do is that I plot the
> degree distribution and then visually determine the threshold. That is also
> how other state of the art systems, such as PowerLyra, fix the issue.
>
> A way to automate this would be a nice future addition :).
>
>
> On Tue, Aug 11, 2015 at 6:23 PM, Samia Khalid <samk.3211@gmail.com> wrote:
>
>> Dear Andra,
>>
>> The idea seems pretty nice.
>>
>> I wonder how you decide the threshold to separate the high degree vertices
>> from the low degree vertices.
>>
>> Regards,
>> Samia
>>
>> On Tue, Aug 11, 2015 at 3:41 PM, Andra Lungu <lungu.andra@gmail.com>
>> wrote:
>>
>> > Hi Paris,
>> >
>> > Nice to virtually meet you too :)
>> >
>> > Maybe it makes sense to share my freshest chart:
>> >
>> >
>> https://drive.google.com/file/d/0BwnaKJcSLc43Qm9fZV9RUE5zT1E/view?usp=sharing
>> >
>> > This is for the Community Detection algorithm [1] in which you basically
>> > find communities by continuously rescoring vertices. In a vertexcentric
>> > scenario (for a skewed graph, in this case, the Twitter follower
>> network),
>> > since all vertices need to sync after an iteration superstep, they will
>> > remain idle until a high degree vertex terminates (computing an
>> aggregate
>> > over my followers takes significantly less time than computing one over
>> > Obama's followers).
>> >
>> > What you have on the right is the time elapsed in the regular vertex
>> > centric Community Detection (more than two hours).
>> > And since we split the node in aggregation trees, in the left there is a
>> > chart with the various levels. Level 3 produces the best result.
>> >
>> > It's a bit cumbersome to put months of work in a mail, I am still
>> polishing
>> > the documentation, but feel free to ask questions if this is still
>> unclear.
>> >
>> >
>> > All in all, the technique is best for linear algorithms that work just
>> in
>> > Pregel, for a skewed graph. On a high level, it's a generalisation of
>> > PowerGraph's mirroring, that can work on any system, and that does not
>> > restrict itself to level = 1 (like PowerGraph does) which in the case of
>> > out chart there did not bring us too far.
>> >
>> > Hope that clarified things a bit more!
>> > Andra
>> >
>> > [1] http://arxiv.org/pdf/0808.2633.pdf
>> >
>> >
>> > On Tue, Aug 11, 2015 at 3:20 PM, Paris Carbone <parisc@kth.se> wrote:
>> >
>> > > Hi Andra and nice to meet you btw :)
>> > >
>> > > It sounds like very fancy way to deal with skew, I like the idea even
>> > > though I am not a graph analytics expert.
>> > > Have you ran any experiments or benchmarks to see when this
>> preferable ?
>> > > Users should be aware when they will get benefits by using it since
>> node
>> > > splitting doesn’t come with no cost I guess.
>> > > I am really eager to see how this will evolve, I think it’s good
>> effort.
>> > >
>> > > cheers
>> > > Paris
>> > >
>> > >
>> > > > On 11 Aug 2015, at 14:58, Andra Lungu <lungu.andra@gmail.com>
>> wrote:
>> > > >
>> > > > Hi Vasia,
>> > > >
>> > > > I shall polish the functions a bit, but this is more or less what
I
>> had
>> > > in
>> > > > mind:
>> > > > GSA Jaccard [what we have in Gelly right now]:
>> > > >
>> > >
>> >
>> https://github.com/andralungu/gellypartitioning/blob/master/src/main/java/example/GSAJaccardSimilarityMeasure.java
>> > > > The same version with node split:
>> > > >
>> > >
>> >
>> https://github.com/andralungu/gellypartitioning/blob/master/src/main/java/example/NodeSplittingGSAJaccard.java
>> > > >
>> > > > Yes, sure I can also share some charts with the results. I could say
>> > for
>> > > > which data sets (power law) and types of algorithms this can be
>> used.
>> > If
>> > > > it's used appropriately, the overhead is 0.
>> > > >
>> > > > I'm open to suggestions!
>> > > > Thanks!
>> > > > Andra
>> > > >
>> > > >
>> > > > On Tue, Aug 11, 2015 at 2:45 PM, Vasiliki Kalavri <
>> > > vasilikikalavri@gmail.com
>> > > >> wrote:
>> > > >
>> > > >> Hi Andra,
>> > > >>
>> > > >> thanks for offering to add this work to Gelly and for starting
the
>> > > >> discussion!
>> > > >>
>> > > >> How do you think this would look like from an API point of view?
>> Is it
>> > > easy
>> > > >> to make it transparent to the application? Could you give us a
>> simple
>> > > >> example of what you have in mind?
>> > > >>
>> > > >> Apart from usability, we should also consider documenting when
to
>> use
>> > > this
>> > > >> feature, i.e. for which algorithms, what kind degree distribution
>> and
>> > > what
>> > > >> overhead to expect (if any).
>> > > >>
>> > > >> Cheers,
>> > > >> Vasia.
>> > > >> On Aug 10, 2015 10:47 AM, "Andra Lungu" <lungu.andra@gmail.com>
>> > wrote:
>> > > >>
>> > > >>> Hey,
>> > > >>>
>> > > >>> Before actually opening a PR, I wanted to hear your opinion.
So,
>> here
>> > > >> goes
>> > > >>> nothing :).
>> > > >>>
>> > > >>> I'd like to add the core of my master thesis to Gelly. That
is, a
>> > > series
>> > > >> of
>> > > >>> operators that take a skewed graph, split its high degree
vertices
>> > into
>> > > >>> subvertices and redistribute the edges accordingly (thus producing
>> > the
>> > > >>> result faster due to the increased parallelism); then merge
the
>> > > >> subvertices
>> > > >>> into the initial vertex.
>> > > >>>
>> > > >>> Here's part of my unrevised abstract:
>> > > >>> "The Twitter follower graph, the citation network and general
>> purpose
>> > > >>> social networks follow a power law degree distribution. These
>> highly
>> > > >> skewed
>> > > >>> graphs raise challenges to current computation models which
>> uniformly
>> > > >>> process vertices, leading to communication overhead or large
>> > execution
>> > > >> time
>> > > >>> stalls.
>> > > >>>
>> > > >>> In spite of efforts made to scale up computation on natural
graphs
>> > > >>> (PowerGraph ’s Gather  Apply  Scatter model), many performance
>> > > problems
>> > > >>> raise from a subset of vertices that have high degrees. In
this
>> > thesis,
>> > > >> we
>> > > >>> outline the main processing issues that arise in current graph
>> > systems.
>> > > >> We
>> > > >>> then propose a novel node splitting technique meant to
>> differentiate
>> > > the
>> > > >>> computation as well as partitioning strategies for low and
high
>> > degree
>> > > >>> nodes.
>> > > >>>
>> > > >>> The “Node Splitting” abstraction is implemented as a separate
set
>> of
>> > > >>> operators that can be seamlessly combined with current
>> > degreeoblivious
>> > > >>> models such as Pregel and PowerGraph, but also with degreeaware
>> > > engines
>> > > >>> such as Pregel+ and PowerLyra. Our solution introduces minimal
>> > overhead
>> > > >> in
>> > > >>> the absence of skew and improves the naive implementation
of a
>> subset
>> > > of
>> > > >>> computational intensive algorithms by a factor of two.
>> > > >>>
>> > > >>> These results have been proven on a theoretical and practical
>> level
>> > on
>> > > >> both
>> > > >>> real world and synthetic graphs."
>> > > >>>
>> > > >>> What I would add:
>> > > >>> 1) the operators per se in a separate package
>> > > >>> 2) a slightly modified example (e.g. take Jaccard Similarity
and
>> > apply
>> > > >>> these operators on it)
>> > > >>> 3). tests
>> > > >>> 4). a separate section in the gelly docs that explains how
to
>> modify
>> > > your
>> > > >>> code to use this solution.
>> > > >>>
>> > > >>> Needless to say I will maintain the code and, if, as we get
more
>> > users,
>> > > >>> they deem this subAPI useless, we can always deprecate and
>> delete it
>> > > :).
>> > > >>>
>> > > >>> So, what do you think?
>> > > >>> Andra
>> > > >>>
>> > > >>
>> > >
>> > >
>> >
>>
>
>
