giraph-dev mailing list archives

From "Eli Reisman (JIRA)" <j...@apache.org>
Subject [jira] [Commented] (GIRAPH-388) Improve the way we keep outgoing messages
Date Wed, 31 Oct 2012 19:10:11 GMT

    [ https://issues.apache.org/jira/browse/GIRAPH-388?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13488111#comment-13488111 ]

Eli Reisman commented on GIRAPH-388:
------------------------------------

Exactly. The idea of splitting hubs and neighborhoods addresses the same kind of problem
I was referring to above: message duplication, where a supernode belongs to a given partition
on one worker and a lot of vertices on other workers have out-edges to that supernode. The
lumpiness of social graph data makes Giraph behave very differently than it does when running
benchmarks configured to the same scale of input data.

I also agree that edge-based partitioning is a good idea for balancing social graph data.
It already came in really handy for me last summer while working on the input superstep.
That was also a flushing issue: measuring outgoing graph partition data by # of vertices
per flush rather than # of edges resulted in workers crashing when they read or were
assigned a supernode or two and tried to read/write them to the wire. An outgoing buffer with
a supernode in it (and many, many out-edges) was so much bigger than a buffer of typical-sized
vertices that it was crashing the IPC. Tuning the flushing there was critical to scaling Giraph
up under the memory constraints I was trying to meet. The GIRAPH-232 metrics with Graphite
graphs showed clearly how differently the benchmark and social data made the framework
behave as a job ran.
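
To make the vertices-vs-edges flushing point concrete, here is a minimal sketch of a
buffer that flushes on an edge budget instead of a vertex count. This is not Giraph's
actual code: the class name EdgeBudgetBuffer, the maxEdgesPerFlush knob, and the use of
an int[] to stand in for a vertex's out-edges are all assumptions made for illustration.

```java
import java.util.ArrayList;
import java.util.List;

/** Hypothetical sketch, not a Giraph class: flush by edge count, not vertex count. */
class EdgeBudgetBuffer {
    private final int maxEdgesPerFlush;                     // assumed tuning knob
    private final List<int[]> pending = new ArrayList<>();  // one int[] of out-edge targets per vertex
    private int pendingEdges = 0;
    // Records how many vertices went out in each flush (stands in for the wire).
    private final List<Integer> flushedVertexCounts = new ArrayList<>();

    EdgeBudgetBuffer(int maxEdgesPerFlush) {
        this.maxEdgesPerFlush = maxEdgesPerFlush;
    }

    void add(int[] outEdges) {
        // If this vertex would push the buffer past the budget, send what we have first,
        // so a supernode never rides along with an already part-full buffer.
        if (!pending.isEmpty() && pendingEdges + outEdges.length > maxEdgesPerFlush) {
            flush();
        }
        pending.add(outEdges);
        pendingEdges += outEdges.length;
        // A single supernode can exceed the budget on its own; it is flushed alone.
        if (pendingEdges >= maxEdgesPerFlush) {
            flush();
        }
    }

    void flush() {
        if (pending.isEmpty()) {
            return;
        }
        flushedVertexCounts.add(pending.size());
        pending.clear();
        pendingEdges = 0;
    }

    List<Integer> flushedVertexCounts() {
        return flushedVertexCounts;
    }
}
```

With a vertex-count threshold, a buffer holding one supernode can be orders of magnitude
larger than a buffer of typical vertices. With an edge budget the buffer size stays
roughly bounded, except for a lone supernode that exceeds the budget by itself, which
would still need to be split or streamed separately.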

As you said before, messaging is a different situation. If you think the flushing and/or deduplication
aren't going to help save memory per worker, I'm happy to shift focus to where the good solutions
are. If you think the deduplication issue can be addressed better another way, that sounds
good too.

I'd love to see more ideas (and more fleshed-out ideas) on the mailing list about how some
of you who know a lot about this subject but don't have a lot of time to code up an example
would attack these problems. There are a number of us who are happy to try to code up a good
idea, and not afraid to go down a blind alley with you to see if something works. Many graph
tools I've reviewed for ideas seem top-to-bottom optimized for particular uses. Giraph is
a more general framework. Are there some existing solutions you've seen out there we should
be looking at or emulating to solve some of these problems? 

                
> Improve the way we keep outgoing messages
> -----------------------------------------
>
>                 Key: GIRAPH-388
>                 URL: https://issues.apache.org/jira/browse/GIRAPH-388
>             Project: Giraph
>          Issue Type: Improvement
>            Reporter: Maja Kabiljo
>            Assignee: Maja Kabiljo
>         Attachments: GIRAPH-388.patch
>
>
> As per the discussion on GIRAPH-357, in a standard application the chances that we get to use
> a client-side combiner are very low. I experimented with the benefits we can get from not having
> the client-side combiner at all. It turns out that having a lot of maps in SendMessageCache,
> and then a collection inside each of them, really hurts performance.

--
This message is automatically generated by JIRA.
If you think it was sent incorrectly, please contact your JIRA administrators
For more information on JIRA, see: http://www.atlassian.com/software/jira
