giraph-user mailing list archives

Site index · List index
Message view « Date » · « Thread »
Top « Date » · « Thread »
From Avery Ching <>
Subject Re: Primitives vs Objects (the Movie!)
Date Wed, 07 Sep 2011 01:02:58 GMT
On 9/6/11 3:27 PM, Jake Mannix wrote:
> (changing thread title to reflect current discussion topic)
> On Tue, Sep 6, 2011 at 8:49 AM, Avery Ching < 
> <>> wrote:
>     Answers are inlined.  No vacation for you this weekend I guess  =).
> It was a good "vacation" :)
>>       Which JIRAs?
> - Balancing memory
>     consumption among workers
> - Communication
>     threads consuming too much memory
> GIRAPH-12 is interesting.  That communication could also be possibly 
> handled by something like Finagle 
> <> ("A fault tolerant, 
> protocol-agnostic RPC system" based on Netty [which I see is already 
> under consideration], written in scala, but with very mature java 
> bindings too).  We use it internally at Twitter for clusters of 
> mid-tier servers which have many dozens of machines talking to 
> hundreds of other machines, without blowing up on thread-stack or 
> using a gazillion threads.  It's mavenized, so it's easy to try out.
We should definitely look at it, thanks for the suggestion.  I've added 
it to the JIRA.
> In the current state, I'd definitely be worried about running with 
> hundreds of mappers, given the number of threads which would be used...
I agree, although I still have gotten up to 500 workers to run 
simultaneously with testing and a reduced thread stack size.  Can't wait 
to see how big we can go with this issue resolved.

>>         I was able to run the
>>         org.apache.giraph.benchmark.PageRankBenchmark with 300
>>         workers and 1 billion vertices with a single edge and 20
>>         supersteps.  Here's the parameters I used for our configuration:
>>     1B vertices with just *one* edge?  What kind of graph would this be??
>     Very lame, I agree.  Just for numbers =).  Once some of the memory
>     issues are worked out, we'll test more connected graphs.
> Yeah, one edge is pretty silly.  To get some real numbers, I should 
> try it out with a more realistic (power-law distributed) bit of 
> synthetic data.


>>     That JIRA ticket seems to be talking about sorted inputs - is
>>     this a requirement (that vertex ids are sorted in each split), or
>>     does this just make some things run much better?  What does this
>>     have to do with balanced input splits?
>     Yes, currently inputs must be sorted (requirement) until GIRAPH-12
>     is finished.  Balancing input splits (memory consumed per input
>     split) will help keep the amount of memory similar on all workers
>     and assuming a homogenous worker set, this will allow for larger
>     data sets to be fit into memory.
> I see, since you're effectively limited by the size of the biggest 
> split in this case, if you're not careful.
>>     I guess I'm not sure whether you *need* to give up the OO API,
>>     it's really nice to have completely strongly typed graph
>>     primitives.  I'll just see if I can implement the same API using
>>     internal data structures which are nice and compact (as a
>>     performance improvement only) which in the case where all of the
>>     type parameters are primitive-able.  So for example, something
>>     like PrimitiveVertex<LongWritable, DoubleWritable, FloatWritable,
>>     DoubleWritable> implements MutableVertex<LongWritable,
>>     DoubleWritable, FloatWritable, DoubleWritable>.  Still
>>     API-compatible, but internally takes advantage of the fact that
>>     all types are wrapped primitives.
>     Interesting idea, would like to see the results of this experiment.
> So I've started coding it up, and I ran into an API stumbling block, 
> which might be avoidable:
> in the interface BasicVertex:
>     SortedMap<I, Edge<I, E>> getOutEdgeMap();
> Where is this needed, in this form?
Lots of questions here, I'll try my best. =)  Basically the idea is that 
as a sorted map, the user would be able to know that the edges are 
sorted by their destination ids.  Range based edge queries are possible 
(i.e. the edges with destinations from to 
com.zillow.www).  Basically this would give the users a bit more 
functionality than a basic map.  A list would require a full scan to 
find/remove an edge.
>  The only calls I see to this method are of the form 
> "getOutEdgeMap().size()", and
> "for(Edge<> edge : getOutEdgeMap().values()) {...}".
> What exactly is the intended use of this map?  It's a map from target 
> vertex id to that exact vertex and it's edge value.
> Why is it not just a List<Edge<I,E>>?
> Because in theory you could be modifying the structure of the graph on 
> the fly,
> and you don't want to walk around on that list?  Then why not just 
> Map<I, E> ?  Why is the vertex type need to be specified
> twice, and why does the map need to be sorted?
As far as why not (Sorted)Map<I, E>, I believe we actually used to have 
that interface and I changed it use the Edge class.  This was primarily 
done since Edge objects would be used for edge requests (i.e. void 
addEdgeRequest(I sourceVertexId, Edge<I, E> edge) stores the add edge 
request with the Edge object in a list.  I think I also changed the user 
interfaces to use the Edge object since I initially thought it might 
make usage it a little clearer for users (i.e. edge.getDestVertexid() 
and some of the serialization a little simpler, but looking at it now, 
that might not be the case.  We can probably go back to Map<I, E> or 
SortedMap<I, E> to save some memory and internally use the Edge object 
to store the add edge requests.
> For my present case, I can probably hack around it by making a virtual
> SortedMap<LongWritable, Edge<LongWritable,FloatWritable>> which 
> implements that interface, yet is backed by a
> primitive map, but I'm trying to understand what the API is trying to 
> support...
Hope my explanation helped somewhat.
>   -jake

View raw message