giraph-user mailing list archives

Site index · List index
Message view « Date » · « Thread »
Top « Date » · « Thread »
From Avery Ching <ach...@apache.org>
Subject Re: Primitives vs Objects (the Movie!)
Date Wed, 07 Sep 2011 21:26:51 GMT
Haha, this really is turning into a movie =).  I'll start warming up the 
popcorn.

On 9/7/11 12:51 PM, Jake Mannix wrote:
>
>
> On Wed, Sep 7, 2011 at 6:33 AM, Avery Ching <aching@apache.org 
> <mailto:aching@apache.org>> wrote:
>
>>       I'm not sure that this is precisely the right API, but exposing
>>     the inner SortedMap definitely has a "leaky abstraction" smell to
>>     it to me, especially where there are no examples or algorithm
>>     implementations in the codebase which currently *need* this to be
>>     exposed in such a way.
>     We could expose just the Map and provide helper methods as you
>     suggest.  It's a thought.  You're right that it's not used in this
>     way in any examples.
>
>
> Maybe a few more examples would help?  Cases where you want to do a 
> BSP computation where the total sort (both the vertexes, and the edges 
> for each vertex) is required, as is the random access nature of the Map?
I think the range based examples are the ones that immediately come to 
mind.  BSP for graph processing is still pretty new, and I have no idea 
what kind of interesting algorithms will be tried out on this platform.  
We are still exploring many possible algorithms to run.

>>     I can see how the Edge object is nice, it does seem like a better
>>     api then basically just a Map<I,E>.  Why not a sorted
>>     List<Edge<I,E>>, then, where the Comparator ignores the edge
>>     value, but only sorts by the dest vertexId?  Because the contract
>>     of List doesn't *require* the sort order to remain fixed, I
>>     guess?  This yet again underscores for me the "spidey-sense"
>>     wrongness of exposing the SortedMap exactly as it is.  Maybe just
>>     expose the methods on a map that are expected to be needed, and
>>     methods can be added as time goes on?
>     I think Map over List makes sense for the edges to find them
>     quickly (my personal opinion).
>
>
> Yeah, if there are cases where you're doing searches for paths, this 
> is probably true, some form of random access is probably best.
>
> What about something as simple as:
>
>   /*
>    * @return an immutable, sorted view of the target vertices
>    */
>   public ImmutableList<I> getTargetVertices();
>
>   /*
>    * @return E edge for the supplied vertex (or null if there isn't an 
> edge to this vertex from the current one)
>    */
>   public <E> Edge<I,E> getEdge(I vertex);
>
> in conjunction with the already existing
>
>   public boolean addEdge(Edge<I,E> edge);
>
> you have all the actions of a sorted, random-access map, but without 
> exposing the way it's stored internally.  Why is there addEdge() but 
> not removeEdge()?
I think the idea was that after returning the map, users could directly 
manipulate the map of edges or use the interfaces, there should probably 
be a removeEdge() too.  I'm starting to feel that we should remove Edge 
from the user perspective, just keep it internally only for the add edge 
requests.  It just makes things a little more complex to the user (too 
many ways to do the same thing).  Perhaps the interfaces you specified 
could hit most of the use cases (getTargetVertices(), getEdgeValue(), 
addEdge(), and removeEdge()).  If there turns out to be a big need, we 
can always change it back to a SortedMap or something else more appropriate.

>>       Problem 1)  static mutable state (ew!) - lots of references to
>>     Vertex.setSuperStep(), Vertex.setNumVertices(),
>>     Vertex.setNumEdges(), Vertex.setGraphMapper(), etc... which means
>>     that implemeting BasicVertex / MutableVertex isn't good enough,
>>     you have to actually be a subclass of Vertex! :\
>     Oh, you found our dirty little secret. =)  Yes, we do have static
>     mutable state.  The concern is that if every vertex actually keeps
>     track of all those things as members, I think a decent amount of
>     memory would be wasted.  I'm sure there is a cleaner way to
>     implement this though (i.e. Flyweight pattern perhaps), just never
>     got around to fixing it.
>
>
> Yeah, keeping one extra 4 or 8 bytes of a reference to say, the 
> GraphMapper inside of each Vertex, and having the GraphMapper itself 
> centralize all of these generic Vertex attributes might be the way to 
> go.  I'll see if I can whip that up as part of my primitive patch.
Cool.

>>       Problem 2) Vertex.class is directly referenced as the base
>>     class in places like BspUtils, etc.
>     We never expected to have another base Vertex class. =)
>
>
> True, but in many of these places, it's just the interface type which 
> is required, right?  BaseVertex is what they all implement, is there a 
> reason why this can't be what's referred to in these cases?
>
>>     Most likely the easiest solution would be to pull out a
>>     AbstractVertex class, which holds this ugly static state, and
>>     make said state protected, instead of private, and have Vertex
>>     and LongDoubleFloatDoubleVertex both subclass it, and change
>>     reflection references from setClass(VERTEX_CLASS, vertexClass,
>>     Vertex.class); to setClass(VERTEX_CLASS, vertexClass,
>>     AbstractVertex.class);
>     That sounds reasonable.  One thing we have thought about is
>     compressing the vertices in memory at the cost of some CPU.  While
>     this is orthogonal, they are both efforts to conserve memory.
>
>
> Storing the edges as primitive arrays could be viewed as a fairly 
> maximally efficient compression, in cases where you have 
> primitive-able types.  I'm not sure how easy it would be to have a 
> generically compressible SortedMap<I, Edge<E,I>>, in a way which 
> allowed mutations of the map...
Still only in the thought phase, but I think the idea would be some kind 
of block compression of vertices (compress say 10k vertices at a time 
and keep an index).

>     Memory consumption (see above), these are aggregate members for
>     all the vertices.
>
>
> Ok, I'll see what it looks like if this data is moved to something 
> like a VertexState object attached to the GraphMapper, which all the 
> vertexes can have a reference to.
As I've thought more about primitives vs objects, I think the object 
flexibility is quite important.  The page rank example could probably 
get away with primitives, but other algorithms will likely require 
objects for edge values, message values, and vertex values (i.e. maybe 
storing the inlinks, or a bunch of different values i.e. multiple 
personalized page ranks run simultaneously).  I guess you're thinking 
that Giraph will have two separate implementations?  One that is 
primitives based and the other that is object based?

Mime
View raw message