incubator-giraph-user mailing list archives

Site index · List index
Message view « Date » · « Thread »
Top « Date » · « Thread »
From Avery Ching <ach...@apache.org>
Subject Re: Primitives vs Objects (the Movie!)
Date Wed, 07 Sep 2011 22:33:39 GMT
This probably should have been a JIRA =).  I agree that update edge is 
probably useful as well.  Maybe a Map is the right thing then...rather 
than creating lots of methods to do edge manipulation...

Avery

On 9/7/11 3:22 PM, Dmitriy Ryaboy wrote:
> I am going to buck the trend and not inline my thoughts, this is
> getting a little too thready :)
>
> Methinks you will want an updateEdgeValue(), too.
>
> D
>
> On Wed, Sep 7, 2011 at 3:14 PM, Avery Ching<aching@apache.org>  wrote:
>> On 9/7/11 3:00 PM, Jake Mannix wrote:
>>
>> On Wed, Sep 7, 2011 at 9:26 PM, Avery Ching<aching@apache.org>  wrote:
>>> Haha, this really is turning into a movie =).  I'll start warming up the
>>> popcorn.
>> Yeah, I've got my co-workers wondering if I'm going to ship any actual
>> production code *inside* the company this week, at this rate (*shhhhh*
>> Dmitriy don't tell!)
>>
>>
>> It's only Wednesday...=)
>>
>>> On 9/7/11 12:51 PM, Jake Mannix wrote:
>>>
>>> Maybe a few more examples would help?  Cases where you want to do a BSP
>>> computation where the total sort (both the vertexes, and the edges for each
>>> vertex) is required, as is the random access nature of the Map?
>>>
>>> I think the range based examples are the ones that immediately come to
>>> mind.  BSP for graph processing is still pretty new, and I have no idea what
>>> kind of interesting algorithms will be tried out on this platform.  We are
>>> still exploring many possible algorithms to run.
>> Ok, cool.  I can see how wanting flexibility is important.
>>
>>> I think the idea was that after returning the map, users could directly
>>> manipulate the map of edges or use the interfaces, there should probably be
>>> a removeEdge() too.  I'm starting to feel that we should remove Edge from
>>> the user perspective, just keep it internally only for the add edge
>>> requests.  It just makes things a little more complex to the user (too many
>>> ways to do the same thing).  Perhaps the interfaces you specified could hit
>>> most of the use cases (getTargetVertices(), getEdgeValue(), addEdge(), and
>>> removeEdge()).  If there turns out to be a big need, we can always change it
>>> back to a SortedMap or something else more appropriate.
>>
>> getTargetVertices(), getEdgeValue(), addEdge(), and removeEdge()
>> sound like the right level of flexibility while keeping the data
>> encapsulated (so you can try your block compression idea, I can try out
>> primitives, etc, but the interface remains the same).
>>>> Memory consumption (see above), these are aggregate members for all the
>>>> vertices.
>>> Ok, I'll see what it looks like if this data is moved to something like a
>>> VertexState object attached to the GraphMapper, which all the vertexes can
>>> have a reference to.
>>>
>>> As I've thought more about primitives vs objects, I think the object
>>> flexibility is quite important.  The page rank example could probably get
>>> away with primitives, but other algorithms will likely require objects for
>>> edge values, message values, and vertex values (i.e. maybe storing the
>>> inlinks, or a bunch of different values i.e. multiple personalized page
>>> ranks run simultaneously).  I guess you're thinking that Giraph will have
>>> two separate implementations?  One that is primitives based and the other
>>> that is object based?
>> I'm thinking that with the right interface (like discussed above), you can
>> have the same base interface, but yeah, for the particular case of
>> implementers of BaseVertex<I,V,E,M>  where all of I,V, and E are wrappers of
>> primitives, that there is some nice memory savings that can be done by
>> keeping them primitive (and only instantiating objects / autoboxing when
>> accessing via the generic methods, when this is transiently done).
>> PageRank isn't the only example where I'd want to get the (suspected) perf
>> boost of using primitives, as most cases where I've dealt with graphs
>> everything gets normalized at some point - the input features all get
>> eventually turned into an edge "weight" of some kind, the vertexes
>> themselves maybe keep some small data with them, but the edges just look
>> like a target vertex id and an edge weight.  For example, for social graphs,
>> you can imagine lots and lots of fancy data you associate with users (geo,
>> language, account freshness, recent text, topical interests, etc...), and
>> lots of things to associate with the edges (there are many ways users can
>> interact, beyond the explicit "x is connected to y" way), but when you want
>> to run some big monstrous computation, this data is condensed into some
>> fixed final "connection strength" combination of weights.  On the other
>> hand, to actually *compute* that connection weight, maybe non-local
>> information gleaned from a graph algorithm would be nice.  Similarly,
>> computing a nice big topic-sensitive pagerank might require a bunch of
>> topic-weight metadata at the vertexes.
>> I don't know why I'm arguing this - I agree with you, keeping the *ability*
>> to do object stuff is important, yes.  I'm not advocating completely
>> primitivizing all of the base implementations.  I'm just suggesting that it
>> be added, as that's a pretty common use case which could benefit from some
>> low-hanging fruit memory savings.
>>
>> Sounds right to me, just wanted to make sure that I was understanding
>> correctly what you wanted to do.
>>
>
>


Mime
View raw message