giraph-user mailing list archives

Site index · List index
Message view « Date » · « Thread »
Top « Date » · « Thread »
From Jake Mannix <jake.man...@gmail.com>
Subject Re: Primitives vs Objects (the Movie!)
Date Wed, 07 Sep 2011 22:42:41 GMT
On Wed, Sep 7, 2011 at 10:33 PM, Avery Ching <aching@apache.org> wrote:

> This probably should have been a JIRA =).


Yeah, probably!


>  I agree that update edge is probably useful as well.  Maybe a Map is the
> right thing then...rather than creating lots of methods to do edge
> manipulation...
>

Or just

  Edge<I,E> getEdge(I targetVertex)

instead of

  E getEdgeValue(I targetVertex)


>
> Avery
>
>
> On 9/7/11 3:22 PM, Dmitriy Ryaboy wrote:
>
>> I am going to buck the trend and not inline my thoughts, this is
>> getting a little too thready :)
>>
>> Methinks you will want an updateEdgeValue(), too.
>>
>> D
>>
>> On Wed, Sep 7, 2011 at 3:14 PM, Avery Ching<aching@apache.org>  wrote:
>>
>>> On 9/7/11 3:00 PM, Jake Mannix wrote:
>>>
>>> On Wed, Sep 7, 2011 at 9:26 PM, Avery Ching<aching@apache.org>  wrote:
>>>
>>>> Haha, this really is turning into a movie =).  I'll start warming up the
>>>> popcorn.
>>>>
>>> Yeah, I've got my co-workers wondering if I'm going to ship any actual
>>> production code *inside* the company this week, at this rate (*shhhhh*
>>> Dmitriy don't tell!)
>>>
>>>
>>> It's only Wednesday...=)
>>>
>>>  On 9/7/11 12:51 PM, Jake Mannix wrote:
>>>>
>>>> Maybe a few more examples would help?  Cases where you want to do a BSP
>>>> computation where the total sort (both the vertexes, and the edges for
>>>> each
>>>> vertex) is required, as is the random access nature of the Map?
>>>>
>>>> I think the range based examples are the ones that immediately come to
>>>> mind.  BSP for graph processing is still pretty new, and I have no idea
>>>> what
>>>> kind of interesting algorithms will be tried out on this platform.  We
>>>> are
>>>> still exploring many possible algorithms to run.
>>>>
>>> Ok, cool.  I can see how wanting flexibility is important.
>>>
>>>  I think the idea was that after returning the map, users could directly
>>>> manipulate the map of edges or use the interfaces, there should probably
>>>> be
>>>> a removeEdge() too.  I'm starting to feel that we should remove Edge
>>>> from
>>>> the user perspective, just keep it internally only for the add edge
>>>> requests.  It just makes things a little more complex to the user (too
>>>> many
>>>> ways to do the same thing).  Perhaps the interfaces you specified could
>>>> hit
>>>> most of the use cases (getTargetVertices(), getEdgeValue(), addEdge(),
>>>> and
>>>> removeEdge()).  If there turns out to be a big need, we can always
>>>> change it
>>>> back to a SortedMap or something else more appropriate.
>>>>
>>>
>>> getTargetVertices(), getEdgeValue(), addEdge(), and removeEdge()
>>> sound like the right level of flexibility while keeping the data
>>> encapsulated (so you can try your block compression idea, I can try out
>>> primitives, etc, but the interface remains the same).
>>>
>>>> Memory consumption (see above), these are aggregate members for all the
>>>>> vertices.
>>>>>
>>>> Ok, I'll see what it looks like if this data is moved to something like
>>>> a
>>>> VertexState object attached to the GraphMapper, which all the vertexes
>>>> can
>>>> have a reference to.
>>>>
>>>> As I've thought more about primitives vs objects, I think the object
>>>> flexibility is quite important.  The page rank example could probably
>>>> get
>>>> away with primitives, but other algorithms will likely require objects
>>>> for
>>>> edge values, message values, and vertex values (i.e. maybe storing the
>>>> inlinks, or a bunch of different values i.e. multiple personalized page
>>>> ranks run simultaneously).  I guess you're thinking that Giraph will
>>>> have
>>>> two separate implementations?  One that is primitives based and the
>>>> other
>>>> that is object based?
>>>>
>>> I'm thinking that with the right interface (like discussed above), you
>>> can
>>> have the same base interface, but yeah, for the particular case of
>>> implementers of BaseVertex<I,V,E,M>  where all of I,V, and E are wrappers
>>> of
>>> primitives, that there is some nice memory savings that can be done by
>>> keeping them primitive (and only instantiating objects / autoboxing when
>>> accessing via the generic methods, when this is transiently done).
>>> PageRank isn't the only example where I'd want to get the (suspected)
>>> perf
>>> boost of using primitives, as most cases where I've dealt with graphs
>>> everything gets normalized at some point - the input features all get
>>> eventually turned into an edge "weight" of some kind, the vertexes
>>> themselves maybe keep some small data with them, but the edges just look
>>> like a target vertex id and an edge weight.  For example, for social
>>> graphs,
>>> you can imagine lots and lots of fancy data you associate with users
>>> (geo,
>>> language, account freshness, recent text, topical interests, etc...), and
>>> lots of things to associate with the edges (there are many ways users can
>>> interact, beyond the explicit "x is connected to y" way), but when you
>>> want
>>> to run some big monstrous computation, this data is condensed into some
>>> fixed final "connection strength" combination of weights.  On the other
>>> hand, to actually *compute* that connection weight, maybe non-local
>>> information gleaned from a graph algorithm would be nice.  Similarly,
>>> computing a nice big topic-sensitive pagerank might require a bunch of
>>> topic-weight metadata at the vertexes.
>>> I don't know why I'm arguing this - I agree with you, keeping the
>>> *ability*
>>> to do object stuff is important, yes.  I'm not advocating completely
>>> primitivizing all of the base implementations.  I'm just suggesting that
>>> it
>>> be added, as that's a pretty common use case which could benefit from
>>> some
>>> low-hanging fruit memory savings.
>>>
>>> Sounds right to me, just wanted to make sure that I was understanding
>>> correctly what you wanted to do.
>>>
>>>
>>
>>
>

Mime
View raw message