Return-Path: X-Original-To: apmail-incubator-giraph-user-archive@minotaur.apache.org Delivered-To: apmail-incubator-giraph-user-archive@minotaur.apache.org Received: from mail.apache.org (hermes.apache.org [140.211.11.3]) by minotaur.apache.org (Postfix) with SMTP id 1D1BE86DA for ; Wed, 7 Sep 2011 22:43:31 +0000 (UTC) Received: (qmail 47626 invoked by uid 500); 7 Sep 2011 22:43:31 -0000 Delivered-To: apmail-incubator-giraph-user-archive@incubator.apache.org Received: (qmail 47596 invoked by uid 500); 7 Sep 2011 22:43:30 -0000 Mailing-List: contact giraph-user-help@incubator.apache.org; run by ezmlm Precedence: bulk List-Help: List-Unsubscribe: List-Post: List-Id: Reply-To: giraph-user@incubator.apache.org Delivered-To: mailing list giraph-user@incubator.apache.org Received: (qmail 47568 invoked by uid 99); 7 Sep 2011 22:43:30 -0000 Received: from nike.apache.org (HELO nike.apache.org) (192.87.106.230) by apache.org (qpsmtpd/0.29) with ESMTP; Wed, 07 Sep 2011 22:43:30 +0000 X-ASF-Spam-Status: No, hits=1.5 required=5.0 tests=FREEMAIL_FROM,HTML_MESSAGE,RCVD_IN_DNSWL_LOW,SPF_PASS X-Spam-Check-By: apache.org Received-SPF: pass (nike.apache.org: domain of jake.mannix@gmail.com designates 209.85.213.175 as permitted sender) Received: from [209.85.213.175] (HELO mail-yx0-f175.google.com) (209.85.213.175) by apache.org (qpsmtpd/0.29) with ESMTP; Wed, 07 Sep 2011 22:43:22 +0000 Received: by yxj17 with SMTP id 17so149153yxj.6 for ; Wed, 07 Sep 2011 15:43:01 -0700 (PDT) DKIM-Signature: v=1; a=rsa-sha256; c=relaxed/relaxed; d=gmail.com; s=gamma; h=mime-version:in-reply-to:references:from:date:message-id:subject:to :cc:content-type; bh=Iy//R5wit5IJmDkIVXR54WZ6ueD3cOIi2VYa2B30KNA=; b=HamIRoXv2lpczqGAm6hdPH5cJBFy4hc8dHrmmfKc/C0AHZAQd5dU4mC0xabL4L+5s1 1VtbuxOLgSqfP2r+VKh1O3F38fseF5xlyNutEIYr4Z+p+mXLhY7zvSRRBpyRDpvO7J0x Ec6yZT6tt+nj4cSIaSDxYsOgXbXz76JFnQ9mM= Received: by 10.150.8.10 with SMTP id 10mr176695ybh.60.1315435381085; Wed, 07 Sep 2011 15:43:01 -0700 (PDT) MIME-Version: 1.0 Received: by 10.151.153.12 with HTTP; Wed, 7 Sep 2011 15:42:41 -0700 (PDT) In-Reply-To: <4E67F143.5030501@apache.org> References: <4E66C2C2.1010808@apache.org> <4E671030.9030103@apache.org> <4E67E19B.1070204@apache.org> <4E67ECDC.8060802@apache.org> <4E67F143.5030501@apache.org> From: Jake Mannix Date: Wed, 7 Sep 2011 22:42:41 +0000 Message-ID: Subject: Re: Primitives vs Objects (the Movie!) To: Avery Ching Cc: giraph-user@incubator.apache.org, Dmitriy Ryaboy Content-Type: multipart/alternative; boundary=000e0cd2537c9d568f04ac61aa4c X-Virus-Checked: Checked by ClamAV on apache.org --000e0cd2537c9d568f04ac61aa4c Content-Type: text/plain; charset=ISO-8859-1 On Wed, Sep 7, 2011 at 10:33 PM, Avery Ching wrote: > This probably should have been a JIRA =). Yeah, probably! > I agree that update edge is probably useful as well. Maybe a Map is the > right thing then...rather than creating lots of methods to do edge > manipulation... > Or just Edge getEdge(I targetVertex) instead of E getEdgeValue(I targetVertex) > > Avery > > > On 9/7/11 3:22 PM, Dmitriy Ryaboy wrote: > >> I am going to buck the trend and not inline my thoughts, this is >> getting a little too thready :) >> >> Methinks you will want an updateEdgeValue(), too. >> >> D >> >> On Wed, Sep 7, 2011 at 3:14 PM, Avery Ching wrote: >> >>> On 9/7/11 3:00 PM, Jake Mannix wrote: >>> >>> On Wed, Sep 7, 2011 at 9:26 PM, Avery Ching wrote: >>> >>>> Haha, this really is turning into a movie =). I'll start warming up the >>>> popcorn. >>>> >>> Yeah, I've got my co-workers wondering if I'm going to ship any actual >>> production code *inside* the company this week, at this rate (*shhhhh* >>> Dmitriy don't tell!) >>> >>> >>> It's only Wednesday...=) >>> >>> On 9/7/11 12:51 PM, Jake Mannix wrote: >>>> >>>> Maybe a few more examples would help? Cases where you want to do a BSP >>>> computation where the total sort (both the vertexes, and the edges for >>>> each >>>> vertex) is required, as is the random access nature of the Map? >>>> >>>> I think the range based examples are the ones that immediately come to >>>> mind. BSP for graph processing is still pretty new, and I have no idea >>>> what >>>> kind of interesting algorithms will be tried out on this platform. We >>>> are >>>> still exploring many possible algorithms to run. >>>> >>> Ok, cool. I can see how wanting flexibility is important. >>> >>> I think the idea was that after returning the map, users could directly >>>> manipulate the map of edges or use the interfaces, there should probably >>>> be >>>> a removeEdge() too. I'm starting to feel that we should remove Edge >>>> from >>>> the user perspective, just keep it internally only for the add edge >>>> requests. It just makes things a little more complex to the user (too >>>> many >>>> ways to do the same thing). Perhaps the interfaces you specified could >>>> hit >>>> most of the use cases (getTargetVertices(), getEdgeValue(), addEdge(), >>>> and >>>> removeEdge()). If there turns out to be a big need, we can always >>>> change it >>>> back to a SortedMap or something else more appropriate. >>>> >>> >>> getTargetVertices(), getEdgeValue(), addEdge(), and removeEdge() >>> sound like the right level of flexibility while keeping the data >>> encapsulated (so you can try your block compression idea, I can try out >>> primitives, etc, but the interface remains the same). >>> >>>> Memory consumption (see above), these are aggregate members for all the >>>>> vertices. >>>>> >>>> Ok, I'll see what it looks like if this data is moved to something like >>>> a >>>> VertexState object attached to the GraphMapper, which all the vertexes >>>> can >>>> have a reference to. >>>> >>>> As I've thought more about primitives vs objects, I think the object >>>> flexibility is quite important. The page rank example could probably >>>> get >>>> away with primitives, but other algorithms will likely require objects >>>> for >>>> edge values, message values, and vertex values (i.e. maybe storing the >>>> inlinks, or a bunch of different values i.e. multiple personalized page >>>> ranks run simultaneously). I guess you're thinking that Giraph will >>>> have >>>> two separate implementations? One that is primitives based and the >>>> other >>>> that is object based? >>>> >>> I'm thinking that with the right interface (like discussed above), you >>> can >>> have the same base interface, but yeah, for the particular case of >>> implementers of BaseVertex where all of I,V, and E are wrappers >>> of >>> primitives, that there is some nice memory savings that can be done by >>> keeping them primitive (and only instantiating objects / autoboxing when >>> accessing via the generic methods, when this is transiently done). >>> PageRank isn't the only example where I'd want to get the (suspected) >>> perf >>> boost of using primitives, as most cases where I've dealt with graphs >>> everything gets normalized at some point - the input features all get >>> eventually turned into an edge "weight" of some kind, the vertexes >>> themselves maybe keep some small data with them, but the edges just look >>> like a target vertex id and an edge weight. For example, for social >>> graphs, >>> you can imagine lots and lots of fancy data you associate with users >>> (geo, >>> language, account freshness, recent text, topical interests, etc...), and >>> lots of things to associate with the edges (there are many ways users can >>> interact, beyond the explicit "x is connected to y" way), but when you >>> want >>> to run some big monstrous computation, this data is condensed into some >>> fixed final "connection strength" combination of weights. On the other >>> hand, to actually *compute* that connection weight, maybe non-local >>> information gleaned from a graph algorithm would be nice. Similarly, >>> computing a nice big topic-sensitive pagerank might require a bunch of >>> topic-weight metadata at the vertexes. >>> I don't know why I'm arguing this - I agree with you, keeping the >>> *ability* >>> to do object stuff is important, yes. I'm not advocating completely >>> primitivizing all of the base implementations. I'm just suggesting that >>> it >>> be added, as that's a pretty common use case which could benefit from >>> some >>> low-hanging fruit memory savings. >>> >>> Sounds right to me, just wanted to make sure that I was understanding >>> correctly what you wanted to do. >>> >>> >> >> > --000e0cd2537c9d568f04ac61aa4c Content-Type: text/html; charset=ISO-8859-1 Content-Transfer-Encoding: quoted-printable

On Wed, Sep 7, 2011 at 10:33 PM, Avery C= hing <aching@apac= he.org> wrote:
This probably should have been a JIRA =3D).

Yeah, probably!
=A0
=A0I a= gree that update edge is probably useful as well. =A0Maybe a Map is the rig= ht thing then...rather than creating lots of methods to do edge manipulatio= n...

Or just=A0

=A0 Ed= ge<I,E> getEdge(I targetVertex)

instead of= =A0

=A0 E getEdgeValue(I targetVertex)
= =A0

Avery


On 9/7/11 3:22 PM, Dmitriy Ryaboy wrote:
I am going to buck the trend and not inline my thoughts, this is
getting a little too thready :)

Methinks you will want an updateEdgeValue(), too.

D

On Wed, Sep 7, 2011 at 3:14 PM, Avery Ching<aching@apache.org> =A0wrote:
On 9/7/11 3:00 PM, Jake Mannix wrote:

On Wed, Sep 7, 2011 at 9:26 PM, Avery Ching<aching@apache.org> =A0wrote:
Haha, this really is turning into a movie =3D). =A0I'll start warming u= p the
popcorn.
Yeah, I've got my co-workers wondering if I'm going to ship any act= ual
production code *inside* the company this week, at this rate (*shhhhh*
Dmitriy don't tell!)


It's only Wednesday...=3D)

On 9/7/11 12:51 PM, Jake Mannix wrote:

Maybe a few more examples would help? =A0Cases where you want to do a BSP computation where the total sort (both the vertexes, and the edges for each=
vertex) is required, as is the random access nature of the Map?

I think the range based examples are the ones that immediately come to
mind. =A0BSP for graph processing is still pretty new, and I have no idea w= hat
kind of interesting algorithms will be tried out on this platform. =A0We ar= e
still exploring many possible algorithms to run.
Ok, cool. =A0I can see how wanting flexibility is important.

I think the idea was that after returning the map, users could directly
manipulate the map of edges or use the interfaces, there should probably be=
a removeEdge() too. =A0I'm starting to feel that we should remove Edge = from
the user perspective, just keep it internally only for the add edge
requests. =A0It just makes things a little more complex to the user (too ma= ny
ways to do the same thing). =A0Perhaps the interfaces you specified could h= it
most of the use cases (getTargetVertices(), getEdgeValue(), addEdge(), and<= br> removeEdge()). =A0If there turns out to be a big need, we can always change= it
back to a SortedMap or something else more appropriate.

getTargetVertices(), getEdgeValue(), addEdge(), and removeEdge()
sound like the right level of flexibility while keeping the data
encapsulated (so you can try your block compression idea, I can try out
primitives, etc, but the interface remains the same).
Memory consumption (see above), these are aggregate members for all the
vertices.
Ok, I'll see what it looks like if this data is moved to something like= a
VertexState object attached to the GraphMapper, which all the vertexes can<= br> have a reference to.

As I've thought more about primitives vs objects, I think the object flexibility is quite important. =A0The page rank example could probably get=
away with primitives, but other algorithms will likely require objects for<= br> edge values, message values, and vertex values (i.e. maybe storing the
inlinks, or a bunch of different values i.e. multiple personalized page
ranks run simultaneously). =A0I guess you're thinking that Giraph will = have
two separate implementations? =A0One that is primitives based and the other=
that is object based?
I'm thinking that with the right interface (like discussed above), you = can
have the same base interface, but yeah, for the particular case of
implementers of BaseVertex<I,V,E,M> =A0where all of I,V, and E are wr= appers of
primitives, that there is some nice memory savings that can be done by
keeping them primitive (and only instantiating objects / autoboxing when accessing via the generic methods, when this is transiently done).
PageRank isn't the only example where I'd want to get the (suspecte= d) perf
boost of using primitives, as most cases where I've dealt with graphs everything gets normalized at some point - the input features all get
eventually turned into an edge "weight" of some kind, the vertexe= s
themselves maybe keep some small data with them, but the edges just look like a target vertex id and an edge weight. =A0For example, for social grap= hs,
you can imagine lots and lots of fancy data you associate with users (geo,<= br> language, account freshness, recent text, topical interests, etc...), and lots of things to associate with the edges (there are many ways users can interact, beyond the explicit "x is connected to y" way), but whe= n you want
to run some big monstrous computation, this data is condensed into some
fixed final "connection strength" combination of weights. =A0On t= he other
hand, to actually *compute* that connection weight, maybe non-local
information gleaned from a graph algorithm would be nice. =A0Similarly,
computing a nice big topic-sensitive pagerank might require a bunch of
topic-weight metadata at the vertexes.
I don't know why I'm arguing this - I agree with you, keeping the *= ability*
to do object stuff is important, yes. =A0I'm not advocating completely<= br> primitivizing all of the base implementations. =A0I'm just suggesting t= hat it
be added, as that's a pretty common use case which could benefit from s= ome
low-hanging fruit memory savings.

Sounds right to me, just wanted to make sure that I was understanding
correctly what you wanted to do.





--000e0cd2537c9d568f04ac61aa4c--