Mailing-List: contact giraph-user-help@incubator.apache.org; run by ezmlm
Precedence: bulk
Reply-To: giraph-user@incubator.apache.org
Received-SPF: pass (nike.apache.org: domain of jake.mannix@gmail.com
 designates 209.85.213.175 as permitted sender)
MIME-Version: 1.0
In-Reply-To: <4E67F143.5030501@apache.org>
References: 
 <CACYXym9iLstxWybc0roRQaAtz8a5wF8O2VV3-d8_SdQ-dEoRoQ@mail.gmail.com>
 <4E66C2C2.1010808@apache.org>
 <CACYXym9Xf1G0xyLW5XwqwfxBwVu9jwTaKn5y+NhBkmnP8t_Ltw@mail.gmail.com>
 <4E671030.9030103@apache.org>
 <CACYXym-BebJTBZDJAOWS+=7Rr7vd6LGt97gFpwYe_sh5-uE29g@mail.gmail.com>
 <4E67E19B.1070204@apache.org>
 <CACYXym8f96utEFFbsaiZFoLOLzd+skWf3iG=XNqUFQwMtph70A@mail.gmail.com>
 <4E67ECDC.8060802@apache.org>
 <CA+98KLTknbZW21Umo4zZQTbLC8vfVm_k2n-_1+AZDQhSOGe4Lw@mail.gmail.com>
 <4E67F143.5030501@apache.org>
From: Jake Mannix <jake.mannix@gmail.com>
Date: Wed, 7 Sep 2011 22:42:41 +0000
Message-ID: 
 <CACYXym_13n9oSD+VnJmHZ3wxR77WM3fAbpJFr1Y4ConiFL+rLw@mail.gmail.com>
Subject: Re: Primitives vs Objects (the Movie!)
To: Avery Ching <aching@apache.org>
Cc: giraph-user@incubator.apache.org, Dmitriy Ryaboy <dmitriy@twitter.com>
Content-Type: multipart/alternative; boundary=000e0cd2537c9d568f04ac61aa4c

--000e0cd2537c9d568f04ac61aa4c
Content-Type: text/plain; charset=ISO-8859-1

On Wed, Sep 7, 2011 at 10:33 PM, Avery Ching <aching@apache.org> wrote:

> This probably should have been a JIRA =).


Yeah, probably!


>  I agree that update edge is probably useful as well.  Maybe a Map is the
> right thing then...rather than creating lots of methods to do edge
> manipulation...
>

Or just

  Edge<I,E> getEdge(I targetVertex)

instead of

  E getEdgeValue(I targetVertex)
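
To make the difference concrete, here is a rough sketch of what returning
the Edge would buy us. Everything in it is an assumption for illustration
only: the Edge class shape, its setValue() method, and the EdgeStore
container are hypothetical names, not the actual Giraph classes.

```java
import java.util.HashMap;
import java.util.Map;

/** Hypothetical mutable edge; the real Giraph Edge class may look different. */
final class Edge<I, E> {
    private final I targetVertex;
    private E value;

    Edge(I targetVertex, E value) {
        this.targetVertex = targetVertex;
        this.value = value;
    }

    I getTargetVertex() { return targetVertex; }

    E getValue() { return value; }

    /** Lets a caller update the edge in place, so no updateEdgeValue() is needed. */
    void setValue(E value) { this.value = value; }
}

/** Hypothetical per-vertex edge container offering both accessor styles. */
final class EdgeStore<I, E> {
    private final Map<I, Edge<I, E>> edges = new HashMap<>();

    void addEdge(I target, E value) {
        edges.put(target, new Edge<>(target, value));
    }

    /** Option 1: hand back the Edge itself (null if there is no such edge). */
    Edge<I, E> getEdge(I target) {
        return edges.get(target);
    }

    /** Option 2: hand back only the value (null if there is no such edge). */
    E getEdgeValue(I target) {
        Edge<I, E> edge = edges.get(target);
        return edge == null ? null : edge.getValue();
    }
}
```

With the first form, an update is just store.getEdge(target).setValue(v).
The catch is that the vertex then has to hold (or create on demand) an Edge
object per edge, which cuts against keeping the storage primitive.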


>
> Avery
>
>
> On 9/7/11 3:22 PM, Dmitriy Ryaboy wrote:
>
>> I am going to buck the trend and not inline my thoughts, this is
>> getting a little too thready :)
>>
>> Methinks you will want an updateEdgeValue(), too.
>>
>> D
>>
>> On Wed, Sep 7, 2011 at 3:14 PM, Avery Ching<aching@apache.org>  wrote:
>>
>>> On 9/7/11 3:00 PM, Jake Mannix wrote:
>>>
>>> On Wed, Sep 7, 2011 at 9:26 PM, Avery Ching<aching@apache.org>  wrote:
>>>
>>>> Haha, this really is turning into a movie =).  I'll start warming up the
>>>> popcorn.
>>>>
>>> Yeah, I've got my co-workers wondering if I'm going to ship any actual
>>> production code *inside* the company this week, at this rate (*shhhhh*
>>> Dmitriy don't tell!)
>>>
>>>
>>> It's only Wednesday...=)
>>>
>>>  On 9/7/11 12:51 PM, Jake Mannix wrote:
>>>>
>>>> Maybe a few more examples would help?  Cases where you want to do a BSP
>>>> computation where the total sort (both the vertexes, and the edges for
>>>> each vertex) is required, as is the random access nature of the Map?
>>>>
>>>> I think the range based examples are the ones that immediately come to
>>>> mind.  BSP for graph processing is still pretty new, and I have no idea
>>>> what kind of interesting algorithms will be tried out on this platform.
>>>> We are still exploring many possible algorithms to run.
>>>>
>>> Ok, cool.  I can see how wanting flexibility is important.
>>>
>>>> I think the idea was that after returning the map, users could directly
>>>> manipulate the map of edges or use the interfaces; there should probably
>>>> be a removeEdge() too.  I'm starting to feel that we should remove Edge
>>>> from the user perspective, and just keep it internally for the add edge
>>>> requests.  It just makes things a little more complex for the user (too
>>>> many ways to do the same thing).  Perhaps the interfaces you specified
>>>> could hit most of the use cases (getTargetVertices(), getEdgeValue(),
>>>> addEdge(), and removeEdge()).  If there turns out to be a big need, we
>>>> can always change it back to a SortedMap or something else more
>>>> appropriate.
>>>>
>>>
>>> getTargetVertices(), getEdgeValue(), addEdge(), and removeEdge()
>>> sound like the right level of flexibility while keeping the data
>>> encapsulated (so you can try your block compression idea, I can try out
>>> primitives, etc, but the interface remains the same).
>>>
>>>> Memory consumption (see above), these are aggregate members for all the
>>>>> vertices.
>>>>>
>>>> Ok, I'll see what it looks like if this data is moved to something like
>>>> a VertexState object attached to the GraphMapper, which all the vertexes
>>>> can have a reference to.
>>>>
>>>> As I've thought more about primitives vs objects, I think the object
>>>> flexibility is quite important.  The page rank example could probably
>>>> get away with primitives, but other algorithms will likely require
>>>> objects for edge values, message values, and vertex values (e.g. maybe
>>>> storing the inlinks, or a bunch of different values, such as multiple
>>>> personalized page ranks run simultaneously).  I guess you're thinking
>>>> that Giraph will have two separate implementations?  One that is
>>>> primitives based and the other that is object based?
>>>>
>>> I'm thinking that with the right interface (like discussed above), you
>>> can have the same base interface, but yeah, for the particular case of
>>> implementers of BaseVertex<I,V,E,M> where all of I, V, and E are wrappers
>>> of primitives, there are some nice memory savings to be had by keeping
>>> them primitive (and only instantiating objects / autoboxing transiently,
>>> when accessing via the generic methods).
>>>
>>> PageRank isn't the only example where I'd want to get the (suspected)
>>> perf boost of using primitives, as in most cases where I've dealt with
>>> graphs everything gets normalized at some point - the input features all
>>> get eventually turned into an edge "weight" of some kind, the vertexes
>>> themselves maybe keep some small data with them, but the edges just look
>>> like a target vertex id and an edge weight.  For example, for social
>>> graphs, you can imagine lots and lots of fancy data you associate with
>>> users (geo, language, account freshness, recent text, topical interests,
>>> etc...), and lots of things to associate with the edges (there are many
>>> ways users can interact, beyond the explicit "x is connected to y" way),
>>> but when you want to run some big monstrous computation, this data is
>>> condensed into some fixed final "connection strength" combination of
>>> weights.  On the other hand, to actually *compute* that connection
>>> weight, maybe non-local information gleaned from a graph algorithm would
>>> be nice.  Similarly, computing a nice big topic-sensitive pagerank might
>>> require a bunch of topic-weight metadata at the vertexes.
>>>
>>> I don't know why I'm arguing this - I agree with you, keeping the
>>> *ability* to do object stuff is important, yes.  I'm not advocating
>>> completely primitivizing all of the base implementations.  I'm just
>>> suggesting that it be added, as that's a pretty common use case which
>>> could benefit from some low-hanging fruit memory savings.
>>>
>>> Sounds right to me, just wanted to make sure that I was understanding
>>> correctly what you wanted to do.
>>>
>>>
>>
>>
>
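
For the primitives case discussed further down the thread, a minimal sketch
of what an encapsulated, primitive-backed edge list might look like behind
getTargetVertices(), getEdgeValue(), addEdge(), and removeEdge(). The class
name LongDoubleEdges, the parallel-array layout, and the linear scan are all
assumptions made for illustration; this is not how Giraph stores edges.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

/**
 * Hypothetical edge storage for long vertex ids and double weights.
 * Data lives in parallel primitive arrays; boxed objects are only
 * created transiently when a caller goes through the generic-style methods.
 */
final class LongDoubleEdges {
    private long[] targets = new long[4];
    private double[] weights = new double[4];
    private int size;

    /** Linear scan; returns the slot of the target id, or -1 if absent. */
    private int indexOf(long target) {
        for (int i = 0; i < size; i++) {
            if (targets[i] == target) {
                return i;
            }
        }
        return -1;
    }

    /** Adds an edge, or overwrites the weight if the target already exists. */
    void addEdge(Long target, Double weight) {
        int i = indexOf(target);
        if (i >= 0) {
            weights[i] = weight;
            return;
        }
        if (size == targets.length) {
            // Grow both arrays together so they stay aligned.
            targets = Arrays.copyOf(targets, size * 2);
            weights = Arrays.copyOf(weights, size * 2);
        }
        targets[size] = target;
        weights[size] = weight;
        size++;
    }

    /** Returns the boxed weight, or null if there is no such edge. */
    Double getEdgeValue(Long target) {
        int i = indexOf(target);
        if (i < 0) {
            return null;
        }
        return weights[i];
    }

    /** Removes the edge by moving the last entry into its slot. */
    boolean removeEdge(Long target) {
        int i = indexOf(target);
        if (i < 0) {
            return false;
        }
        size--;
        targets[i] = targets[size];
        weights[i] = weights[size];
        return true;
    }

    /** Boxes the target ids into a fresh list on each call. */
    List<Long> getTargetVertices() {
        List<Long> out = new ArrayList<>(size);
        for (int i = 0; i < size; i++) {
            out.add(targets[i]);
        }
        return out;
    }
}
```

Since callers only see the four methods, the arrays could later be swapped
for a sorted or block-compressed layout without touching user code.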

--000e0cd2537c9d568f04ac61aa4c--