giraph-dev mailing list archives

Site index · List index
Message view « Date » · « Thread »
Top « Date » · « Thread »
From Alessandro Presta <alessan...@fb.com>
Subject Re: LongDoubleFloatDoubleVertex
Date Mon, 04 Mar 2013 00:57:57 GMT
I don't contest the graph-matrix duality, but I'm not yet convinced that
you need random access to the edges for the scenarios you mentioned.
PageRank and label propagation, two of the most common applications of our
framework, have indeed a linear algebraic formulation, the bulk of which
is always matrix multiplication. The most natural and efficient (if not
the only) implementations in our model involve simply iterating over the
edges.
As for your "dot products" example, maybe it would help if you could be a
bit more specific: what are your vertices and edges? How do we get the
"input vectors"?. Everything should be expressed in terms of vertex
values, edges, and messages, otherwise it doesn't fit the current Giraph
model anyway.
In any case, to compute a dot product efficiently you don't need random
access to both vectors: if you have the non-zero coordinates of vector A
(equivalent of the edge list), and random access to vector B, you iterate
over A and lookup in B. This is efficient.

> I'm not sure what this would correspond to in graphland, but I can
> certainly see
> wanting to have a big in-memory matrix which you can compute dot products
> of vectors with each row of it, and Giraph would do this very efficiently
> (but some
> of the implementations of this would assume you had random access to the
> edges (meaning edgeId and edgeWeight of each vertex).


Here we are talking about a method with signature

    EdgeValueType getEdgeValue(VertexIdType targetVertexId)

or the multigraph version

    Iterable<EdgeValueType> getEdgeValues(VertexIdType targetVertexId)

I'm looking for examples of algorithms that can't be implemented in Giraph
because of the lack of this method, and would otherwise be a good fit.

Having said that, I think there *is* still a case for hash-map-backed
edges, and that is when we need to guarantee a "strict graph" (i.e. no
parallel edges) or when our algorithm uses remote edge removal requests.
So I'm not against having an optimized LongDoubleHashMapEdges (and
LongNullHashSetEdges for the unweighted case) included, provided that it's
correct and efficient.
I will make sure I include it. Now it's a matter of picking a primitive
collections library among HPPC, fastutil, Mahout, Trove, etc...

On 3/1/13 4:12 PM, "Jake Mannix" <jake.mannix@gmail.com> wrote:

>Sorry to have missed out on some of this conversation - had some work
>stuff
>interrupt (how dare it!)
>
>On Thu, Feb 28, 2013 at 8:13 PM, Alessandro Presta
><alessandro@fb.com>wrote:
>
>> On 2/28/13 5:08 PM, "Jake Mannix" <jake.mannix@gmail.com> wrote:
>>
>> >On Thu, Feb 28, 2013 at 4:18 PM, Alessandro Presta
>> ><alessandro@fb.com>wrote:
>> >
>> >> It's not like it causes problems, it's just that it's a pretty big
>> >> dependency to justify for a small example.
>> >>
>> >> As for the motivation, if your point is to prove the framework's
>> >> superiority in some context, then you can use the simplest possible
>> >> implementation (ArrayList).
>> >>
>> >
>> >This takes up LOTS of memory.  Primitives rule, objects drool (memory).
>>
>> Ok, but having to copy the keys and values to external arrays (which is
>> what you have to do with Mahout's hash map) is even worse. A good
>> implementation of (long, double) edges (e.g. for RandomWalkVertex) is
>> primitive arrays.
>>
>
>So it is certainly true that the *typical* usage of iteration over Mahout
>maps
>is by doing this copy of internal values (which is not efficient, no),
>it's
>not necessary: OpenLongFloatHashMap.forEachPair(LongFloatProcedure p)
>passes in a callback which operates directly on the underlying primitive
>arrays.
>
>
>>
>> >
>> >
>> >> The Giraph framework is all about iterating over edges, so an
>> >> implementation that doesn't support that with reasonable efficiency
>> >> doesn't make a lot of sense to me.
>> >>
>> >> It also follows that hash map-based implementations only make sense
>>for
>> >> algorithms that make use of mutations.
>> >>
>> >
>> >Agreed, maps aren't absolutely necessary for immutable graphs.  But
>>random
>> >access to collections of integers _is_ necessary for many algorithms.
>> >It's
>> >not all just iteration over lists.
>>
>> Can you give me some concrete examples of algorithms where random access
>> to the edges is required?
>> I'm really interested in this, because I'm considering killing
>> getEdgeValue() if there are no use cases.
>>
>
>So I'm not a graph person, I think in terms of matrices, but since every
>graph
>has an adjecency matrix, and every matrix is the adjacency matrix of some
>(possibly
>bipartite) graph, lets pretend they're basically the same:
>
>If you load a graph into Giraph, and want to compute the matrix product of
>this
>graph with collection of input vectors, then you may want to take each of
>these
>input vectors and compute their dot products with the vertices of the
>graph
>(to
>mix my metaphors: each vertex is essentially a row or column of the
>matrix),
>which can be done by either iterating over the input vector and looking up
>randomly for nonzero entries in the vertex, or the reverse.
>
>I'm not sure what this would correspond to in graphland, but I can
>certainly see
>wanting to have a big in-memory matrix which you can compute dot products
>of vectors with each row of it, and Giraph would do this very efficiently
>(but some
>of the implementations of this would assume you had random access to the
>edges (meaning edgeId and edgeWeight of each vertex).
>
>
>>
>> >
>> >
>> >> That said, something like a Trove hash map would probably be more
>> >> appropriate (more efficient than the standard Java HashMap, at the
>> >>expense
>> >> of generality). That could be a good candidate for a
>> >> LongDoubleHashMapEdges implementation.
>> >> I can give that a shot if it sounds good.
>> >>
>> >
>> >Trove is LGPL, IIRC, so that doesn't work in Apache projects.
>>
>> Whoops, totally missed that part.
>>
>> >
>> >What does Giraph depend on of Mahout now?  Just mahout-collections?
>> >That's
>> >not a very big dependency, and has all sorts of primitive-to-primitive
>> >collections,
>> >and from what I've seen benchmarked, just as good or better than Trove.
>> > Carrot2's
>> >hppc may be better yet, but I'm not sure if that is stably released
>>yet.
>>
>> I'll take a look at HPPC and Mahout collections then. They all seem to
>> provide the same stuff, so I'll consider benchmarks and convenience of
>>the
>> API. Thanks for the pointers.
>>
>> >
>> >
>> >>
>> >> On 2/28/13 4:03 PM, "Jake Mannix" <jake.mannix@gmail.com> wrote:
>> >>
>> >> >Is the mahout dependency causing problems?
>> >> >
>> >> >It would be nice if we could actually implement some of the
>>algorithms
>> >> >that
>> >> >Mahout does via map-reduce in Giraph's BSP formalism, to show off
>>how
>> >>it
>> >> >improves things.  Using the Mahout primitives can show that it's not
>> >>about
>> >> >the inner loop implementation, but the framework itself...
>> >> >
>> >> >
>> >> >On Thu, Feb 28, 2013 at 1:55 PM, Eli Reisman
>> >> ><apache.mailbox@gmail.com>wrote:
>> >> >
>> >> >> I like the idea of refactoring it into something more appropriate
>> >>for us
>> >> >> and ditching the Mahout dep. Good looking out.
>> >> >>
>> >> >>
>> >> >> On Thu, Feb 28, 2013 at 10:15 AM, Claudio Martella <
>> >> >> claudio.martella@gmail.com> wrote:
>> >> >>
>> >> >> > I agree, at this point we could have a RandomWalkVertex with
>>edge
>> >> >>values,
>> >> >> > and a "null-edged" vertex for the PR benchmarks.
>> >> >> > We make everybody happy and avoid code duplication.
>> >> >> >
>> >> >> >
>> >> >> > On Thu, Feb 28, 2013 at 7:12 PM, Alessandro Presta
>> >><alessandro@fb.com
>> >> >> > >wrote:
>> >> >> >
>> >> >> > > Hi Gianmarco,
>> >> >> > >
>> >> >> > > Yes, there will be more efficient implementations.
>> >> >> > > In the redesign I'm working on (GIRAPH-528), there will
be
>>only
>> >>one
>> >> >> > Vertex
>> >> >> > > class and edge storage is delegated to a VertexEdges
class.
>> >> >> > > So far I'm adding some generic implementations
>>(ByteArrayEdges,
>> >> >> > > ArrayListEdges, HashMapEdges) that work for all types,
and
>>some
>> >> >> optimized
>> >> >> > > ones (LongDoubleArrayEdges, LongNullArrayEdges).
>> >> >> > >
>> >> >> > > Do you specifically need edge values to be float while
the
>>other
>> >> >>types
>> >> >> > are
>> >> >> > > double?
>> >> >> > > It seems to me it would make sense to change RandomWalkVertex
>>to
>> >>use
>> >> >> > > double edge values instead, and avoid code duplication
(i.e.
>> >>adding
>> >> >>a
>> >> >> > > LongFloatArrayEdges that's basically the same). We're
not
>>Trove
>> >> >>after
>> >> >> > all.
>> >> >> > > Makes sense?
>> >> >> > >
>> >> >> > > Thanks for the feedback,
>> >> >> > >
>> >> >> > > Alessandro
>> >> >> > >
>> >> >> > >
>> >> >> > > On 2/28/13 1:54 AM, "Gianmarco De Francisci Morales"
>> >> >><gdfm@apache.org>
>> >> >> > > wrote:
>> >> >> > >
>> >> >> > > >Hi,
>> >> >> > > >
>> >> >> > > >Maybe the specific implementation can be thrown away,
but
>> >> >>personally I
>> >> >> > > >feel
>> >> >> > > >very strongly for the need of a good LongDoubleFloatDouble
>> >>vertex.
>> >> >> > > >It's the base for any serious random walk algorithm.
>> >> >> > > >
>> >> >> > > >I would call for a refactoring rather than a removal.
>> >> >> > > >
>> >> >> > > >Just my 2c.
>> >> >> > > >
>> >> >> > > >Cheers,
>> >> >> > > >
>> >> >> > > >--
>> >> >> > > >Gianmarco
>> >> >> > > >
>> >> >> > > >
>> >> >> > > >On Thu, Feb 28, 2013 at 7:54 AM, Alessandro Presta
>> >> >> > > ><alessandro@fb.com>wrote:
>> >> >> > > >
>> >> >> > > >> Hi all,
>> >> >> > > >>
>> >> >> > > >> Does anyone feel strongly for LongDoubleFloatDoubleVertex?
>> >> >> > > >> Reasons why I think it should be removed:
>> >> >> > > >>
>> >> >> > > >>   1.  Right now it's incorrect (returns target
vertex id as
>> >>edge
>> >> >> > value).
>> >> >> > > >>   2.  Iteration will always be inefficient,
since the
>> >>underlying
>> >> >> > Mahout
>> >> >> > > >> open-addressing hash map implementation doesn't
provide
>> >> >>iterators.
>> >> >> It
>> >> >> > > >> provides a way to copy the keys and values to
external
>> >> >>arrays/lists.
>> >> >> > > >>   3.  It's the only reason why we have Mahout
as a
>>dependency.
>> >> >> > > >>
>> >> >> > > >> I think we should strive to provide model implementations
>>that
>> >> >>are
>> >> >> > > >>generic
>> >> >> > > >> and/or extremely efficient. This one satisfies
neither.
>> >> >> > > >>
>> >> >> > > >> Thanks,
>> >> >> > > >>
>> >> >> > > >> Alessandro
>> >> >> > > >>
>> >> >> > >
>> >> >> > >
>> >> >> >
>> >> >> >
>> >> >> > --
>> >> >> >    Claudio Martella
>> >> >> >    claudio.martella@gmail.com
>> >> >> >
>> >> >>
>> >> >
>> >> >
>> >> >
>> >> >--
>> >> >
>> >> >  -jake
>> >>
>> >>
>> >
>> >
>> >--
>> >
>> >  -jake
>>
>>
>
>
>-- 
>
>  -jake


Mime
View raw message