giraph-dev mailing list archives

Site index · List index
Message view « Date » · « Thread »
Top « Date » · « Thread »
From Alessandro Presta <alessan...@fb.com>
Subject Re: LongDoubleFloatDoubleVertex
Date Mon, 04 Mar 2013 22:52:26 GMT


On 3/4/13 2:47 PM, "Gianmarco De Francisci Morales" <gdfm@apache.org>
wrote:

>The interface you propose sounds very reasonable. +1 for that.
>
>For floats vs double, they consume much less memory, O(|E|) reduction, and
>there is basically never any need to have a double precision on edges
>because they are immutable.
>Anyway it's not a strong constraint, double edges would work as well.

Sure, makes sense. The question here is whether we should be providing all
possible combinations (basically duplicating a lot of code), or let the
user copy our model implementations.
If there is enough interest, we can always add all combinations of
(Long/Int, Double/Float/Null, Array/HashMap), although it would be tiring
and still not comprehensive (what about String ids?).

>
>Cheers,
>
>--
>Gianmarco
>
>
>On Mon, Mar 4, 2013 at 9:58 PM, Alessandro Presta <alessandro@fb.com>
>wrote:
>
>> Maja and I have thought of a way we could include this type of
>> functionality without it being burdensome of the user.
>>
>> We could add the following interfaces:
>>
>> interface StrictRandomAccesVertexEdges extends VertexEdges {
>>   E getEdgeValue(I)
>> }
>>
>> interface MultiRandomAccessVertexEdges extends VertexEdges {
>>   Iterable<E> getAllEdgeValues(I)
>> }
>>
>> We could add the following to the Vertex interface:
>>
>> // Returns the first edge value for a given target vertex id (or null if
>> none)
>> E getEdgeValue(I targetVertexId) {
>>   // if we have an instance of StrictRandomAccessVertexEdges, we
>>delegate;
>>   // otherwise, we iterate over the edges to find the first one
>> }
>>
>> // Returns an Iterable over all edge values for a given target vertex id
>> (only useful for multigraphs)
>> Iterable<E> getAllEdgeValues(I targetVertexId) {
>>   // if we have an instance of MultiRandomAccessVertexEdges, we
>>delegate;
>>   // otherwise, we wrap the edges into an Iterable that filters for the
>> matching ones
>> }
>>
>> This way, only VertexEdges implementations where it makes sense would
>> define the random-access methods, and a default (slow!) implementation
>> would be available in all other cases.
>>
>> How does this sound? I could include it in GIRAPH-528.
>>
>> On 3/3/13 4:57 PM, "Alessandro Presta" <alessandro@fb.com> wrote:
>>
>> >I don't contest the graph-matrix duality, but I'm not yet convinced
>>that
>> >you need random access to the edges for the scenarios you mentioned.
>> >PageRank and label propagation, two of the most common applications of
>>our
>> >framework, have indeed a linear algebraic formulation, the bulk of
>>which
>> >is always matrix multiplication. The most natural and efficient (if not
>> >the only) implementations in our model involve simply iterating over
>>the
>> >edges.
>> >As for your "dot products" example, maybe it would help if you could
>>be a
>> >bit more specific: what are your vertices and edges? How do we get the
>> >"input vectors"?. Everything should be expressed in terms of vertex
>> >values, edges, and messages, otherwise it doesn't fit the current
>>Giraph
>> >model anyway.
>> >In any case, to compute a dot product efficiently you don't need random
>> >access to both vectors: if you have the non-zero coordinates of vector
>>A
>> >(equivalent of the edge list), and random access to vector B, you
>>iterate
>> >over A and lookup in B. This is efficient.
>> >
>> >> I'm not sure what this would correspond to in graphland, but I can
>> >> certainly see
>> >> wanting to have a big in-memory matrix which you can compute dot
>> >>products
>> >> of vectors with each row of it, and Giraph would do this very
>> >>efficiently
>> >> (but some
>> >> of the implementations of this would assume you had random access to
>>the
>> >> edges (meaning edgeId and edgeWeight of each vertex).
>> >
>> >
>> >Here we are talking about a method with signature
>> >
>> >    EdgeValueType getEdgeValue(VertexIdType targetVertexId)
>> >
>> >or the multigraph version
>> >
>> >    Iterable<EdgeValueType> getEdgeValues(VertexIdType targetVertexId)
>> >
>> >I'm looking for examples of algorithms that can't be implemented in
>>Giraph
>> >because of the lack of this method, and would otherwise be a good fit.
>> >
>> >Having said that, I think there *is* still a case for hash-map-backed
>> >edges, and that is when we need to guarantee a "strict graph" (i.e. no
>> >parallel edges) or when our algorithm uses remote edge removal
>>requests.
>> >So I'm not against having an optimized LongDoubleHashMapEdges (and
>> >LongNullHashSetEdges for the unweighted case) included, provided that
>>it's
>> >correct and efficient.
>> >I will make sure I include it. Now it's a matter of picking a primitive
>> >collections library among HPPC, fastutil, Mahout, Trove, etc...
>> >
>> >On 3/1/13 4:12 PM, "Jake Mannix" <jake.mannix@gmail.com> wrote:
>> >
>> >>Sorry to have missed out on some of this conversation - had some work
>> >>stuff
>> >>interrupt (how dare it!)
>> >>
>> >>On Thu, Feb 28, 2013 at 8:13 PM, Alessandro Presta
>> >><alessandro@fb.com>wrote:
>> >>
>> >>> On 2/28/13 5:08 PM, "Jake Mannix" <jake.mannix@gmail.com> wrote:
>> >>>
>> >>> >On Thu, Feb 28, 2013 at 4:18 PM, Alessandro Presta
>> >>> ><alessandro@fb.com>wrote:
>> >>> >
>> >>> >> It's not like it causes problems, it's just that it's a pretty
>>big
>> >>> >> dependency to justify for a small example.
>> >>> >>
>> >>> >> As for the motivation, if your point is to prove the framework's
>> >>> >> superiority in some context, then you can use the simplest
>>possible
>> >>> >> implementation (ArrayList).
>> >>> >>
>> >>> >
>> >>> >This takes up LOTS of memory.  Primitives rule, objects drool
>> >>>(memory).
>> >>>
>> >>> Ok, but having to copy the keys and values to external arrays
>>(which is
>> >>> what you have to do with Mahout's hash map) is even worse. A good
>> >>> implementation of (long, double) edges (e.g. for RandomWalkVertex)
>>is
>> >>> primitive arrays.
>> >>>
>> >>
>> >>So it is certainly true that the *typical* usage of iteration over
>>Mahout
>> >>maps
>> >>is by doing this copy of internal values (which is not efficient, no),
>> >>it's
>> >>not necessary: OpenLongFloatHashMap.forEachPair(LongFloatProcedure p)
>> >>passes in a callback which operates directly on the underlying
>>primitive
>> >>arrays.
>> >>
>> >>
>> >>>
>> >>> >
>> >>> >
>> >>> >> The Giraph framework is all about iterating over edges, so
an
>> >>> >> implementation that doesn't support that with reasonable
>>efficiency
>> >>> >> doesn't make a lot of sense to me.
>> >>> >>
>> >>> >> It also follows that hash map-based implementations only make
>>sense
>> >>>for
>> >>> >> algorithms that make use of mutations.
>> >>> >>
>> >>> >
>> >>> >Agreed, maps aren't absolutely necessary for immutable graphs. 
But
>> >>>random
>> >>> >access to collections of integers _is_ necessary for many
>>algorithms.
>> >>> >It's
>> >>> >not all just iteration over lists.
>> >>>
>> >>> Can you give me some concrete examples of algorithms where random
>> >>>access
>> >>> to the edges is required?
>> >>> I'm really interested in this, because I'm considering killing
>> >>> getEdgeValue() if there are no use cases.
>> >>>
>> >>
>> >>So I'm not a graph person, I think in terms of matrices, but since
>>every
>> >>graph
>> >>has an adjecency matrix, and every matrix is the adjacency matrix of
>>some
>> >>(possibly
>> >>bipartite) graph, lets pretend they're basically the same:
>> >>
>> >>If you load a graph into Giraph, and want to compute the matrix
>>product
>> >>of
>> >>this
>> >>graph with collection of input vectors, then you may want to take
>>each of
>> >>these
>> >>input vectors and compute their dot products with the vertices of the
>> >>graph
>> >>(to
>> >>mix my metaphors: each vertex is essentially a row or column of the
>> >>matrix),
>> >>which can be done by either iterating over the input vector and
>>looking
>> >>up
>> >>randomly for nonzero entries in the vertex, or the reverse.
>> >>
>> >>I'm not sure what this would correspond to in graphland, but I can
>> >>certainly see
>> >>wanting to have a big in-memory matrix which you can compute dot
>>products
>> >>of vectors with each row of it, and Giraph would do this very
>>efficiently
>> >>(but some
>> >>of the implementations of this would assume you had random access to
>>the
>> >>edges (meaning edgeId and edgeWeight of each vertex).
>> >>
>> >>
>> >>>
>> >>> >
>> >>> >
>> >>> >> That said, something like a Trove hash map would probably be
more
>> >>> >> appropriate (more efficient than the standard Java HashMap,
at
>>the
>> >>> >>expense
>> >>> >> of generality). That could be a good candidate for a
>> >>> >> LongDoubleHashMapEdges implementation.
>> >>> >> I can give that a shot if it sounds good.
>> >>> >>
>> >>> >
>> >>> >Trove is LGPL, IIRC, so that doesn't work in Apache projects.
>> >>>
>> >>> Whoops, totally missed that part.
>> >>>
>> >>> >
>> >>> >What does Giraph depend on of Mahout now?  Just mahout-collections?
>> >>> >That's
>> >>> >not a very big dependency, and has all sorts of
>>primitive-to-primitive
>> >>> >collections,
>> >>> >and from what I've seen benchmarked, just as good or better than
>> >>>Trove.
>> >>> > Carrot2's
>> >>> >hppc may be better yet, but I'm not sure if that is stably released
>> >>>yet.
>> >>>
>> >>> I'll take a look at HPPC and Mahout collections then. They all seem
>>to
>> >>> provide the same stuff, so I'll consider benchmarks and convenience
>>of
>> >>>the
>> >>> API. Thanks for the pointers.
>> >>>
>> >>> >
>> >>> >
>> >>> >>
>> >>> >> On 2/28/13 4:03 PM, "Jake Mannix" <jake.mannix@gmail.com>
wrote:
>> >>> >>
>> >>> >> >Is the mahout dependency causing problems?
>> >>> >> >
>> >>> >> >It would be nice if we could actually implement some of
the
>> >>>algorithms
>> >>> >> >that
>> >>> >> >Mahout does via map-reduce in Giraph's BSP formalism, to
show
>>off
>> >>>how
>> >>> >>it
>> >>> >> >improves things.  Using the Mahout primitives can show
that it's
>> >>>not
>> >>> >>about
>> >>> >> >the inner loop implementation, but the framework itself...
>> >>> >> >
>> >>> >> >
>> >>> >> >On Thu, Feb 28, 2013 at 1:55 PM, Eli Reisman
>> >>> >> ><apache.mailbox@gmail.com>wrote:
>> >>> >> >
>> >>> >> >> I like the idea of refactoring it into something more
>>appropriate
>> >>> >>for us
>> >>> >> >> and ditching the Mahout dep. Good looking out.
>> >>> >> >>
>> >>> >> >>
>> >>> >> >> On Thu, Feb 28, 2013 at 10:15 AM, Claudio Martella
<
>> >>> >> >> claudio.martella@gmail.com> wrote:
>> >>> >> >>
>> >>> >> >> > I agree, at this point we could have a RandomWalkVertex
with
>> >>>edge
>> >>> >> >>values,
>> >>> >> >> > and a "null-edged" vertex for the PR benchmarks.
>> >>> >> >> > We make everybody happy and avoid code duplication.
>> >>> >> >> >
>> >>> >> >> >
>> >>> >> >> > On Thu, Feb 28, 2013 at 7:12 PM, Alessandro Presta
>> >>> >><alessandro@fb.com
>> >>> >> >> > >wrote:
>> >>> >> >> >
>> >>> >> >> > > Hi Gianmarco,
>> >>> >> >> > >
>> >>> >> >> > > Yes, there will be more efficient implementations.
>> >>> >> >> > > In the redesign I'm working on (GIRAPH-528),
there will be
>> >>>only
>> >>> >>one
>> >>> >> >> > Vertex
>> >>> >> >> > > class and edge storage is delegated to a
VertexEdges
>>class.
>> >>> >> >> > > So far I'm adding some generic implementations
>> >>>(ByteArrayEdges,
>> >>> >> >> > > ArrayListEdges, HashMapEdges) that work
for all types, and
>> >>>some
>> >>> >> >> optimized
>> >>> >> >> > > ones (LongDoubleArrayEdges, LongNullArrayEdges).
>> >>> >> >> > >
>> >>> >> >> > > Do you specifically need edge values to
be float while the
>> >>>other
>> >>> >> >>types
>> >>> >> >> > are
>> >>> >> >> > > double?
>> >>> >> >> > > It seems to me it would make sense to change
>>RandomWalkVertex
>> >>>to
>> >>> >>use
>> >>> >> >> > > double edge values instead, and avoid code
duplication
>>(i.e.
>> >>> >>adding
>> >>> >> >>a
>> >>> >> >> > > LongFloatArrayEdges that's basically the
same). We're not
>> >>>Trove
>> >>> >> >>after
>> >>> >> >> > all.
>> >>> >> >> > > Makes sense?
>> >>> >> >> > >
>> >>> >> >> > > Thanks for the feedback,
>> >>> >> >> > >
>> >>> >> >> > > Alessandro
>> >>> >> >> > >
>> >>> >> >> > >
>> >>> >> >> > > On 2/28/13 1:54 AM, "Gianmarco De Francisci
Morales"
>> >>> >> >><gdfm@apache.org>
>> >>> >> >> > > wrote:
>> >>> >> >> > >
>> >>> >> >> > > >Hi,
>> >>> >> >> > > >
>> >>> >> >> > > >Maybe the specific implementation can
be thrown away, but
>> >>> >> >>personally I
>> >>> >> >> > > >feel
>> >>> >> >> > > >very strongly for the need of a good
>>LongDoubleFloatDouble
>> >>> >>vertex.
>> >>> >> >> > > >It's the base for any serious random
walk algorithm.
>> >>> >> >> > > >
>> >>> >> >> > > >I would call for a refactoring rather
than a removal.
>> >>> >> >> > > >
>> >>> >> >> > > >Just my 2c.
>> >>> >> >> > > >
>> >>> >> >> > > >Cheers,
>> >>> >> >> > > >
>> >>> >> >> > > >--
>> >>> >> >> > > >Gianmarco
>> >>> >> >> > > >
>> >>> >> >> > > >
>> >>> >> >> > > >On Thu, Feb 28, 2013 at 7:54 AM, Alessandro
Presta
>> >>> >> >> > > ><alessandro@fb.com>wrote:
>> >>> >> >> > > >
>> >>> >> >> > > >> Hi all,
>> >>> >> >> > > >>
>> >>> >> >> > > >> Does anyone feel strongly for
>>LongDoubleFloatDoubleVertex?
>> >>> >> >> > > >> Reasons why I think it should be
removed:
>> >>> >> >> > > >>
>> >>> >> >> > > >>   1.  Right now it's incorrect
(returns target vertex
>>id
>> >>>as
>> >>> >>edge
>> >>> >> >> > value).
>> >>> >> >> > > >>   2.  Iteration will always be
inefficient, since the
>> >>> >>underlying
>> >>> >> >> > Mahout
>> >>> >> >> > > >> open-addressing hash map implementation
doesn't provide
>> >>> >> >>iterators.
>> >>> >> >> It
>> >>> >> >> > > >> provides a way to copy the keys
and values to external
>> >>> >> >>arrays/lists.
>> >>> >> >> > > >>   3.  It's the only reason why
we have Mahout as a
>> >>>dependency.
>> >>> >> >> > > >>
>> >>> >> >> > > >> I think we should strive to provide
model
>>implementations
>> >>>that
>> >>> >> >>are
>> >>> >> >> > > >>generic
>> >>> >> >> > > >> and/or extremely efficient. This
one satisfies neither.
>> >>> >> >> > > >>
>> >>> >> >> > > >> Thanks,
>> >>> >> >> > > >>
>> >>> >> >> > > >> Alessandro
>> >>> >> >> > > >>
>> >>> >> >> > >
>> >>> >> >> > >
>> >>> >> >> >
>> >>> >> >> >
>> >>> >> >> > --
>> >>> >> >> >    Claudio Martella
>> >>> >> >> >    claudio.martella@gmail.com
>> >>> >> >> >
>> >>> >> >>
>> >>> >> >
>> >>> >> >
>> >>> >> >
>> >>> >> >--
>> >>> >> >
>> >>> >> >  -jake
>> >>> >>
>> >>> >>
>> >>> >
>> >>> >
>> >>> >--
>> >>> >
>> >>> >  -jake
>> >>>
>> >>>
>> >>
>> >>
>> >>--
>> >>
>> >>  -jake
>> >
>>
>>


Mime
View raw message