giraph-dev mailing list archives

Site index · List index
Message view « Date » · « Thread »
Top « Date » · « Thread »
From Claudio Martella <claudio.marte...@gmail.com>
Subject Re: LongDoubleFloatDoubleVertex
Date Mon, 04 Mar 2013 22:52:47 GMT
Yep, I'm also +1 on the approach.


On Mon, Mar 4, 2013 at 11:47 PM, Gianmarco De Francisci Morales <
gdfm@apache.org> wrote:

> The interface you propose sounds very reasonable. +1 for that.
>
> For floats vs double, they consume much less memory, O(|E|) reduction, and
> there is basically never any need to have a double precision on edges
> because they are immutable.
> Anyway it's not a strong constraint, double edges would work as well.
>
> Cheers,
>
> --
> Gianmarco
>
>
> On Mon, Mar 4, 2013 at 9:58 PM, Alessandro Presta <alessandro@fb.com>
> wrote:
>
> > Maja and I have thought of a way we could include this type of
> > functionality without it being burdensome of the user.
> >
> > We could add the following interfaces:
> >
> > interface StrictRandomAccesVertexEdges extends VertexEdges {
> >   E getEdgeValue(I)
> > }
> >
> > interface MultiRandomAccessVertexEdges extends VertexEdges {
> >   Iterable<E> getAllEdgeValues(I)
> > }
> >
> > We could add the following to the Vertex interface:
> >
> > // Returns the first edge value for a given target vertex id (or null if
> > none)
> > E getEdgeValue(I targetVertexId) {
> >   // if we have an instance of StrictRandomAccessVertexEdges, we
> delegate;
> >   // otherwise, we iterate over the edges to find the first one
> > }
> >
> > // Returns an Iterable over all edge values for a given target vertex id
> > (only useful for multigraphs)
> > Iterable<E> getAllEdgeValues(I targetVertexId) {
> >   // if we have an instance of MultiRandomAccessVertexEdges, we delegate;
> >   // otherwise, we wrap the edges into an Iterable that filters for the
> > matching ones
> > }
> >
> > This way, only VertexEdges implementations where it makes sense would
> > define the random-access methods, and a default (slow!) implementation
> > would be available in all other cases.
> >
> > How does this sound? I could include it in GIRAPH-528.
> >
> > On 3/3/13 4:57 PM, "Alessandro Presta" <alessandro@fb.com> wrote:
> >
> > >I don't contest the graph-matrix duality, but I'm not yet convinced that
> > >you need random access to the edges for the scenarios you mentioned.
> > >PageRank and label propagation, two of the most common applications of
> our
> > >framework, have indeed a linear algebraic formulation, the bulk of which
> > >is always matrix multiplication. The most natural and efficient (if not
> > >the only) implementations in our model involve simply iterating over the
> > >edges.
> > >As for your "dot products" example, maybe it would help if you could be
> a
> > >bit more specific: what are your vertices and edges? How do we get the
> > >"input vectors"?. Everything should be expressed in terms of vertex
> > >values, edges, and messages, otherwise it doesn't fit the current Giraph
> > >model anyway.
> > >In any case, to compute a dot product efficiently you don't need random
> > >access to both vectors: if you have the non-zero coordinates of vector A
> > >(equivalent of the edge list), and random access to vector B, you
> iterate
> > >over A and lookup in B. This is efficient.
> > >
> > >> I'm not sure what this would correspond to in graphland, but I can
> > >> certainly see
> > >> wanting to have a big in-memory matrix which you can compute dot
> > >>products
> > >> of vectors with each row of it, and Giraph would do this very
> > >>efficiently
> > >> (but some
> > >> of the implementations of this would assume you had random access to
> the
> > >> edges (meaning edgeId and edgeWeight of each vertex).
> > >
> > >
> > >Here we are talking about a method with signature
> > >
> > >    EdgeValueType getEdgeValue(VertexIdType targetVertexId)
> > >
> > >or the multigraph version
> > >
> > >    Iterable<EdgeValueType> getEdgeValues(VertexIdType targetVertexId)
> > >
> > >I'm looking for examples of algorithms that can't be implemented in
> Giraph
> > >because of the lack of this method, and would otherwise be a good fit.
> > >
> > >Having said that, I think there *is* still a case for hash-map-backed
> > >edges, and that is when we need to guarantee a "strict graph" (i.e. no
> > >parallel edges) or when our algorithm uses remote edge removal requests.
> > >So I'm not against having an optimized LongDoubleHashMapEdges (and
> > >LongNullHashSetEdges for the unweighted case) included, provided that
> it's
> > >correct and efficient.
> > >I will make sure I include it. Now it's a matter of picking a primitive
> > >collections library among HPPC, fastutil, Mahout, Trove, etc...
> > >
> > >On 3/1/13 4:12 PM, "Jake Mannix" <jake.mannix@gmail.com> wrote:
> > >
> > >>Sorry to have missed out on some of this conversation - had some work
> > >>stuff
> > >>interrupt (how dare it!)
> > >>
> > >>On Thu, Feb 28, 2013 at 8:13 PM, Alessandro Presta
> > >><alessandro@fb.com>wrote:
> > >>
> > >>> On 2/28/13 5:08 PM, "Jake Mannix" <jake.mannix@gmail.com> wrote:
> > >>>
> > >>> >On Thu, Feb 28, 2013 at 4:18 PM, Alessandro Presta
> > >>> ><alessandro@fb.com>wrote:
> > >>> >
> > >>> >> It's not like it causes problems, it's just that it's a pretty
big
> > >>> >> dependency to justify for a small example.
> > >>> >>
> > >>> >> As for the motivation, if your point is to prove the framework's
> > >>> >> superiority in some context, then you can use the simplest
> possible
> > >>> >> implementation (ArrayList).
> > >>> >>
> > >>> >
> > >>> >This takes up LOTS of memory.  Primitives rule, objects drool
> > >>>(memory).
> > >>>
> > >>> Ok, but having to copy the keys and values to external arrays (which
> is
> > >>> what you have to do with Mahout's hash map) is even worse. A good
> > >>> implementation of (long, double) edges (e.g. for RandomWalkVertex)
is
> > >>> primitive arrays.
> > >>>
> > >>
> > >>So it is certainly true that the *typical* usage of iteration over
> Mahout
> > >>maps
> > >>is by doing this copy of internal values (which is not efficient, no),
> > >>it's
> > >>not necessary: OpenLongFloatHashMap.forEachPair(LongFloatProcedure p)
> > >>passes in a callback which operates directly on the underlying
> primitive
> > >>arrays.
> > >>
> > >>
> > >>>
> > >>> >
> > >>> >
> > >>> >> The Giraph framework is all about iterating over edges, so
an
> > >>> >> implementation that doesn't support that with reasonable
> efficiency
> > >>> >> doesn't make a lot of sense to me.
> > >>> >>
> > >>> >> It also follows that hash map-based implementations only make
> sense
> > >>>for
> > >>> >> algorithms that make use of mutations.
> > >>> >>
> > >>> >
> > >>> >Agreed, maps aren't absolutely necessary for immutable graphs.
 But
> > >>>random
> > >>> >access to collections of integers _is_ necessary for many
> algorithms.
> > >>> >It's
> > >>> >not all just iteration over lists.
> > >>>
> > >>> Can you give me some concrete examples of algorithms where random
> > >>>access
> > >>> to the edges is required?
> > >>> I'm really interested in this, because I'm considering killing
> > >>> getEdgeValue() if there are no use cases.
> > >>>
> > >>
> > >>So I'm not a graph person, I think in terms of matrices, but since
> every
> > >>graph
> > >>has an adjecency matrix, and every matrix is the adjacency matrix of
> some
> > >>(possibly
> > >>bipartite) graph, lets pretend they're basically the same:
> > >>
> > >>If you load a graph into Giraph, and want to compute the matrix product
> > >>of
> > >>this
> > >>graph with collection of input vectors, then you may want to take each
> of
> > >>these
> > >>input vectors and compute their dot products with the vertices of the
> > >>graph
> > >>(to
> > >>mix my metaphors: each vertex is essentially a row or column of the
> > >>matrix),
> > >>which can be done by either iterating over the input vector and looking
> > >>up
> > >>randomly for nonzero entries in the vertex, or the reverse.
> > >>
> > >>I'm not sure what this would correspond to in graphland, but I can
> > >>certainly see
> > >>wanting to have a big in-memory matrix which you can compute dot
> products
> > >>of vectors with each row of it, and Giraph would do this very
> efficiently
> > >>(but some
> > >>of the implementations of this would assume you had random access to
> the
> > >>edges (meaning edgeId and edgeWeight of each vertex).
> > >>
> > >>
> > >>>
> > >>> >
> > >>> >
> > >>> >> That said, something like a Trove hash map would probably
be more
> > >>> >> appropriate (more efficient than the standard Java HashMap,
at the
> > >>> >>expense
> > >>> >> of generality). That could be a good candidate for a
> > >>> >> LongDoubleHashMapEdges implementation.
> > >>> >> I can give that a shot if it sounds good.
> > >>> >>
> > >>> >
> > >>> >Trove is LGPL, IIRC, so that doesn't work in Apache projects.
> > >>>
> > >>> Whoops, totally missed that part.
> > >>>
> > >>> >
> > >>> >What does Giraph depend on of Mahout now?  Just mahout-collections?
> > >>> >That's
> > >>> >not a very big dependency, and has all sorts of
> primitive-to-primitive
> > >>> >collections,
> > >>> >and from what I've seen benchmarked, just as good or better than
> > >>>Trove.
> > >>> > Carrot2's
> > >>> >hppc may be better yet, but I'm not sure if that is stably released
> > >>>yet.
> > >>>
> > >>> I'll take a look at HPPC and Mahout collections then. They all seem
> to
> > >>> provide the same stuff, so I'll consider benchmarks and convenience
> of
> > >>>the
> > >>> API. Thanks for the pointers.
> > >>>
> > >>> >
> > >>> >
> > >>> >>
> > >>> >> On 2/28/13 4:03 PM, "Jake Mannix" <jake.mannix@gmail.com>
wrote:
> > >>> >>
> > >>> >> >Is the mahout dependency causing problems?
> > >>> >> >
> > >>> >> >It would be nice if we could actually implement some of
the
> > >>>algorithms
> > >>> >> >that
> > >>> >> >Mahout does via map-reduce in Giraph's BSP formalism,
to show off
> > >>>how
> > >>> >>it
> > >>> >> >improves things.  Using the Mahout primitives can show
that it's
> > >>>not
> > >>> >>about
> > >>> >> >the inner loop implementation, but the framework itself...
> > >>> >> >
> > >>> >> >
> > >>> >> >On Thu, Feb 28, 2013 at 1:55 PM, Eli Reisman
> > >>> >> ><apache.mailbox@gmail.com>wrote:
> > >>> >> >
> > >>> >> >> I like the idea of refactoring it into something
more
> appropriate
> > >>> >>for us
> > >>> >> >> and ditching the Mahout dep. Good looking out.
> > >>> >> >>
> > >>> >> >>
> > >>> >> >> On Thu, Feb 28, 2013 at 10:15 AM, Claudio Martella
<
> > >>> >> >> claudio.martella@gmail.com> wrote:
> > >>> >> >>
> > >>> >> >> > I agree, at this point we could have a RandomWalkVertex
with
> > >>>edge
> > >>> >> >>values,
> > >>> >> >> > and a "null-edged" vertex for the PR benchmarks.
> > >>> >> >> > We make everybody happy and avoid code duplication.
> > >>> >> >> >
> > >>> >> >> >
> > >>> >> >> > On Thu, Feb 28, 2013 at 7:12 PM, Alessandro
Presta
> > >>> >><alessandro@fb.com
> > >>> >> >> > >wrote:
> > >>> >> >> >
> > >>> >> >> > > Hi Gianmarco,
> > >>> >> >> > >
> > >>> >> >> > > Yes, there will be more efficient implementations.
> > >>> >> >> > > In the redesign I'm working on (GIRAPH-528),
there will be
> > >>>only
> > >>> >>one
> > >>> >> >> > Vertex
> > >>> >> >> > > class and edge storage is delegated to
a VertexEdges class.
> > >>> >> >> > > So far I'm adding some generic implementations
> > >>>(ByteArrayEdges,
> > >>> >> >> > > ArrayListEdges, HashMapEdges) that work
for all types, and
> > >>>some
> > >>> >> >> optimized
> > >>> >> >> > > ones (LongDoubleArrayEdges, LongNullArrayEdges).
> > >>> >> >> > >
> > >>> >> >> > > Do you specifically need edge values to
be float while the
> > >>>other
> > >>> >> >>types
> > >>> >> >> > are
> > >>> >> >> > > double?
> > >>> >> >> > > It seems to me it would make sense to change
> RandomWalkVertex
> > >>>to
> > >>> >>use
> > >>> >> >> > > double edge values instead, and avoid code
duplication
> (i.e.
> > >>> >>adding
> > >>> >> >>a
> > >>> >> >> > > LongFloatArrayEdges that's basically the
same). We're not
> > >>>Trove
> > >>> >> >>after
> > >>> >> >> > all.
> > >>> >> >> > > Makes sense?
> > >>> >> >> > >
> > >>> >> >> > > Thanks for the feedback,
> > >>> >> >> > >
> > >>> >> >> > > Alessandro
> > >>> >> >> > >
> > >>> >> >> > >
> > >>> >> >> > > On 2/28/13 1:54 AM, "Gianmarco De Francisci
Morales"
> > >>> >> >><gdfm@apache.org>
> > >>> >> >> > > wrote:
> > >>> >> >> > >
> > >>> >> >> > > >Hi,
> > >>> >> >> > > >
> > >>> >> >> > > >Maybe the specific implementation can
be thrown away, but
> > >>> >> >>personally I
> > >>> >> >> > > >feel
> > >>> >> >> > > >very strongly for the need of a good
LongDoubleFloatDouble
> > >>> >>vertex.
> > >>> >> >> > > >It's the base for any serious random
walk algorithm.
> > >>> >> >> > > >
> > >>> >> >> > > >I would call for a refactoring rather
than a removal.
> > >>> >> >> > > >
> > >>> >> >> > > >Just my 2c.
> > >>> >> >> > > >
> > >>> >> >> > > >Cheers,
> > >>> >> >> > > >
> > >>> >> >> > > >--
> > >>> >> >> > > >Gianmarco
> > >>> >> >> > > >
> > >>> >> >> > > >
> > >>> >> >> > > >On Thu, Feb 28, 2013 at 7:54 AM, Alessandro
Presta
> > >>> >> >> > > ><alessandro@fb.com>wrote:
> > >>> >> >> > > >
> > >>> >> >> > > >> Hi all,
> > >>> >> >> > > >>
> > >>> >> >> > > >> Does anyone feel strongly for
> LongDoubleFloatDoubleVertex?
> > >>> >> >> > > >> Reasons why I think it should
be removed:
> > >>> >> >> > > >>
> > >>> >> >> > > >>   1.  Right now it's incorrect
(returns target vertex id
> > >>>as
> > >>> >>edge
> > >>> >> >> > value).
> > >>> >> >> > > >>   2.  Iteration will always be
inefficient, since the
> > >>> >>underlying
> > >>> >> >> > Mahout
> > >>> >> >> > > >> open-addressing hash map implementation
doesn't provide
> > >>> >> >>iterators.
> > >>> >> >> It
> > >>> >> >> > > >> provides a way to copy the keys
and values to external
> > >>> >> >>arrays/lists.
> > >>> >> >> > > >>   3.  It's the only reason why
we have Mahout as a
> > >>>dependency.
> > >>> >> >> > > >>
> > >>> >> >> > > >> I think we should strive to provide
model
> implementations
> > >>>that
> > >>> >> >>are
> > >>> >> >> > > >>generic
> > >>> >> >> > > >> and/or extremely efficient. This
one satisfies neither.
> > >>> >> >> > > >>
> > >>> >> >> > > >> Thanks,
> > >>> >> >> > > >>
> > >>> >> >> > > >> Alessandro
> > >>> >> >> > > >>
> > >>> >> >> > >
> > >>> >> >> > >
> > >>> >> >> >
> > >>> >> >> >
> > >>> >> >> > --
> > >>> >> >> >    Claudio Martella
> > >>> >> >> >    claudio.martella@gmail.com
> > >>> >> >> >
> > >>> >> >>
> > >>> >> >
> > >>> >> >
> > >>> >> >
> > >>> >> >--
> > >>> >> >
> > >>> >> >  -jake
> > >>> >>
> > >>> >>
> > >>> >
> > >>> >
> > >>> >--
> > >>> >
> > >>> >  -jake
> > >>>
> > >>>
> > >>
> > >>
> > >>--
> > >>
> > >>  -jake
> > >
> >
> >
>



-- 
   Claudio Martella
   claudio.martella@gmail.com

Mime
  • Unnamed multipart/alternative (inline, None, 0 bytes)
View raw message