giraph-dev mailing list archives

Site index · List index
Message view « Date » · « Thread »
Top « Date » · « Thread »
From Jake Mannix <jake.man...@gmail.com>
Subject Re: LongDoubleFloatDoubleVertex
Date Tue, 05 Mar 2013 00:44:18 GMT
We could *shudder* write templates for basic abstract primitive Vertex
classes (Abstract$KeyType$ValueTypeVertex.java.t ...) a la Mahout
collections...


On Mon, Mar 4, 2013 at 3:03 PM, Claudio Martella <claudio.martella@gmail.com
> wrote:

> I guess we could do some crazy reflection code and get the types as
> parameters... i also do feel that sometimes Long ids is overkill, for
> example.
>
>
> On Mon, Mar 4, 2013 at 11:52 PM, Alessandro Presta <alessandro@fb.com
> >wrote:
>
> >
> >
> > On 3/4/13 2:47 PM, "Gianmarco De Francisci Morales" <gdfm@apache.org>
> > wrote:
> >
> > >The interface you propose sounds very reasonable. +1 for that.
> > >
> > >For floats vs double, they consume much less memory, O(|E|) reduction,
> and
> > >there is basically never any need to have a double precision on edges
> > >because they are immutable.
> > >Anyway it's not a strong constraint, double edges would work as well.
> >
> > Sure, makes sense. The question here is whether we should be providing
> all
> > possible combinations (basically duplicating a lot of code), or let the
> > user copy our model implementations.
> > If there is enough interest, we can always add all combinations of
> > (Long/Int, Double/Float/Null, Array/HashMap), although it would be tiring
> > and still not comprehensive (what about String ids?).
> >
> > >
> > >Cheers,
> > >
> > >--
> > >Gianmarco
> > >
> > >
> > >On Mon, Mar 4, 2013 at 9:58 PM, Alessandro Presta <alessandro@fb.com>
> > >wrote:
> > >
> > >> Maja and I have thought of a way we could include this type of
> > >> functionality without it being burdensome of the user.
> > >>
> > >> We could add the following interfaces:
> > >>
> > >> interface StrictRandomAccesVertexEdges extends VertexEdges {
> > >>   E getEdgeValue(I)
> > >> }
> > >>
> > >> interface MultiRandomAccessVertexEdges extends VertexEdges {
> > >>   Iterable<E> getAllEdgeValues(I)
> > >> }
> > >>
> > >> We could add the following to the Vertex interface:
> > >>
> > >> // Returns the first edge value for a given target vertex id (or null
> if
> > >> none)
> > >> E getEdgeValue(I targetVertexId) {
> > >>   // if we have an instance of StrictRandomAccessVertexEdges, we
> > >>delegate;
> > >>   // otherwise, we iterate over the edges to find the first one
> > >> }
> > >>
> > >> // Returns an Iterable over all edge values for a given target vertex
> id
> > >> (only useful for multigraphs)
> > >> Iterable<E> getAllEdgeValues(I targetVertexId) {
> > >>   // if we have an instance of MultiRandomAccessVertexEdges, we
> > >>delegate;
> > >>   // otherwise, we wrap the edges into an Iterable that filters for
> the
> > >> matching ones
> > >> }
> > >>
> > >> This way, only VertexEdges implementations where it makes sense would
> > >> define the random-access methods, and a default (slow!) implementation
> > >> would be available in all other cases.
> > >>
> > >> How does this sound? I could include it in GIRAPH-528.
> > >>
> > >> On 3/3/13 4:57 PM, "Alessandro Presta" <alessandro@fb.com> wrote:
> > >>
> > >> >I don't contest the graph-matrix duality, but I'm not yet convinced
> > >>that
> > >> >you need random access to the edges for the scenarios you mentioned.
> > >> >PageRank and label propagation, two of the most common applications
> of
> > >>our
> > >> >framework, have indeed a linear algebraic formulation, the bulk of
> > >>which
> > >> >is always matrix multiplication. The most natural and efficient (if
> not
> > >> >the only) implementations in our model involve simply iterating over
> > >>the
> > >> >edges.
> > >> >As for your "dot products" example, maybe it would help if you could
> > >>be a
> > >> >bit more specific: what are your vertices and edges? How do we get
> the
> > >> >"input vectors"?. Everything should be expressed in terms of vertex
> > >> >values, edges, and messages, otherwise it doesn't fit the current
> > >>Giraph
> > >> >model anyway.
> > >> >In any case, to compute a dot product efficiently you don't need
> random
> > >> >access to both vectors: if you have the non-zero coordinates of
> vector
> > >>A
> > >> >(equivalent of the edge list), and random access to vector B, you
> > >>iterate
> > >> >over A and lookup in B. This is efficient.
> > >> >
> > >> >> I'm not sure what this would correspond to in graphland, but I
can
> > >> >> certainly see
> > >> >> wanting to have a big in-memory matrix which you can compute dot
> > >> >>products
> > >> >> of vectors with each row of it, and Giraph would do this very
> > >> >>efficiently
> > >> >> (but some
> > >> >> of the implementations of this would assume you had random access
> to
> > >>the
> > >> >> edges (meaning edgeId and edgeWeight of each vertex).
> > >> >
> > >> >
> > >> >Here we are talking about a method with signature
> > >> >
> > >> >    EdgeValueType getEdgeValue(VertexIdType targetVertexId)
> > >> >
> > >> >or the multigraph version
> > >> >
> > >> >    Iterable<EdgeValueType> getEdgeValues(VertexIdType
> targetVertexId)
> > >> >
> > >> >I'm looking for examples of algorithms that can't be implemented in
> > >>Giraph
> > >> >because of the lack of this method, and would otherwise be a good
> fit.
> > >> >
> > >> >Having said that, I think there *is* still a case for hash-map-backed
> > >> >edges, and that is when we need to guarantee a "strict graph" (i.e.
> no
> > >> >parallel edges) or when our algorithm uses remote edge removal
> > >>requests.
> > >> >So I'm not against having an optimized LongDoubleHashMapEdges (and
> > >> >LongNullHashSetEdges for the unweighted case) included, provided that
> > >>it's
> > >> >correct and efficient.
> > >> >I will make sure I include it. Now it's a matter of picking a
> primitive
> > >> >collections library among HPPC, fastutil, Mahout, Trove, etc...
> > >> >
> > >> >On 3/1/13 4:12 PM, "Jake Mannix" <jake.mannix@gmail.com> wrote:
> > >> >
> > >> >>Sorry to have missed out on some of this conversation - had some
> work
> > >> >>stuff
> > >> >>interrupt (how dare it!)
> > >> >>
> > >> >>On Thu, Feb 28, 2013 at 8:13 PM, Alessandro Presta
> > >> >><alessandro@fb.com>wrote:
> > >> >>
> > >> >>> On 2/28/13 5:08 PM, "Jake Mannix" <jake.mannix@gmail.com>
wrote:
> > >> >>>
> > >> >>> >On Thu, Feb 28, 2013 at 4:18 PM, Alessandro Presta
> > >> >>> ><alessandro@fb.com>wrote:
> > >> >>> >
> > >> >>> >> It's not like it causes problems, it's just that
it's a pretty
> > >>big
> > >> >>> >> dependency to justify for a small example.
> > >> >>> >>
> > >> >>> >> As for the motivation, if your point is to prove
the
> framework's
> > >> >>> >> superiority in some context, then you can use the
simplest
> > >>possible
> > >> >>> >> implementation (ArrayList).
> > >> >>> >>
> > >> >>> >
> > >> >>> >This takes up LOTS of memory.  Primitives rule, objects
drool
> > >> >>>(memory).
> > >> >>>
> > >> >>> Ok, but having to copy the keys and values to external arrays
> > >>(which is
> > >> >>> what you have to do with Mahout's hash map) is even worse.
A good
> > >> >>> implementation of (long, double) edges (e.g. for RandomWalkVertex)
> > >>is
> > >> >>> primitive arrays.
> > >> >>>
> > >> >>
> > >> >>So it is certainly true that the *typical* usage of iteration over
> > >>Mahout
> > >> >>maps
> > >> >>is by doing this copy of internal values (which is not efficient,
> no),
> > >> >>it's
> > >> >>not necessary: OpenLongFloatHashMap.forEachPair(LongFloatProcedure
> p)
> > >> >>passes in a callback which operates directly on the underlying
> > >>primitive
> > >> >>arrays.
> > >> >>
> > >> >>
> > >> >>>
> > >> >>> >
> > >> >>> >
> > >> >>> >> The Giraph framework is all about iterating over
edges, so an
> > >> >>> >> implementation that doesn't support that with reasonable
> > >>efficiency
> > >> >>> >> doesn't make a lot of sense to me.
> > >> >>> >>
> > >> >>> >> It also follows that hash map-based implementations
only make
> > >>sense
> > >> >>>for
> > >> >>> >> algorithms that make use of mutations.
> > >> >>> >>
> > >> >>> >
> > >> >>> >Agreed, maps aren't absolutely necessary for immutable
graphs.
>  But
> > >> >>>random
> > >> >>> >access to collections of integers _is_ necessary for many
> > >>algorithms.
> > >> >>> >It's
> > >> >>> >not all just iteration over lists.
> > >> >>>
> > >> >>> Can you give me some concrete examples of algorithms where
random
> > >> >>>access
> > >> >>> to the edges is required?
> > >> >>> I'm really interested in this, because I'm considering killing
> > >> >>> getEdgeValue() if there are no use cases.
> > >> >>>
> > >> >>
> > >> >>So I'm not a graph person, I think in terms of matrices, but since
> > >>every
> > >> >>graph
> > >> >>has an adjecency matrix, and every matrix is the adjacency matrix
of
> > >>some
> > >> >>(possibly
> > >> >>bipartite) graph, lets pretend they're basically the same:
> > >> >>
> > >> >>If you load a graph into Giraph, and want to compute the matrix
> > >>product
> > >> >>of
> > >> >>this
> > >> >>graph with collection of input vectors, then you may want to take
> > >>each of
> > >> >>these
> > >> >>input vectors and compute their dot products with the vertices
of
> the
> > >> >>graph
> > >> >>(to
> > >> >>mix my metaphors: each vertex is essentially a row or column of
the
> > >> >>matrix),
> > >> >>which can be done by either iterating over the input vector and
> > >>looking
> > >> >>up
> > >> >>randomly for nonzero entries in the vertex, or the reverse.
> > >> >>
> > >> >>I'm not sure what this would correspond to in graphland, but I
can
> > >> >>certainly see
> > >> >>wanting to have a big in-memory matrix which you can compute dot
> > >>products
> > >> >>of vectors with each row of it, and Giraph would do this very
> > >>efficiently
> > >> >>(but some
> > >> >>of the implementations of this would assume you had random access
to
> > >>the
> > >> >>edges (meaning edgeId and edgeWeight of each vertex).
> > >> >>
> > >> >>
> > >> >>>
> > >> >>> >
> > >> >>> >
> > >> >>> >> That said, something like a Trove hash map would
probably be
> more
> > >> >>> >> appropriate (more efficient than the standard Java
HashMap, at
> > >>the
> > >> >>> >>expense
> > >> >>> >> of generality). That could be a good candidate for
a
> > >> >>> >> LongDoubleHashMapEdges implementation.
> > >> >>> >> I can give that a shot if it sounds good.
> > >> >>> >>
> > >> >>> >
> > >> >>> >Trove is LGPL, IIRC, so that doesn't work in Apache projects.
> > >> >>>
> > >> >>> Whoops, totally missed that part.
> > >> >>>
> > >> >>> >
> > >> >>> >What does Giraph depend on of Mahout now?  Just
> mahout-collections?
> > >> >>> >That's
> > >> >>> >not a very big dependency, and has all sorts of
> > >>primitive-to-primitive
> > >> >>> >collections,
> > >> >>> >and from what I've seen benchmarked, just as good or better
than
> > >> >>>Trove.
> > >> >>> > Carrot2's
> > >> >>> >hppc may be better yet, but I'm not sure if that is stably
> released
> > >> >>>yet.
> > >> >>>
> > >> >>> I'll take a look at HPPC and Mahout collections then. They
all
> seem
> > >>to
> > >> >>> provide the same stuff, so I'll consider benchmarks and
> convenience
> > >>of
> > >> >>>the
> > >> >>> API. Thanks for the pointers.
> > >> >>>
> > >> >>> >
> > >> >>> >
> > >> >>> >>
> > >> >>> >> On 2/28/13 4:03 PM, "Jake Mannix" <jake.mannix@gmail.com>
> wrote:
> > >> >>> >>
> > >> >>> >> >Is the mahout dependency causing problems?
> > >> >>> >> >
> > >> >>> >> >It would be nice if we could actually implement
some of the
> > >> >>>algorithms
> > >> >>> >> >that
> > >> >>> >> >Mahout does via map-reduce in Giraph's BSP formalism,
to show
> > >>off
> > >> >>>how
> > >> >>> >>it
> > >> >>> >> >improves things.  Using the Mahout primitives
can show that
> it's
> > >> >>>not
> > >> >>> >>about
> > >> >>> >> >the inner loop implementation, but the framework
itself...
> > >> >>> >> >
> > >> >>> >> >
> > >> >>> >> >On Thu, Feb 28, 2013 at 1:55 PM, Eli Reisman
> > >> >>> >> ><apache.mailbox@gmail.com>wrote:
> > >> >>> >> >
> > >> >>> >> >> I like the idea of refactoring it into something
more
> > >>appropriate
> > >> >>> >>for us
> > >> >>> >> >> and ditching the Mahout dep. Good looking
out.
> > >> >>> >> >>
> > >> >>> >> >>
> > >> >>> >> >> On Thu, Feb 28, 2013 at 10:15 AM, Claudio
Martella <
> > >> >>> >> >> claudio.martella@gmail.com> wrote:
> > >> >>> >> >>
> > >> >>> >> >> > I agree, at this point we could have
a RandomWalkVertex
> with
> > >> >>>edge
> > >> >>> >> >>values,
> > >> >>> >> >> > and a "null-edged" vertex for the PR
benchmarks.
> > >> >>> >> >> > We make everybody happy and avoid code
duplication.
> > >> >>> >> >> >
> > >> >>> >> >> >
> > >> >>> >> >> > On Thu, Feb 28, 2013 at 7:12 PM, Alessandro
Presta
> > >> >>> >><alessandro@fb.com
> > >> >>> >> >> > >wrote:
> > >> >>> >> >> >
> > >> >>> >> >> > > Hi Gianmarco,
> > >> >>> >> >> > >
> > >> >>> >> >> > > Yes, there will be more efficient
implementations.
> > >> >>> >> >> > > In the redesign I'm working on
(GIRAPH-528), there will
> be
> > >> >>>only
> > >> >>> >>one
> > >> >>> >> >> > Vertex
> > >> >>> >> >> > > class and edge storage is delegated
to a VertexEdges
> > >>class.
> > >> >>> >> >> > > So far I'm adding some generic
implementations
> > >> >>>(ByteArrayEdges,
> > >> >>> >> >> > > ArrayListEdges, HashMapEdges)
that work for all types,
> and
> > >> >>>some
> > >> >>> >> >> optimized
> > >> >>> >> >> > > ones (LongDoubleArrayEdges, LongNullArrayEdges).
> > >> >>> >> >> > >
> > >> >>> >> >> > > Do you specifically need edge
values to be float while
> the
> > >> >>>other
> > >> >>> >> >>types
> > >> >>> >> >> > are
> > >> >>> >> >> > > double?
> > >> >>> >> >> > > It seems to me it would make sense
to change
> > >>RandomWalkVertex
> > >> >>>to
> > >> >>> >>use
> > >> >>> >> >> > > double edge values instead, and
avoid code duplication
> > >>(i.e.
> > >> >>> >>adding
> > >> >>> >> >>a
> > >> >>> >> >> > > LongFloatArrayEdges that's basically
the same). We're
> not
> > >> >>>Trove
> > >> >>> >> >>after
> > >> >>> >> >> > all.
> > >> >>> >> >> > > Makes sense?
> > >> >>> >> >> > >
> > >> >>> >> >> > > Thanks for the feedback,
> > >> >>> >> >> > >
> > >> >>> >> >> > > Alessandro
> > >> >>> >> >> > >
> > >> >>> >> >> > >
> > >> >>> >> >> > > On 2/28/13 1:54 AM, "Gianmarco
De Francisci Morales"
> > >> >>> >> >><gdfm@apache.org>
> > >> >>> >> >> > > wrote:
> > >> >>> >> >> > >
> > >> >>> >> >> > > >Hi,
> > >> >>> >> >> > > >
> > >> >>> >> >> > > >Maybe the specific implementation
can be thrown away,
> but
> > >> >>> >> >>personally I
> > >> >>> >> >> > > >feel
> > >> >>> >> >> > > >very strongly for the need
of a good
> > >>LongDoubleFloatDouble
> > >> >>> >>vertex.
> > >> >>> >> >> > > >It's the base for any serious
random walk algorithm.
> > >> >>> >> >> > > >
> > >> >>> >> >> > > >I would call for a refactoring
rather than a removal.
> > >> >>> >> >> > > >
> > >> >>> >> >> > > >Just my 2c.
> > >> >>> >> >> > > >
> > >> >>> >> >> > > >Cheers,
> > >> >>> >> >> > > >
> > >> >>> >> >> > > >--
> > >> >>> >> >> > > >Gianmarco
> > >> >>> >> >> > > >
> > >> >>> >> >> > > >
> > >> >>> >> >> > > >On Thu, Feb 28, 2013 at 7:54
AM, Alessandro Presta
> > >> >>> >> >> > > ><alessandro@fb.com>wrote:
> > >> >>> >> >> > > >
> > >> >>> >> >> > > >> Hi all,
> > >> >>> >> >> > > >>
> > >> >>> >> >> > > >> Does anyone feel strongly
for
> > >>LongDoubleFloatDoubleVertex?
> > >> >>> >> >> > > >> Reasons why I think it
should be removed:
> > >> >>> >> >> > > >>
> > >> >>> >> >> > > >>   1.  Right now it's
incorrect (returns target vertex
> > >>id
> > >> >>>as
> > >> >>> >>edge
> > >> >>> >> >> > value).
> > >> >>> >> >> > > >>   2.  Iteration will
always be inefficient, since the
> > >> >>> >>underlying
> > >> >>> >> >> > Mahout
> > >> >>> >> >> > > >> open-addressing hash
map implementation doesn't
> provide
> > >> >>> >> >>iterators.
> > >> >>> >> >> It
> > >> >>> >> >> > > >> provides a way to copy
the keys and values to
> external
> > >> >>> >> >>arrays/lists.
> > >> >>> >> >> > > >>   3.  It's the only reason
why we have Mahout as a
> > >> >>>dependency.
> > >> >>> >> >> > > >>
> > >> >>> >> >> > > >> I think we should strive
to provide model
> > >>implementations
> > >> >>>that
> > >> >>> >> >>are
> > >> >>> >> >> > > >>generic
> > >> >>> >> >> > > >> and/or extremely efficient.
This one satisfies
> neither.
> > >> >>> >> >> > > >>
> > >> >>> >> >> > > >> Thanks,
> > >> >>> >> >> > > >>
> > >> >>> >> >> > > >> Alessandro
> > >> >>> >> >> > > >>
> > >> >>> >> >> > >
> > >> >>> >> >> > >
> > >> >>> >> >> >
> > >> >>> >> >> >
> > >> >>> >> >> > --
> > >> >>> >> >> >    Claudio Martella
> > >> >>> >> >> >    claudio.martella@gmail.com
> > >> >>> >> >> >
> > >> >>> >> >>
> > >> >>> >> >
> > >> >>> >> >
> > >> >>> >> >
> > >> >>> >> >--
> > >> >>> >> >
> > >> >>> >> >  -jake
> > >> >>> >>
> > >> >>> >>
> > >> >>> >
> > >> >>> >
> > >> >>> >--
> > >> >>> >
> > >> >>> >  -jake
> > >> >>>
> > >> >>>
> > >> >>
> > >> >>
> > >> >>--
> > >> >>
> > >> >>  -jake
> > >> >
> > >>
> > >>
> >
> >
>
>
> --
>    Claudio Martella
>    claudio.martella@gmail.com
>



-- 

  -jake

Mime
  • Unnamed multipart/alternative (inline, None, 0 bytes)
View raw message