giraph-dev mailing list archives

Site index · List index
Message view « Date » · « Thread »
Top « Date » · « Thread »
From Jake Mannix <jake.man...@gmail.com>
Subject Re: LongDoubleFloatDoubleVertex
Date Sat, 02 Mar 2013 00:12:05 GMT
Sorry to have missed out on some of this conversation - had some work stuff
interrupt (how dare it!)

On Thu, Feb 28, 2013 at 8:13 PM, Alessandro Presta <alessandro@fb.com>wrote:

> On 2/28/13 5:08 PM, "Jake Mannix" <jake.mannix@gmail.com> wrote:
>
> >On Thu, Feb 28, 2013 at 4:18 PM, Alessandro Presta
> ><alessandro@fb.com>wrote:
> >
> >> It's not like it causes problems, it's just that it's a pretty big
> >> dependency to justify for a small example.
> >>
> >> As for the motivation, if your point is to prove the framework's
> >> superiority in some context, then you can use the simplest possible
> >> implementation (ArrayList).
> >>
> >
> >This takes up LOTS of memory.  Primitives rule, objects drool (memory).
>
> Ok, but having to copy the keys and values to external arrays (which is
> what you have to do with Mahout's hash map) is even worse. A good
> implementation of (long, double) edges (e.g. for RandomWalkVertex) is
> primitive arrays.
>

So it is certainly true that the *typical* usage of iteration over Mahout
maps
is by doing this copy of internal values (which is not efficient, no), it's
not necessary: OpenLongFloatHashMap.forEachPair(LongFloatProcedure p)
passes in a callback which operates directly on the underlying primitive
arrays.


>
> >
> >
> >> The Giraph framework is all about iterating over edges, so an
> >> implementation that doesn't support that with reasonable efficiency
> >> doesn't make a lot of sense to me.
> >>
> >> It also follows that hash map-based implementations only make sense for
> >> algorithms that make use of mutations.
> >>
> >
> >Agreed, maps aren't absolutely necessary for immutable graphs.  But random
> >access to collections of integers _is_ necessary for many algorithms.
> >It's
> >not all just iteration over lists.
>
> Can you give me some concrete examples of algorithms where random access
> to the edges is required?
> I'm really interested in this, because I'm considering killing
> getEdgeValue() if there are no use cases.
>

So I'm not a graph person, I think in terms of matrices, but since every
graph
has an adjecency matrix, and every matrix is the adjacency matrix of some
(possibly
bipartite) graph, lets pretend they're basically the same:

If you load a graph into Giraph, and want to compute the matrix product of
this
graph with collection of input vectors, then you may want to take each of
these
input vectors and compute their dot products with the vertices of the graph
(to
mix my metaphors: each vertex is essentially a row or column of the matrix),
which can be done by either iterating over the input vector and looking up
randomly for nonzero entries in the vertex, or the reverse.

I'm not sure what this would correspond to in graphland, but I can
certainly see
wanting to have a big in-memory matrix which you can compute dot products
of vectors with each row of it, and Giraph would do this very efficiently
(but some
of the implementations of this would assume you had random access to the
edges (meaning edgeId and edgeWeight of each vertex).


>
> >
> >
> >> That said, something like a Trove hash map would probably be more
> >> appropriate (more efficient than the standard Java HashMap, at the
> >>expense
> >> of generality). That could be a good candidate for a
> >> LongDoubleHashMapEdges implementation.
> >> I can give that a shot if it sounds good.
> >>
> >
> >Trove is LGPL, IIRC, so that doesn't work in Apache projects.
>
> Whoops, totally missed that part.
>
> >
> >What does Giraph depend on of Mahout now?  Just mahout-collections?
> >That's
> >not a very big dependency, and has all sorts of primitive-to-primitive
> >collections,
> >and from what I've seen benchmarked, just as good or better than Trove.
> > Carrot2's
> >hppc may be better yet, but I'm not sure if that is stably released yet.
>
> I'll take a look at HPPC and Mahout collections then. They all seem to
> provide the same stuff, so I'll consider benchmarks and convenience of the
> API. Thanks for the pointers.
>
> >
> >
> >>
> >> On 2/28/13 4:03 PM, "Jake Mannix" <jake.mannix@gmail.com> wrote:
> >>
> >> >Is the mahout dependency causing problems?
> >> >
> >> >It would be nice if we could actually implement some of the algorithms
> >> >that
> >> >Mahout does via map-reduce in Giraph's BSP formalism, to show off how
> >>it
> >> >improves things.  Using the Mahout primitives can show that it's not
> >>about
> >> >the inner loop implementation, but the framework itself...
> >> >
> >> >
> >> >On Thu, Feb 28, 2013 at 1:55 PM, Eli Reisman
> >> ><apache.mailbox@gmail.com>wrote:
> >> >
> >> >> I like the idea of refactoring it into something more appropriate
> >>for us
> >> >> and ditching the Mahout dep. Good looking out.
> >> >>
> >> >>
> >> >> On Thu, Feb 28, 2013 at 10:15 AM, Claudio Martella <
> >> >> claudio.martella@gmail.com> wrote:
> >> >>
> >> >> > I agree, at this point we could have a RandomWalkVertex with edge
> >> >>values,
> >> >> > and a "null-edged" vertex for the PR benchmarks.
> >> >> > We make everybody happy and avoid code duplication.
> >> >> >
> >> >> >
> >> >> > On Thu, Feb 28, 2013 at 7:12 PM, Alessandro Presta
> >><alessandro@fb.com
> >> >> > >wrote:
> >> >> >
> >> >> > > Hi Gianmarco,
> >> >> > >
> >> >> > > Yes, there will be more efficient implementations.
> >> >> > > In the redesign I'm working on (GIRAPH-528), there will be
only
> >>one
> >> >> > Vertex
> >> >> > > class and edge storage is delegated to a VertexEdges class.
> >> >> > > So far I'm adding some generic implementations (ByteArrayEdges,
> >> >> > > ArrayListEdges, HashMapEdges) that work for all types, and
some
> >> >> optimized
> >> >> > > ones (LongDoubleArrayEdges, LongNullArrayEdges).
> >> >> > >
> >> >> > > Do you specifically need edge values to be float while the
other
> >> >>types
> >> >> > are
> >> >> > > double?
> >> >> > > It seems to me it would make sense to change RandomWalkVertex
to
> >>use
> >> >> > > double edge values instead, and avoid code duplication (i.e.
> >>adding
> >> >>a
> >> >> > > LongFloatArrayEdges that's basically the same). We're not
Trove
> >> >>after
> >> >> > all.
> >> >> > > Makes sense?
> >> >> > >
> >> >> > > Thanks for the feedback,
> >> >> > >
> >> >> > > Alessandro
> >> >> > >
> >> >> > >
> >> >> > > On 2/28/13 1:54 AM, "Gianmarco De Francisci Morales"
> >> >><gdfm@apache.org>
> >> >> > > wrote:
> >> >> > >
> >> >> > > >Hi,
> >> >> > > >
> >> >> > > >Maybe the specific implementation can be thrown away,
but
> >> >>personally I
> >> >> > > >feel
> >> >> > > >very strongly for the need of a good LongDoubleFloatDouble
> >>vertex.
> >> >> > > >It's the base for any serious random walk algorithm.
> >> >> > > >
> >> >> > > >I would call for a refactoring rather than a removal.
> >> >> > > >
> >> >> > > >Just my 2c.
> >> >> > > >
> >> >> > > >Cheers,
> >> >> > > >
> >> >> > > >--
> >> >> > > >Gianmarco
> >> >> > > >
> >> >> > > >
> >> >> > > >On Thu, Feb 28, 2013 at 7:54 AM, Alessandro Presta
> >> >> > > ><alessandro@fb.com>wrote:
> >> >> > > >
> >> >> > > >> Hi all,
> >> >> > > >>
> >> >> > > >> Does anyone feel strongly for LongDoubleFloatDoubleVertex?
> >> >> > > >> Reasons why I think it should be removed:
> >> >> > > >>
> >> >> > > >>   1.  Right now it's incorrect (returns target vertex
id as
> >>edge
> >> >> > value).
> >> >> > > >>   2.  Iteration will always be inefficient, since
the
> >>underlying
> >> >> > Mahout
> >> >> > > >> open-addressing hash map implementation doesn't
provide
> >> >>iterators.
> >> >> It
> >> >> > > >> provides a way to copy the keys and values to external
> >> >>arrays/lists.
> >> >> > > >>   3.  It's the only reason why we have Mahout as
a dependency.
> >> >> > > >>
> >> >> > > >> I think we should strive to provide model implementations
that
> >> >>are
> >> >> > > >>generic
> >> >> > > >> and/or extremely efficient. This one satisfies neither.
> >> >> > > >>
> >> >> > > >> Thanks,
> >> >> > > >>
> >> >> > > >> Alessandro
> >> >> > > >>
> >> >> > >
> >> >> > >
> >> >> >
> >> >> >
> >> >> > --
> >> >> >    Claudio Martella
> >> >> >    claudio.martella@gmail.com
> >> >> >
> >> >>
> >> >
> >> >
> >> >
> >> >--
> >> >
> >> >  -jake
> >>
> >>
> >
> >
> >--
> >
> >  -jake
>
>


-- 

  -jake

Mime
  • Unnamed multipart/alternative (inline, None, 0 bytes)
View raw message