giraph-dev mailing list archives

Site index · List index
Message view « Date » · « Thread »
Top « Date » · « Thread »
From Eli Reisman <apache.mail...@gmail.com>
Subject Re: LongDoubleFloatDoubleVertex
Date Fri, 01 Mar 2013 20:47:00 GMT
That is true, its not the end of the world to make the optimized edge store
impl's throw exceptions when you try to mutate. As long as object-oriented
edge stores are possible for the user, they have the option to trade
efficiency or resources for mutation if they need it (or choose.)

One thing occured to me: if we ever get GIRAPH-26 up and running, Jake had
mentioned we would need the Mahout math libs to do it (Someone mentioned on
that thread that the Colt libs will not work for us I think.) so thats one
reason to keep depending on Mahout in some form.


On Fri, Mar 1, 2013 at 10:53 AM, Alessandro Presta <alessandro@fb.com>wrote:

> On 3/1/13 1:01 AM, "Claudio Martella" <claudio.martella@gmail.com> wrote:
>
> >Imagine you're infering an "is_a" property. e.g. you have lion - is_a ->
> >mammal - is_a -> animal. by propagation you'd infer lion - is_a -> animal.
> >Clearly, i'm talking about a graph that does not only have is_a typed
> >edges.
> >
> >Two things can be noted: (1) one could filter out non is_a edges directly
> >at loading in this case but (2) you'd want more complex of these, e.g. a
> >path (a concatenation of is_a with lives_in or something?) to find out at
> >query time or something.
>
> What would be the vertex id in this case? I don't think it would be "is_a"
> or "lives_in", right? That would probably be in the edge value.
> Also, wouldn't you still need to read all the edges for this type of
> application?
>
> >
> >This is one example. I think most of our algorithms out there touch all
> >the
> >edges and iterate over (why else using such a big framework and loading
> >the
> >whole graph at first place?), so I'd optimise for them or avoid penalising
> >them, but I would not remove from the API random access to edges either.
> >After all, the idea is to have them implement their (to come) Edges class.
> >
>
> The problem is that with every method we add to the API we increase the
> burden on the user.
> Also, more requirements = more constraints = less opportunities for
> optimization.
>
> getEdgeValue() poses a few different issues:
> - it will be slow in most implementations
> - if we want it to at least be as space-efficient as getEdges(), it would
> have to reuse the E object
> - the signature for general graphs would have to be Iterable<E>
> getEdgeValues(I); in order to have an easier interface for strict graphs,
> we would have to make the inheritance structure more complex (adding
> something like a StrictVertex)
>
> That said, nothing prevents the user from adding that functionality
> himself to VertexEdges, and then in Vertex casting getEdges() to the known
> VertexEdges implementation, and call whatever method (getEdgeValue(),
> whatnot) on it.
>
> >
> >On Fri, Mar 1, 2013 at 9:37 AM, Alessandro Presta <alessandro@fb.com>
> >wrote:
> >
> >> On 3/1/13 12:33 AM, "Claudio Martella" <claudio.martella@gmail.com>
> >>wrote:
> >>
> >> >On Fri, Mar 1, 2013 at 5:13 AM, Alessandro Presta <alessandro@fb.com>
> >> >wrote:
> >> >
> >> >> On 2/28/13 5:08 PM, "Jake Mannix" <jake.mannix@gmail.com> wrote:
> >> >>
> >> >> >On Thu, Feb 28, 2013 at 4:18 PM, Alessandro Presta
> >> >> ><alessandro@fb.com>wrote:
> >> >> >
> >> >> >> It's not like it causes problems, it's just that it's a pretty
big
> >> >> >> dependency to justify for a small example.
> >> >> >>
> >> >> >> As for the motivation, if your point is to prove the framework's
> >> >> >> superiority in some context, then you can use the simplest
> >>possible
> >> >> >> implementation (ArrayList).
> >> >> >>
> >> >> >
> >> >> >This takes up LOTS of memory.  Primitives rule, objects drool
> >>(memory).
> >> >>
> >> >> Ok, but having to copy the keys and values to external arrays (which
> >>is
> >> >> what you have to do with Mahout's hash map) is even worse. A good
> >> >> implementation of (long, double) edges (e.g. for RandomWalkVertex)
is
> >> >> primitive arrays.
> >> >>
> >> >> >
> >> >> >
> >> >> >> The Giraph framework is all about iterating over edges, so
an
> >> >> >> implementation that doesn't support that with reasonable
> >>efficiency
> >> >> >> doesn't make a lot of sense to me.
> >> >> >>
> >> >> >> It also follows that hash map-based implementations only make
> >>sense
> >> >>for
> >> >> >> algorithms that make use of mutations.
> >> >> >>
> >> >> >
> >> >> >Agreed, maps aren't absolutely necessary for immutable graphs.
 But
> >> >>random
> >> >> >access to collections of integers _is_ necessary for many
> >>algorithms.
> >> >> >It's
> >> >> >not all just iteration over lists.
> >> >>
> >> >> Can you give me some concrete examples of algorithms where random
> >>access
> >> >> to the edges is required?
> >> >> I'm really interested in this, because I'm considering killing
> >> >> getEdgeValue() if there are no use cases.
> >> >>
> >> >
> >> >My example would be typed/labeled graphs (e.g. RDF) for reasoning and
> >> >traversals based on particular patterns/path.
> >> >
> >>
> >> Can you be a bit more specific? I'm interested in knowing if this
> >> algorithm would call getEdgeValue(id) for some target vertex id.
> >>
> >> >
> >> >>
> >> >> >
> >> >> >
> >> >> >> That said, something like a Trove hash map would probably
be more
> >> >> >> appropriate (more efficient than the standard Java HashMap,
at the
> >> >> >>expense
> >> >> >> of generality). That could be a good candidate for a
> >> >> >> LongDoubleHashMapEdges implementation.
> >> >> >> I can give that a shot if it sounds good.
> >> >> >>
> >> >> >
> >> >> >Trove is LGPL, IIRC, so that doesn't work in Apache projects.
> >> >>
> >> >> Whoops, totally missed that part.
> >> >>
> >> >> >
> >> >> >What does Giraph depend on of Mahout now?  Just mahout-collections?
> >> >> >That's
> >> >> >not a very big dependency, and has all sorts of
> >>primitive-to-primitive
> >> >> >collections,
> >> >> >and from what I've seen benchmarked, just as good or better than
> >>Trove.
> >> >> > Carrot2's
> >> >> >hppc may be better yet, but I'm not sure if that is stably released
> >> >>yet.
> >> >>
> >> >> I'll take a look at HPPC and Mahout collections then. They all seem
> >>to
> >> >> provide the same stuff, so I'll consider benchmarks and convenience
> >>of
> >> >>the
> >> >> API. Thanks for the pointers.
> >> >>
> >> >> >
> >> >> >
> >> >> >>
> >> >> >> On 2/28/13 4:03 PM, "Jake Mannix" <jake.mannix@gmail.com>
wrote:
> >> >> >>
> >> >> >> >Is the mahout dependency causing problems?
> >> >> >> >
> >> >> >> >It would be nice if we could actually implement some of
the
> >> >>algorithms
> >> >> >> >that
> >> >> >> >Mahout does via map-reduce in Giraph's BSP formalism,
to show off
> >> >>how
> >> >> >>it
> >> >> >> >improves things.  Using the Mahout primitives can show
that it's
> >>not
> >> >> >>about
> >> >> >> >the inner loop implementation, but the framework itself...
> >> >> >> >
> >> >> >> >
> >> >> >> >On Thu, Feb 28, 2013 at 1:55 PM, Eli Reisman
> >> >> >> ><apache.mailbox@gmail.com>wrote:
> >> >> >> >
> >> >> >> >> I like the idea of refactoring it into something
more
> >>appropriate
> >> >> >>for us
> >> >> >> >> and ditching the Mahout dep. Good looking out.
> >> >> >> >>
> >> >> >> >>
> >> >> >> >> On Thu, Feb 28, 2013 at 10:15 AM, Claudio Martella
<
> >> >> >> >> claudio.martella@gmail.com> wrote:
> >> >> >> >>
> >> >> >> >> > I agree, at this point we could have a RandomWalkVertex
with
> >> >>edge
> >> >> >> >>values,
> >> >> >> >> > and a "null-edged" vertex for the PR benchmarks.
> >> >> >> >> > We make everybody happy and avoid code duplication.
> >> >> >> >> >
> >> >> >> >> >
> >> >> >> >> > On Thu, Feb 28, 2013 at 7:12 PM, Alessandro
Presta
> >> >> >><alessandro@fb.com
> >> >> >> >> > >wrote:
> >> >> >> >> >
> >> >> >> >> > > Hi Gianmarco,
> >> >> >> >> > >
> >> >> >> >> > > Yes, there will be more efficient implementations.
> >> >> >> >> > > In the redesign I'm working on (GIRAPH-528),
there will be
> >> >>only
> >> >> >>one
> >> >> >> >> > Vertex
> >> >> >> >> > > class and edge storage is delegated to
a VertexEdges class.
> >> >> >> >> > > So far I'm adding some generic implementations
> >> >>(ByteArrayEdges,
> >> >> >> >> > > ArrayListEdges, HashMapEdges) that work
for all types, and
> >> >>some
> >> >> >> >> optimized
> >> >> >> >> > > ones (LongDoubleArrayEdges, LongNullArrayEdges).
> >> >> >> >> > >
> >> >> >> >> > > Do you specifically need edge values to
be float while the
> >> >>other
> >> >> >> >>types
> >> >> >> >> > are
> >> >> >> >> > > double?
> >> >> >> >> > > It seems to me it would make sense to change
> >>RandomWalkVertex
> >> >>to
> >> >> >>use
> >> >> >> >> > > double edge values instead, and avoid code
duplication
> >>(i.e.
> >> >> >>adding
> >> >> >> >>a
> >> >> >> >> > > LongFloatArrayEdges that's basically the
same). We're not
> >> >>Trove
> >> >> >> >>after
> >> >> >> >> > all.
> >> >> >> >> > > Makes sense?
> >> >> >> >> > >
> >> >> >> >> > > Thanks for the feedback,
> >> >> >> >> > >
> >> >> >> >> > > Alessandro
> >> >> >> >> > >
> >> >> >> >> > >
> >> >> >> >> > > On 2/28/13 1:54 AM, "Gianmarco De Francisci
Morales"
> >> >> >> >><gdfm@apache.org>
> >> >> >> >> > > wrote:
> >> >> >> >> > >
> >> >> >> >> > > >Hi,
> >> >> >> >> > > >
> >> >> >> >> > > >Maybe the specific implementation can
be thrown away, but
> >> >> >> >>personally I
> >> >> >> >> > > >feel
> >> >> >> >> > > >very strongly for the need of a good
LongDoubleFloatDouble
> >> >> >>vertex.
> >> >> >> >> > > >It's the base for any serious random
walk algorithm.
> >> >> >> >> > > >
> >> >> >> >> > > >I would call for a refactoring rather
than a removal.
> >> >> >> >> > > >
> >> >> >> >> > > >Just my 2c.
> >> >> >> >> > > >
> >> >> >> >> > > >Cheers,
> >> >> >> >> > > >
> >> >> >> >> > > >--
> >> >> >> >> > > >Gianmarco
> >> >> >> >> > > >
> >> >> >> >> > > >
> >> >> >> >> > > >On Thu, Feb 28, 2013 at 7:54 AM, Alessandro
Presta
> >> >> >> >> > > ><alessandro@fb.com>wrote:
> >> >> >> >> > > >
> >> >> >> >> > > >> Hi all,
> >> >> >> >> > > >>
> >> >> >> >> > > >> Does anyone feel strongly for
> >>LongDoubleFloatDoubleVertex?
> >> >> >> >> > > >> Reasons why I think it should
be removed:
> >> >> >> >> > > >>
> >> >> >> >> > > >>   1.  Right now it's incorrect
(returns target vertex
> >>id as
> >> >> >>edge
> >> >> >> >> > value).
> >> >> >> >> > > >>   2.  Iteration will always be
inefficient, since the
> >> >> >>underlying
> >> >> >> >> > Mahout
> >> >> >> >> > > >> open-addressing hash map implementation
doesn't provide
> >> >> >> >>iterators.
> >> >> >> >> It
> >> >> >> >> > > >> provides a way to copy the keys
and values to external
> >> >> >> >>arrays/lists.
> >> >> >> >> > > >>   3.  It's the only reason why
we have Mahout as a
> >> >>dependency.
> >> >> >> >> > > >>
> >> >> >> >> > > >> I think we should strive to provide
model
> >>implementations
> >> >>that
> >> >> >> >>are
> >> >> >> >> > > >>generic
> >> >> >> >> > > >> and/or extremely efficient. This
one satisfies neither.
> >> >> >> >> > > >>
> >> >> >> >> > > >> Thanks,
> >> >> >> >> > > >>
> >> >> >> >> > > >> Alessandro
> >> >> >> >> > > >>
> >> >> >> >> > >
> >> >> >> >> > >
> >> >> >> >> >
> >> >> >> >> >
> >> >> >> >> > --
> >> >> >> >> >    Claudio Martella
> >> >> >> >> >    claudio.martella@gmail.com
> >> >> >> >> >
> >> >> >> >>
> >> >> >> >
> >> >> >> >
> >> >> >> >
> >> >> >> >--
> >> >> >> >
> >> >> >> >  -jake
> >> >> >>
> >> >> >>
> >> >> >
> >> >> >
> >> >> >--
> >> >> >
> >> >> >  -jake
> >> >>
> >> >>
> >> >
> >> >
> >> >--
> >> >   Claudio Martella
> >> >   claudio.martella@gmail.com
> >>
> >>
> >
> >
> >--
> >   Claudio Martella
> >   claudio.martella@gmail.com
>
>

Mime
  • Unnamed multipart/alternative (inline, None, 0 bytes)
View raw message