giraph-dev mailing list archives

Site index · List index
Message view « Date » · « Thread »
Top « Date » · « Thread »
From Alessandro Presta <alessan...@fb.com>
Subject Re: LongDoubleFloatDoubleVertex
Date Fri, 01 Mar 2013 18:53:45 GMT
On 3/1/13 1:01 AM, "Claudio Martella" <claudio.martella@gmail.com> wrote:

>Imagine you're infering an "is_a" property. e.g. you have lion - is_a ->
>mammal - is_a -> animal. by propagation you'd infer lion - is_a -> animal.
>Clearly, i'm talking about a graph that does not only have is_a typed
>edges.
>
>Two things can be noted: (1) one could filter out non is_a edges directly
>at loading in this case but (2) you'd want more complex of these, e.g. a
>path (a concatenation of is_a with lives_in or something?) to find out at
>query time or something.

What would be the vertex id in this case? I don't think it would be "is_a"
or "lives_in", right? That would probably be in the edge value.
Also, wouldn't you still need to read all the edges for this type of
application?

>
>This is one example. I think most of our algorithms out there touch all
>the
>edges and iterate over (why else using such a big framework and loading
>the
>whole graph at first place?), so I'd optimise for them or avoid penalising
>them, but I would not remove from the API random access to edges either.
>After all, the idea is to have them implement their (to come) Edges class.
>

The problem is that with every method we add to the API we increase the
burden on the user.
Also, more requirements = more constraints = less opportunities for
optimization.

getEdgeValue() poses a few different issues:
- it will be slow in most implementations
- if we want it to at least be as space-efficient as getEdges(), it would
have to reuse the E object
- the signature for general graphs would have to be Iterable<E>
getEdgeValues(I); in order to have an easier interface for strict graphs,
we would have to make the inheritance structure more complex (adding
something like a StrictVertex)

That said, nothing prevents the user from adding that functionality
himself to VertexEdges, and then in Vertex casting getEdges() to the known
VertexEdges implementation, and call whatever method (getEdgeValue(),
whatnot) on it.

>
>On Fri, Mar 1, 2013 at 9:37 AM, Alessandro Presta <alessandro@fb.com>
>wrote:
>
>> On 3/1/13 12:33 AM, "Claudio Martella" <claudio.martella@gmail.com>
>>wrote:
>>
>> >On Fri, Mar 1, 2013 at 5:13 AM, Alessandro Presta <alessandro@fb.com>
>> >wrote:
>> >
>> >> On 2/28/13 5:08 PM, "Jake Mannix" <jake.mannix@gmail.com> wrote:
>> >>
>> >> >On Thu, Feb 28, 2013 at 4:18 PM, Alessandro Presta
>> >> ><alessandro@fb.com>wrote:
>> >> >
>> >> >> It's not like it causes problems, it's just that it's a pretty
big
>> >> >> dependency to justify for a small example.
>> >> >>
>> >> >> As for the motivation, if your point is to prove the framework's
>> >> >> superiority in some context, then you can use the simplest
>>possible
>> >> >> implementation (ArrayList).
>> >> >>
>> >> >
>> >> >This takes up LOTS of memory.  Primitives rule, objects drool
>>(memory).
>> >>
>> >> Ok, but having to copy the keys and values to external arrays (which
>>is
>> >> what you have to do with Mahout's hash map) is even worse. A good
>> >> implementation of (long, double) edges (e.g. for RandomWalkVertex) is
>> >> primitive arrays.
>> >>
>> >> >
>> >> >
>> >> >> The Giraph framework is all about iterating over edges, so an
>> >> >> implementation that doesn't support that with reasonable
>>efficiency
>> >> >> doesn't make a lot of sense to me.
>> >> >>
>> >> >> It also follows that hash map-based implementations only make
>>sense
>> >>for
>> >> >> algorithms that make use of mutations.
>> >> >>
>> >> >
>> >> >Agreed, maps aren't absolutely necessary for immutable graphs.  But
>> >>random
>> >> >access to collections of integers _is_ necessary for many
>>algorithms.
>> >> >It's
>> >> >not all just iteration over lists.
>> >>
>> >> Can you give me some concrete examples of algorithms where random
>>access
>> >> to the edges is required?
>> >> I'm really interested in this, because I'm considering killing
>> >> getEdgeValue() if there are no use cases.
>> >>
>> >
>> >My example would be typed/labeled graphs (e.g. RDF) for reasoning and
>> >traversals based on particular patterns/path.
>> >
>>
>> Can you be a bit more specific? I'm interested in knowing if this
>> algorithm would call getEdgeValue(id) for some target vertex id.
>>
>> >
>> >>
>> >> >
>> >> >
>> >> >> That said, something like a Trove hash map would probably be more
>> >> >> appropriate (more efficient than the standard Java HashMap, at
the
>> >> >>expense
>> >> >> of generality). That could be a good candidate for a
>> >> >> LongDoubleHashMapEdges implementation.
>> >> >> I can give that a shot if it sounds good.
>> >> >>
>> >> >
>> >> >Trove is LGPL, IIRC, so that doesn't work in Apache projects.
>> >>
>> >> Whoops, totally missed that part.
>> >>
>> >> >
>> >> >What does Giraph depend on of Mahout now?  Just mahout-collections?
>> >> >That's
>> >> >not a very big dependency, and has all sorts of
>>primitive-to-primitive
>> >> >collections,
>> >> >and from what I've seen benchmarked, just as good or better than
>>Trove.
>> >> > Carrot2's
>> >> >hppc may be better yet, but I'm not sure if that is stably released
>> >>yet.
>> >>
>> >> I'll take a look at HPPC and Mahout collections then. They all seem
>>to
>> >> provide the same stuff, so I'll consider benchmarks and convenience
>>of
>> >>the
>> >> API. Thanks for the pointers.
>> >>
>> >> >
>> >> >
>> >> >>
>> >> >> On 2/28/13 4:03 PM, "Jake Mannix" <jake.mannix@gmail.com>
wrote:
>> >> >>
>> >> >> >Is the mahout dependency causing problems?
>> >> >> >
>> >> >> >It would be nice if we could actually implement some of the
>> >>algorithms
>> >> >> >that
>> >> >> >Mahout does via map-reduce in Giraph's BSP formalism, to show
off
>> >>how
>> >> >>it
>> >> >> >improves things.  Using the Mahout primitives can show that
it's
>>not
>> >> >>about
>> >> >> >the inner loop implementation, but the framework itself...
>> >> >> >
>> >> >> >
>> >> >> >On Thu, Feb 28, 2013 at 1:55 PM, Eli Reisman
>> >> >> ><apache.mailbox@gmail.com>wrote:
>> >> >> >
>> >> >> >> I like the idea of refactoring it into something more
>>appropriate
>> >> >>for us
>> >> >> >> and ditching the Mahout dep. Good looking out.
>> >> >> >>
>> >> >> >>
>> >> >> >> On Thu, Feb 28, 2013 at 10:15 AM, Claudio Martella <
>> >> >> >> claudio.martella@gmail.com> wrote:
>> >> >> >>
>> >> >> >> > I agree, at this point we could have a RandomWalkVertex
with
>> >>edge
>> >> >> >>values,
>> >> >> >> > and a "null-edged" vertex for the PR benchmarks.
>> >> >> >> > We make everybody happy and avoid code duplication.
>> >> >> >> >
>> >> >> >> >
>> >> >> >> > On Thu, Feb 28, 2013 at 7:12 PM, Alessandro Presta
>> >> >><alessandro@fb.com
>> >> >> >> > >wrote:
>> >> >> >> >
>> >> >> >> > > Hi Gianmarco,
>> >> >> >> > >
>> >> >> >> > > Yes, there will be more efficient implementations.
>> >> >> >> > > In the redesign I'm working on (GIRAPH-528),
there will be
>> >>only
>> >> >>one
>> >> >> >> > Vertex
>> >> >> >> > > class and edge storage is delegated to a VertexEdges
class.
>> >> >> >> > > So far I'm adding some generic implementations
>> >>(ByteArrayEdges,
>> >> >> >> > > ArrayListEdges, HashMapEdges) that work for
all types, and
>> >>some
>> >> >> >> optimized
>> >> >> >> > > ones (LongDoubleArrayEdges, LongNullArrayEdges).
>> >> >> >> > >
>> >> >> >> > > Do you specifically need edge values to be float
while the
>> >>other
>> >> >> >>types
>> >> >> >> > are
>> >> >> >> > > double?
>> >> >> >> > > It seems to me it would make sense to change
>>RandomWalkVertex
>> >>to
>> >> >>use
>> >> >> >> > > double edge values instead, and avoid code duplication
>>(i.e.
>> >> >>adding
>> >> >> >>a
>> >> >> >> > > LongFloatArrayEdges that's basically the same).
We're not
>> >>Trove
>> >> >> >>after
>> >> >> >> > all.
>> >> >> >> > > Makes sense?
>> >> >> >> > >
>> >> >> >> > > Thanks for the feedback,
>> >> >> >> > >
>> >> >> >> > > Alessandro
>> >> >> >> > >
>> >> >> >> > >
>> >> >> >> > > On 2/28/13 1:54 AM, "Gianmarco De Francisci
Morales"
>> >> >> >><gdfm@apache.org>
>> >> >> >> > > wrote:
>> >> >> >> > >
>> >> >> >> > > >Hi,
>> >> >> >> > > >
>> >> >> >> > > >Maybe the specific implementation can be
thrown away, but
>> >> >> >>personally I
>> >> >> >> > > >feel
>> >> >> >> > > >very strongly for the need of a good LongDoubleFloatDouble
>> >> >>vertex.
>> >> >> >> > > >It's the base for any serious random walk
algorithm.
>> >> >> >> > > >
>> >> >> >> > > >I would call for a refactoring rather than
a removal.
>> >> >> >> > > >
>> >> >> >> > > >Just my 2c.
>> >> >> >> > > >
>> >> >> >> > > >Cheers,
>> >> >> >> > > >
>> >> >> >> > > >--
>> >> >> >> > > >Gianmarco
>> >> >> >> > > >
>> >> >> >> > > >
>> >> >> >> > > >On Thu, Feb 28, 2013 at 7:54 AM, Alessandro
Presta
>> >> >> >> > > ><alessandro@fb.com>wrote:
>> >> >> >> > > >
>> >> >> >> > > >> Hi all,
>> >> >> >> > > >>
>> >> >> >> > > >> Does anyone feel strongly for
>>LongDoubleFloatDoubleVertex?
>> >> >> >> > > >> Reasons why I think it should be removed:
>> >> >> >> > > >>
>> >> >> >> > > >>   1.  Right now it's incorrect (returns
target vertex
>>id as
>> >> >>edge
>> >> >> >> > value).
>> >> >> >> > > >>   2.  Iteration will always be inefficient,
since the
>> >> >>underlying
>> >> >> >> > Mahout
>> >> >> >> > > >> open-addressing hash map implementation
doesn't provide
>> >> >> >>iterators.
>> >> >> >> It
>> >> >> >> > > >> provides a way to copy the keys and
values to external
>> >> >> >>arrays/lists.
>> >> >> >> > > >>   3.  It's the only reason why we have
Mahout as a
>> >>dependency.
>> >> >> >> > > >>
>> >> >> >> > > >> I think we should strive to provide
model
>>implementations
>> >>that
>> >> >> >>are
>> >> >> >> > > >>generic
>> >> >> >> > > >> and/or extremely efficient. This one
satisfies neither.
>> >> >> >> > > >>
>> >> >> >> > > >> Thanks,
>> >> >> >> > > >>
>> >> >> >> > > >> Alessandro
>> >> >> >> > > >>
>> >> >> >> > >
>> >> >> >> > >
>> >> >> >> >
>> >> >> >> >
>> >> >> >> > --
>> >> >> >> >    Claudio Martella
>> >> >> >> >    claudio.martella@gmail.com
>> >> >> >> >
>> >> >> >>
>> >> >> >
>> >> >> >
>> >> >> >
>> >> >> >--
>> >> >> >
>> >> >> >  -jake
>> >> >>
>> >> >>
>> >> >
>> >> >
>> >> >--
>> >> >
>> >> >  -jake
>> >>
>> >>
>> >
>> >
>> >--
>> >   Claudio Martella
>> >   claudio.martella@gmail.com
>>
>>
>
>
>-- 
>   Claudio Martella
>   claudio.martella@gmail.com


Mime
View raw message