giraph-dev mailing list archives

Site index · List index
Message view « Date » · « Thread »
Top « Date » · « Thread »
From Alessandro Presta <alessan...@fb.com>
Subject Re: LongDoubleFloatDoubleVertex
Date Fri, 01 Mar 2013 21:54:06 GMT
:-)

Sent from my iPhone

On Mar 1, 2013, at 1:39 PM, "Claudio Martella" <claudio.martella@gmail.com> wrote:

> You're totally right, I confused getEdgeValue(Vertex) with Vertex
> getEdgeByValue(E).
> In that case a user would have to index edges by value themselves.
> 
> You convinced me.
> 
> 
> On Fri, Mar 1, 2013 at 7:53 PM, Alessandro Presta <alessandro@fb.com> wrote:
> 
>> On 3/1/13 1:01 AM, "Claudio Martella" <claudio.martella@gmail.com> wrote:
>> 
>>> Imagine you're infering an "is_a" property. e.g. you have lion - is_a ->
>>> mammal - is_a -> animal. by propagation you'd infer lion - is_a -> animal.
>>> Clearly, i'm talking about a graph that does not only have is_a typed
>>> edges.
>>> 
>>> Two things can be noted: (1) one could filter out non is_a edges directly
>>> at loading in this case but (2) you'd want more complex of these, e.g. a
>>> path (a concatenation of is_a with lives_in or something?) to find out at
>>> query time or something.
>> 
>> What would be the vertex id in this case? I don't think it would be "is_a"
>> or "lives_in", right? That would probably be in the edge value.
>> Also, wouldn't you still need to read all the edges for this type of
>> application?
>> 
>>> 
>>> This is one example. I think most of our algorithms out there touch all
>>> the
>>> edges and iterate over (why else using such a big framework and loading
>>> the
>>> whole graph at first place?), so I'd optimise for them or avoid penalising
>>> them, but I would not remove from the API random access to edges either.
>>> After all, the idea is to have them implement their (to come) Edges class.
>> 
>> The problem is that with every method we add to the API we increase the
>> burden on the user.
>> Also, more requirements = more constraints = less opportunities for
>> optimization.
>> 
>> getEdgeValue() poses a few different issues:
>> - it will be slow in most implementations
>> - if we want it to at least be as space-efficient as getEdges(), it would
>> have to reuse the E object
>> - the signature for general graphs would have to be Iterable<E>
>> getEdgeValues(I); in order to have an easier interface for strict graphs,
>> we would have to make the inheritance structure more complex (adding
>> something like a StrictVertex)
>> 
>> That said, nothing prevents the user from adding that functionality
>> himself to VertexEdges, and then in Vertex casting getEdges() to the known
>> VertexEdges implementation, and call whatever method (getEdgeValue(),
>> whatnot) on it.
>> 
>>> 
>>> On Fri, Mar 1, 2013 at 9:37 AM, Alessandro Presta <alessandro@fb.com>
>>> wrote:
>>> 
>>>> On 3/1/13 12:33 AM, "Claudio Martella" <claudio.martella@gmail.com>
>>>> wrote:
>>>> 
>>>>> On Fri, Mar 1, 2013 at 5:13 AM, Alessandro Presta <alessandro@fb.com>
>>>>> wrote:
>>>>> 
>>>>>> On 2/28/13 5:08 PM, "Jake Mannix" <jake.mannix@gmail.com> wrote:
>>>>>> 
>>>>>>> On Thu, Feb 28, 2013 at 4:18 PM, Alessandro Presta
>>>>>>> <alessandro@fb.com>wrote:
>>>>>>> 
>>>>>>>> It's not like it causes problems, it's just that it's a pretty
big
>>>>>>>> dependency to justify for a small example.
>>>>>>>> 
>>>>>>>> As for the motivation, if your point is to prove the framework's
>>>>>>>> superiority in some context, then you can use the simplest
>>>> possible
>>>>>>>> implementation (ArrayList).
>>>>>>> 
>>>>>>> This takes up LOTS of memory.  Primitives rule, objects drool
>>>> (memory).
>>>>>> 
>>>>>> Ok, but having to copy the keys and values to external arrays (which
>>>> is
>>>>>> what you have to do with Mahout's hash map) is even worse. A good
>>>>>> implementation of (long, double) edges (e.g. for RandomWalkVertex)
is
>>>>>> primitive arrays.
>>>>>> 
>>>>>>> 
>>>>>>> 
>>>>>>>> The Giraph framework is all about iterating over edges, so
an
>>>>>>>> implementation that doesn't support that with reasonable
>>>> efficiency
>>>>>>>> doesn't make a lot of sense to me.
>>>>>>>> 
>>>>>>>> It also follows that hash map-based implementations only
make
>>>> sense
>>>>>> for
>>>>>>>> algorithms that make use of mutations.
>>>>>>> 
>>>>>>> Agreed, maps aren't absolutely necessary for immutable graphs.
 But
>>>>>> random
>>>>>>> access to collections of integers _is_ necessary for many
>>>> algorithms.
>>>>>>> It's
>>>>>>> not all just iteration over lists.
>>>>>> 
>>>>>> Can you give me some concrete examples of algorithms where random
>>>> access
>>>>>> to the edges is required?
>>>>>> I'm really interested in this, because I'm considering killing
>>>>>> getEdgeValue() if there are no use cases.
>>>>> 
>>>>> My example would be typed/labeled graphs (e.g. RDF) for reasoning and
>>>>> traversals based on particular patterns/path.
>>>> 
>>>> Can you be a bit more specific? I'm interested in knowing if this
>>>> algorithm would call getEdgeValue(id) for some target vertex id.
>>>> 
>>>>> 
>>>>>> 
>>>>>>> 
>>>>>>> 
>>>>>>>> That said, something like a Trove hash map would probably
be more
>>>>>>>> appropriate (more efficient than the standard Java HashMap,
at the
>>>>>>>> expense
>>>>>>>> of generality). That could be a good candidate for a
>>>>>>>> LongDoubleHashMapEdges implementation.
>>>>>>>> I can give that a shot if it sounds good.
>>>>>>> 
>>>>>>> Trove is LGPL, IIRC, so that doesn't work in Apache projects.
>>>>>> 
>>>>>> Whoops, totally missed that part.
>>>>>> 
>>>>>>> 
>>>>>>> What does Giraph depend on of Mahout now?  Just mahout-collections?
>>>>>>> That's
>>>>>>> not a very big dependency, and has all sorts of
>>>> primitive-to-primitive
>>>>>>> collections,
>>>>>>> and from what I've seen benchmarked, just as good or better than
>>>> Trove.
>>>>>>> Carrot2's
>>>>>>> hppc may be better yet, but I'm not sure if that is stably released
>>>>>> yet.
>>>>>> 
>>>>>> I'll take a look at HPPC and Mahout collections then. They all seem
>>>> to
>>>>>> provide the same stuff, so I'll consider benchmarks and convenience
>>>> of
>>>>>> the
>>>>>> API. Thanks for the pointers.
>>>>>> 
>>>>>>> 
>>>>>>> 
>>>>>>>> 
>>>>>>>> On 2/28/13 4:03 PM, "Jake Mannix" <jake.mannix@gmail.com>
wrote:
>>>>>>>> 
>>>>>>>>> Is the mahout dependency causing problems?
>>>>>>>>> 
>>>>>>>>> It would be nice if we could actually implement some
of the
>>>>>> algorithms
>>>>>>>>> that
>>>>>>>>> Mahout does via map-reduce in Giraph's BSP formalism,
to show off
>>>>>> how
>>>>>>>> it
>>>>>>>>> improves things.  Using the Mahout primitives can show
that it's
>>>> not
>>>>>>>> about
>>>>>>>>> the inner loop implementation, but the framework itself...
>>>>>>>>> 
>>>>>>>>> 
>>>>>>>>> On Thu, Feb 28, 2013 at 1:55 PM, Eli Reisman
>>>>>>>>> <apache.mailbox@gmail.com>wrote:
>>>>>>>>> 
>>>>>>>>>> I like the idea of refactoring it into something
more
>>>> appropriate
>>>>>>>> for us
>>>>>>>>>> and ditching the Mahout dep. Good looking out.
>>>>>>>>>> 
>>>>>>>>>> 
>>>>>>>>>> On Thu, Feb 28, 2013 at 10:15 AM, Claudio Martella
<
>>>>>>>>>> claudio.martella@gmail.com> wrote:
>>>>>>>>>> 
>>>>>>>>>>> I agree, at this point we could have a RandomWalkVertex
with
>>>>>> edge
>>>>>>>>>> values,
>>>>>>>>>>> and a "null-edged" vertex for the PR benchmarks.
>>>>>>>>>>> We make everybody happy and avoid code duplication.
>>>>>>>>>>> 
>>>>>>>>>>> 
>>>>>>>>>>> On Thu, Feb 28, 2013 at 7:12 PM, Alessandro Presta
>>>>>>>> <alessandro@fb.com
>>>>>>>>>>>> wrote:
>>>>>>>>>>> 
>>>>>>>>>>>> Hi Gianmarco,
>>>>>>>>>>>> 
>>>>>>>>>>>> Yes, there will be more efficient implementations.
>>>>>>>>>>>> In the redesign I'm working on (GIRAPH-528),
there will be
>>>>>> only
>>>>>>>> one
>>>>>>>>>>> Vertex
>>>>>>>>>>>> class and edge storage is delegated to a
VertexEdges class.
>>>>>>>>>>>> So far I'm adding some generic implementations
>>>>>> (ByteArrayEdges,
>>>>>>>>>>>> ArrayListEdges, HashMapEdges) that work for
all types, and
>>>>>> some
>>>>>>>>>> optimized
>>>>>>>>>>>> ones (LongDoubleArrayEdges, LongNullArrayEdges).
>>>>>>>>>>>> 
>>>>>>>>>>>> Do you specifically need edge values to be
float while the
>>>>>> other
>>>>>>>>>> types
>>>>>>>>>>> are
>>>>>>>>>>>> double?
>>>>>>>>>>>> It seems to me it would make sense to change
>>>> RandomWalkVertex
>>>>>> to
>>>>>>>> use
>>>>>>>>>>>> double edge values instead, and avoid code
duplication
>>>> (i.e.
>>>>>>>> adding
>>>>>>>>>> a
>>>>>>>>>>>> LongFloatArrayEdges that's basically the
same). We're not
>>>>>> Trove
>>>>>>>>>> after
>>>>>>>>>>> all.
>>>>>>>>>>>> Makes sense?
>>>>>>>>>>>> 
>>>>>>>>>>>> Thanks for the feedback,
>>>>>>>>>>>> 
>>>>>>>>>>>> Alessandro
>>>>>>>>>>>> 
>>>>>>>>>>>> 
>>>>>>>>>>>> On 2/28/13 1:54 AM, "Gianmarco De Francisci
Morales"
>>>>>>>>>> <gdfm@apache.org>
>>>>>>>>>>>> wrote:
>>>>>>>>>>>> 
>>>>>>>>>>>>> Hi,
>>>>>>>>>>>>> 
>>>>>>>>>>>>> Maybe the specific implementation can
be thrown away, but
>>>>>>>>>> personally I
>>>>>>>>>>>>> feel
>>>>>>>>>>>>> very strongly for the need of a good
LongDoubleFloatDouble
>>>>>>>> vertex.
>>>>>>>>>>>>> It's the base for any serious random
walk algorithm.
>>>>>>>>>>>>> 
>>>>>>>>>>>>> I would call for a refactoring rather
than a removal.
>>>>>>>>>>>>> 
>>>>>>>>>>>>> Just my 2c.
>>>>>>>>>>>>> 
>>>>>>>>>>>>> Cheers,
>>>>>>>>>>>>> 
>>>>>>>>>>>>> --
>>>>>>>>>>>>> Gianmarco
>>>>>>>>>>>>> 
>>>>>>>>>>>>> 
>>>>>>>>>>>>> On Thu, Feb 28, 2013 at 7:54 AM, Alessandro
Presta
>>>>>>>>>>>>> <alessandro@fb.com>wrote:
>>>>>>>>>>>>> 
>>>>>>>>>>>>>> Hi all,
>>>>>>>>>>>>>> 
>>>>>>>>>>>>>> Does anyone feel strongly for
>>>> LongDoubleFloatDoubleVertex?
>>>>>>>>>>>>>> Reasons why I think it should be
removed:
>>>>>>>>>>>>>> 
>>>>>>>>>>>>>>  1.  Right now it's incorrect (returns
target vertex
>>>> id as
>>>>>>>> edge
>>>>>>>>>>> value).
>>>>>>>>>>>>>>  2.  Iteration will always be inefficient,
since the
>>>>>>>> underlying
>>>>>>>>>>> Mahout
>>>>>>>>>>>>>> open-addressing hash map implementation
doesn't provide
>>>>>>>>>> iterators.
>>>>>>>>>> It
>>>>>>>>>>>>>> provides a way to copy the keys and
values to external
>>>>>>>>>> arrays/lists.
>>>>>>>>>>>>>>  3.  It's the only reason why we
have Mahout as a
>>>>>> dependency.
>>>>>>>>>>>>>> 
>>>>>>>>>>>>>> I think we should strive to provide
model
>>>> implementations
>>>>>> that
>>>>>>>>>> are
>>>>>>>>>>>>>> generic
>>>>>>>>>>>>>> and/or extremely efficient. This
one satisfies neither.
>>>>>>>>>>>>>> 
>>>>>>>>>>>>>> Thanks,
>>>>>>>>>>>>>> 
>>>>>>>>>>>>>> Alessandro
>>>>>>>>>>> 
>>>>>>>>>>> 
>>>>>>>>>>> --
>>>>>>>>>>>   Claudio Martella
>>>>>>>>>>>   claudio.martella@gmail.com
>>>>>>>>> 
>>>>>>>>> 
>>>>>>>>> 
>>>>>>>>> --
>>>>>>>>> 
>>>>>>>>> -jake
>>>>>>> 
>>>>>>> 
>>>>>>> --
>>>>>>> 
>>>>>>> -jake
>>>>> 
>>>>> 
>>>>> --
>>>>>  Claudio Martella
>>>>>  claudio.martella@gmail.com
>>> 
>>> 
>>> --
>>>  Claudio Martella
>>>  claudio.martella@gmail.com
> 
> 
> -- 
>   Claudio Martella
>   claudio.martella@gmail.com

Mime
View raw message