incubator-giraph-dev mailing list archives

Site index · List index
Message view « Date » · « Thread »
Top « Date » · « Thread »
From Claudio Martella <>
Subject Re: why we should remove implicit vertex creation
Date Tue, 17 Jan 2012 01:23:39 GMT
Hi Avery,

I inline the answers to your concerns.

On Fri, Jan 13, 2012 at 8:05 PM, Avery Ching <> wrote:
> Inline responses.
> Happy Friday,
> Avery
> On 1/13/12 10:51 AM, Claudio Martella wrote:
>> Hi Avery,
>> thanks for your feedback.
>> I'm advocating for allowing mutations only through Mutable interface
>> methods. I agree that it can come handy to have the implicit vertex
>> creation, for the reason you mentioned (which I called lazy inputset
>> creation), but you can obtain the same through a simple  single M/R
>> job run in advance.
> I think this is pretty expensive (extra MR job).  Users do have this option,
> but I doubt many would take it when they don't have to.

Agreed, but if a graph is used multiple times, so if it's the input to
multiple giraph jobs, the cost can be amortized quite a lot. You spend
once to "complete the graph", you take advantage multiple times.
Though I agree, the line here is quite fuzzy and I wouldn't haven
considered it if it wasn't for the next point.

>> What we win back is that we don't have the
>> computational cost, and code complexity of checking if the vertex
>> exists already for each message we get.
> Checking if the vertex exists is pretty cheap in a hashmap (constant time).
>  We should verify that this is a computational overhead (maybe some
> profiling) before optimizing it.  I suppose we could add a switch to bypass
> any graph mutation in general.

True, but you've taken the best-case scenario. If you use
range-partitioning then you hit a Tree. If we use another type of
partitioning it might be even worse (think if the partitioning was
held in a centralized store, maybe hbase, for topology partitioning).
This for the  vertex2partition function, which is not the expensive
part, I guess.

More than that I'm concerned about the exists() call to the vertex set
as I'm thinking about sorting the vertex set, to allow for better
iteration over out-of-core messages to completely avoid seeks. That
would make the exists() quite expensive. I guess we can re-open the
discussion once that code is ready.
About that, did you have any chance to have a look at my email on the
thread-safety of org.apache.giraph.graph.partition?


>> You know what I mean?
>> On Fri, Jan 13, 2012 at 7:44 PM, Avery Ching<>  wrote:
>>> Claudio,
>>> What are you advocating in particular?
>>> Graph mutation should be allowed (i.e. adding vertices).  We allow this
>>> to
>>> happen through the addVertexReq() interface and through the
>>> VertexResolver
>>> implementation (say for messages to non-existent vertices).  I can see
>>> why
>>> this would be useful.  Imagine you are computing page rank on the web
>>> graph,
>>> but you only have a subset of the sites, but all the outlinks for each
>>> site.
>>>  It is nice to be able to allow new vertices (sites) while running the
>>> application.
>>> I agree that the way that vertices are created and initialized is a bit
>>> vague.  We can work on improving the interfaces if anyone has
>>> suggestions.
>>> Avery
>>> On 1/13/12 12:10 AM, Claudio Martella wrote:
>>>> Hi Avery,
>>>> thanks for your feedback. I know that users can decide to drop this
>>>> behavior, but this doesn't mean that those three points don't hold, to
>>>> me.
>>>> On Fri, Jan 13, 2012 at 8:35 AM, Avery Ching<>
>>>>  wrote:
>>>>> Claudio,
>>>>> You are right that vertices are created automatically when messages are
>>>>> sent
>>>>> to non-existent vertices.  But that behavior can be made application
>>>>> specific.  The default resolution of mutations/messages is
>>>>> VertexResolver.
>>>>>  But you are always welcome to implement your own application specific
>>>>> behavior.  For instance, you might just want to drop the message.  If
>>>>> there
>>>>> is a simultaneous create/delete, you may want to always create.  You
>>>>> have
>>>>> the power to implement any behavior you want by setting the vertex
>>>>> resolver
>>>>> (see GiraphJob#setVertexResolverClass()).
>>>>> Hope this helps,
>>>>> Avery
>>>>> On 1/12/12 3:42 PM, Claudio Martella wrote:
>>>>>> Hello Giraphers,
>>>>>> I have a few comments about the current design of Giraph regarding
>>>>>> implicit creation of vertices.
>>>>>> As it's currently designed, if you send a message to a non-existent
>>>>>> vertices, Giraph creates it for you.
>>>>>> Although I can understand it can get handy as it allows for lazy
>>>>>> dataset creation, I think it comes at some cost and I believe this
>>>>>> cost is bigger than the advantage:
>>>>>> 1) it overlaps the mutation API, where a vertex can be created
>>>>>> explicitly when the semantics of the algorithm require it, with
>>>>>> knowledge about what's going on and with explicit state. This is
>>>>>> ambiguous and unclear part of the API which is difficult for me to
>>>>>> justify and probably confusing for the user too. Which brings me
>>>>>> the second point.
>>>>>> 2) it requires a different, and partially duplicate,code path for
>>>>>> mutations and implicit vertex creation in our code, as it's clear
>>>>>> looking at BasicRPCCommunication and as it's been experienced
>>>>>> currently by me in the email I recently sent to the list. Which brings
>>>>>> me to the third point.
>>>>>> 3) in order to manage this, for every message we have to hit, sooner
>>>>>> or later, the Worker vertices set to see if the vertex is existing
>>>>>> whether it should be implicitly created. This is computationally
>>>>>> expensive both if you have a HashMap but also if you have a TreeMap
>>>>>> for range partitioning. Also, if we're going to create more exotic
>>>>>> partitioning (topology-partitioning?), we're going to hit the problem
>>>>>> more.
>>>>>> In general, I don't know any graph API that doesn't require to either
>>>>>> list explicitly the vertex set at load or to create the vertex
>>>>>> explicitly through API. As I said, I understand it allows for lazy
>>>>>> creation of the input file, with possibly missing vertices explicitly
>>>>>> enlisted (missing as a source vertex but existing as an endpoint
>>>>>> an edge), but this could be really fixed robustly by a single
>>>>>> MapReduce job.
>>>>>> What do you guys think?

   Claudio Martella

View raw message