incubator-giraph-dev mailing list archives

Site index · List index
Message view « Date » · « Thread »
Top « Date » · « Thread »
From Claudio Martella <claudio.marte...@gmail.com>
Subject Re: on the semantics of the combiner
Date Fri, 13 Jan 2012 17:34:58 GMT
Ok,

I guess we can vote then about this, what do you think?
Shall we take 72h?

I'm +1 for returning an iterable that can be empty.
I'm +1 for the returned iterable to be <= messages.size()


On Tue, Jan 10, 2012 at 9:48 PM, Sebastian Schelter <ssc@apache.org> wrote:
> I think we should make the combiner return a list/iterable that can
> potentially be empty. However we should assume that the number of
> elements returned is smaller than or equal to the number of input
> elements (whats the use of a combiner if this is not given?). I also
> concur that the code should not depend on the combiner being applied
> (similar to the way combiners work in hadoop).
>
> --sebastian
>
> 2012/1/10 Jakob Homan <jghoman@gmail.com>:
>> A composite object would essentially be a wrapper around a list and
>> introduce the need for all vertices to be ready to extract that list
>> at all times.  For instance, a combiner passed 10 messages may be able
>> to combine 7 of them but do nothing with the other three, leaving four
>> messages.  If we allow zero or one return elements, the combiner would
>> have to create a composite object with a list of those four messages,
>> whereas if we return a list, it just skips that step and returns the
>> four messages.  Additionally, the receiving vertex would have to
>> handle the possibility of a composite object every time even though
>> the combiner may or may not have been run during the superstep, or
>> even included in that job (since combiners are optional to the job
>> itself).  It would be better if one could write a Giraph application
>> that was completely agnostic of whether or not a combiner was
>> included.
>>
>> On Tue, Jan 10, 2012 at 12:00 PM, Claudio Martella
>> <claudio.martella@gmail.com> wrote:
>>> I believe the argument of not letting users shoot their foot doesn't
>>> stand :) Once you give them any API they have the power to do anything
>>> wrong, as they already can with Giraph (or anything else for what it
>>> matters), by designing an algorithm wrongly (which would be what it
>>> would turn out to be a wrong combiner). It's definitely true that a
>>> composite object would make the grouping (List<Group>) but I thought
>>> we were talking about simplifying life to users :). I think it would
>>> be more flexible (for the present and for the future) and also more
>>> elegant,  but not necessarily a must (although it'd come practically
>>> for free).
>>>
>>> Very cool discussion.
>>>
>>> On Tue, Jan 10, 2012 at 8:30 PM, Jakob Homan <jghoman@gmail.com> wrote:
>>>>> Combiners can only modify the messages sent to a single vertex, so they
can't send messages to other vertices.
>>>> Yeah, the more I've thought about this, the more problematic it would
>>>> be.  These new messages may be generated upon arrival at the
>>>> destination vertex (since combiners can be run on the receiving vertex
>>>> before processing as well).  When would they be forwarded to their new
>>>> destinations at that point?  It would be possible to get into a
>>>> feedback loop of messages jumping around before a superstep could ever
>>>> actually be done.
>>>>
>>>> That being said, our inability to think of a good application doesn't
>>>> mean there won't be one in the future, and it's probably better to be
>>>> more flexible than try to impose what appears optimal now.  The
>>>> benefit of forcing 0 or 1 message from a combiner seems less than the
>>>> flexibility of allowing another list of messages (which may or may not
>>>> be the same number of elements as the original, less than, or even
>>>> more than).
>>>>
>>>>>Good discussion (it's making me really think about this)!
>>>> Agreed.
>>>>
>>>>
>>>> On Tue, Jan 10, 2012 at 11:23 AM, Avery Ching <aching@apache.org> wrote:
>>>>> The general idea of combiners is to reduce the number of messages sent.
>>>>>  Combiners are purely an optimization and the application should work
>>>>> correctly without it (since it's never guaranteed to actually be called).
>>>>>  Combiners can only modify the messages sent to a single vertex, so
they
>>>>> can't send messages to other vertices.  Any other work (i.e. sending
>>>>> messages) should be done by the vertex in the compute() method.
>>>>>
>>>>> While I think that grouping behavior could actually be implemented within
a
>>>>> message object (still reducing the number of messages to 1 or 0) I suppose
>>>>> that in some simple cases (i.e. grouping), it might be easier by doing
it in
>>>>> the combiner as you both have mentioned?  The only thing I suppose I'm
>>>>> concerned about is letting users do something that is not optimal.
>>>>>  Generally, expanding messages is not what you want your combiner to
do.
>>>>>  Also, since grouping behavior can be implemented in the message object,
it
>>>>> forces users to avoid shooting themselves in the foot.
>>>>>
>>>>> Good discussion (it's making me really think about this)!
>>>>>
>>>>> Avery
>>>>>
>>>>>
>>>>> On 1/10/12 10:32 AM, Claudio Martella wrote:
>>>>>>
>>>>>> Ok, now i see where you're going. I guess that the thing here is
that
>>>>>> the combiner would "act" like (on its behalf) D, and to do so
>>>>>> concretely it would probably need some local data related to D (edges
>>>>>> values? vertexvalue?).
>>>>>> I also think that k>  n is also possible in principle and we
could let
>>>>>> the user decide whether to use this power or not, once/if we agree
>>>>>> that letting the user send k messages in the combiner is useful (and
>>>>>> the grouping behavior shown by the label propagation example should
do
>>>>>> so).
>>>>>>
>>>>>> On Tue, Jan 10, 2012 at 7:04 PM, Jakob Homan<jghoman@gmail.com>
 wrote:
>>>>>>>
>>>>>>> Those two messages would have gone to D, been expanded to, say,
4,
>>>>>>> which would have then then been sent to, say, M.  This would
save the
>>>>>>> sending of the two to D and send the 4 directly to M.  I'm not
saying
>>>>>>> it's a great example, but it is legal.  This is of course assuming
>>>>>>> that combiners can generate messages bound for vertices other
than the
>>>>>>> original destination, which I don't know if that has even been
>>>>>>> discussed.
>>>>>>>
>>>>>>> On Tue, Jan 10, 2012 at 9:49 AM, Claudio Martella
>>>>>>> <claudio.martella@gmail.com>  wrote:
>>>>>>>>
>>>>>>>> i'm not sure i understand what you'd save here. if the two
messages
>>>>>>>> were going to be expanded to k messages on the destination
worker D,
>>>>>>>> but you expand them on W, you end up sending k messages instead
of 2.
>>>>>>>> right?
>>>>>>>>
>>>>>>>> On Tue, Jan 10, 2012 at 6:26 PM, Jakob Homan<jghoman@gmail.com>
 wrote:
>>>>>>>>>>
>>>>>>>>>> it doesn't have to be expand, k, the number of elements
returned by
>>>>>>>>>> the combiner, can still be smaller than n,
>>>>>>>>>
>>>>>>>>> Right.  Grouping would be the most common case.  It
would be possible
>>>>>>>>> to be great than k, as well.  For instance, consider
two messages,
>>>>>>>>> both generated on the same worker (W) by two two different
vertices,
>>>>>>>>> both bound for another vertex, Z.  A combiner on W could
get both of
>>>>>>>>> these messages, do some work on them, as it would have
knowledge of
>>>>>>>>> both, and generate some arbitrary number of messages
bound for other
>>>>>>>>> vertices (thus saving the shuffle/transfer of the original
messages).
>>>>>>>>>
>>>>>>>>>
>>>>>>>>> On Tue, Jan 10, 2012 at 12:08 AM, Claudio Martella
>>>>>>>>> <claudio.martella@gmail.com>  wrote:
>>>>>>>>>>
>>>>>>>>>> it doesn't have to be expand, k, the number of elements
returned by
>>>>>>>>>> the combiner, can still be smaller than n, the size
of the messages
>>>>>>>>>> parameter. as a first example, you can imagine your
vertex receiving
>>>>>>>>>> semantically-different classes/types of messages,
and you can imagine
>>>>>>>>>> willing to be summarizing them in different messages,
i.e. if your
>>>>>>>>>> messages come along with labels or just simply by
the source vertex,
>>>>>>>>>> if required by the algorithm, think of label propagation
to have just
>>>>>>>>>> an example, or some sort of labeled-pagerank.
>>>>>>>>>>
>>>>>>>>>> On Tue, Jan 10, 2012 at 3:05 AM, Avery Ching<aching@apache.org>
>>>>>>>>>>  wrote:
>>>>>>>>>>>
>>>>>>>>>>> I agree that C&A doesn't require it, however,
I can't think of why I
>>>>>>>>>>> would
>>>>>>>>>>> want to use a combiner to expand the number of
messages.  Can you?
>>>>>>>>>>>
>>>>>>>>>>> Avery
>>>>>>>>>>>
>>>>>>>>>>>
>>>>>>>>>>> On 1/9/12 3:57 PM, Jakob Homan wrote:
>>>>>>>>>>>>>
>>>>>>>>>>>>> In my opinion that means reducing to
a single message or none at
>>>>>>>>>>>>> all.
>>>>>>>>>>>>
>>>>>>>>>>>> C&A doesn't require this, however.  Hadoop's
combiner interface, for
>>>>>>>>>>>> instance, doesn't require a single  or no
value to be returned; it
>>>>>>>>>>>> has
>>>>>>>>>>>> the same interface as a reducer, zero or
more values.  Would
>>>>>>>>>>>> adapting
>>>>>>>>>>>> the semantics of Giraph's combiner to return
a list of messages
>>>>>>>>>>>> (possibly empty) make it more useful?
>>>>>>>>>>>>
>>>>>>>>>>>> On Mon, Jan 9, 2012 at 3:21 PM, Claudio Martella
>>>>>>>>>>>> <claudio.martella@gmail.com>    wrote:
>>>>>>>>>>>>>
>>>>>>>>>>>>> Yes, what is you say is completely reasonable,
you convinced me :)
>>>>>>>>>>>>>
>>>>>>>>>>>>> On Mon, Jan 9, 2012 at 11:28 PM, Avery
Ching<aching@apache.org>
>>>>>>>>>>>>>  wrote:
>>>>>>>>>>>>>>
>>>>>>>>>>>>>> Combiners should be commutative and
associative.  In my opinion
>>>>>>>>>>>>>> that
>>>>>>>>>>>>>> means
>>>>>>>>>>>>>> reducing to a single message or none
at all.  Can you think of a
>>>>>>>>>>>>>> case
>>>>>>>>>>>>>> when
>>>>>>>>>>>>>> more than 1 message should be returned
from a combiner?  I know
>>>>>>>>>>>>>> that
>>>>>>>>>>>>>> returning null isn't preferable in
general, but I think that
>>>>>>>>>>>>>> functionality
>>>>>>>>>>>>>> (returning no messages), is nice
to have and isn't a huge amount
>>>>>>>>>>>>>> of work
>>>>>>>>>>>>>> on
>>>>>>>>>>>>>> our side.
>>>>>>>>>>>>>>
>>>>>>>>>>>>>> Avery
>>>>>>>>>>>>>>
>>>>>>>>>>>>>>
>>>>>>>>>>>>>> On 1/9/12 12:13 PM, Claudio Martella
wrote:
>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>> To clarify, I was not discussing
the possibility for combine to
>>>>>>>>>>>>>>> return
>>>>>>>>>>>>>>> null. I see why it would be useful,
given that combine returns M,
>>>>>>>>>>>>>>> there's no other way to let combiner
ask not to send any message,
>>>>>>>>>>>>>>> although i agree with Jakob,
I also believe returning null should
>>>>>>>>>>>>>>> be
>>>>>>>>>>>>>>> avoided but only used, roughly,
as an init value for a
>>>>>>>>>>>>>>> reference/pointer.
>>>>>>>>>>>>>>> Perhaps, we could, but i'm just
thinking out loud here, let
>>>>>>>>>>>>>>> combine()
>>>>>>>>>>>>>>> return Iterable<M>, basicallly
letting it define what to combine
>>>>>>>>>>>>>>> to
>>>>>>>>>>>>>>> ({0, 1, k } messages). It would
be a powerful extension to the
>>>>>>>>>>>>>>> model,
>>>>>>>>>>>>>>> but maybe it's too much.
>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>> As far as the size of the messages
parameter, I agree with you
>>>>>>>>>>>>>>> that 0
>>>>>>>>>>>>>>> messages gives nothing to combine
and it would be somehow
>>>>>>>>>>>>>>> awkward, it
>>>>>>>>>>>>>>> was more a matter of synching
it with the other methods getting
>>>>>>>>>>>>>>> the
>>>>>>>>>>>>>>> messages parameter.
>>>>>>>>>>>>>>> Probably, having a more clear
javadoc will do the job here.
>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>> What do you think?
>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>> On Mon, Jan 9, 2012 at 8:42 PM,
Jakob Homan<jghoman@gmail.com>
>>>>>>>>>>>>>>>  wrote:
>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>> I'm not a big fan of returning
null as it adds extra complexity
>>>>>>>>>>>>>>>> to the
>>>>>>>>>>>>>>>> calling code (null checks,
or not, since people usually will
>>>>>>>>>>>>>>>> forget
>>>>>>>>>>>>>>>> them).  Avery is correct
that combiners are application
>>>>>>>>>>>>>>>> specific.  Is
>>>>>>>>>>>>>>>> it conceivable that one would
want to write a combiner that
>>>>>>>>>>>>>>>> returned
>>>>>>>>>>>>>>>> something for an input of
no parameters, ie combining the empty
>>>>>>>>>>>>>>>> list
>>>>>>>>>>>>>>>> doesn't return the empty
list?  I imagine for most combiners,
>>>>>>>>>>>>>>>> combining a single message
would result in that message.
>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>> On Mon, Jan 9, 2012 at 11:28
AM, Avery Ching<aching@apache.org>
>>>>>>>>>>>>>>>>  wrote:
>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>> The javadoc for VertexCombiner#combine()
is
>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>  /**
>>>>>>>>>>>>>>>>>   * Combines message
values for a particular vertex index.
>>>>>>>>>>>>>>>>>   *
>>>>>>>>>>>>>>>>>   * @param vertexIndex
Index of the vertex getting these
>>>>>>>>>>>>>>>>> messages
>>>>>>>>>>>>>>>>>   * @param msgList List
of the messages to be combined
>>>>>>>>>>>>>>>>>   * @return Message
that is combined from {@link MsgList} or
>>>>>>>>>>>>>>>>> null if
>>>>>>>>>>>>>>>>> no
>>>>>>>>>>>>>>>>>   *         message
it to be sent
>>>>>>>>>>>>>>>>>   * @throws IOException
>>>>>>>>>>>>>>>>>   */
>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>> I think we are somewhat
vague on what a combiner can return to
>>>>>>>>>>>>>>>>> support
>>>>>>>>>>>>>>>>> various use cases.  A
combiner should be particular to a
>>>>>>>>>>>>>>>>> particular
>>>>>>>>>>>>>>>>> compute() algorithm.
 I think it should be legal to return null
>>>>>>>>>>>>>>>>> from
>>>>>>>>>>>>>>>>> a
>>>>>>>>>>>>>>>>> combiner, in that case,
no message should be sent to that
>>>>>>>>>>>>>>>>> vertex.
>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>> It seems like it would
be an overhead to call a combiner when
>>>>>>>>>>>>>>>>> there
>>>>>>>>>>>>>>>>> are
>>>>>>>>>>>>>>>>> 0
>>>>>>>>>>>>>>>>> messages.  I can't see
a case where that would be useful.
>>>>>>>>>>>>>>>>>  Perhaps we
>>>>>>>>>>>>>>>>> should
>>>>>>>>>>>>>>>>> change the javadoc to
insure that msgList must contain at least
>>>>>>>>>>>>>>>>> one
>>>>>>>>>>>>>>>>> message
>>>>>>>>>>>>>>>>> to have combine() being
called.
>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>> Avery
>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>> On 1/9/12 5:37 AM, Claudio
Martella wrote:
>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>> Hi Sebastian,
>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>> yes, that was my
point, I agree completely with you.
>>>>>>>>>>>>>>>>>> Fixing my test was
not the issue, my question was whether we
>>>>>>>>>>>>>>>>>> want to
>>>>>>>>>>>>>>>>>> define explicitly
the semantics of this scenario.
>>>>>>>>>>>>>>>>>> Personally, I believe
the combiner should be ready to receive
>>>>>>>>>>>>>>>>>> 0
>>>>>>>>>>>>>>>>>> messages, as it's
the case of BasicVertex::initialize(),
>>>>>>>>>>>>>>>>>> putMessages()
>>>>>>>>>>>>>>>>>> and compute(), and
act accordingly.
>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>> In the particular
example, I believe the SimpleSumCombiner is
>>>>>>>>>>>>>>>>>> bugged.
>>>>>>>>>>>>>>>>>> It's true that the
sum of no values is 0, but it's also true
>>>>>>>>>>>>>>>>>> that
>>>>>>>>>>>>>>>>>> the
>>>>>>>>>>>>>>>>>> null return semantics
of combine() is more suitable for this
>>>>>>>>>>>>>>>>>> exact
>>>>>>>>>>>>>>>>>> situation.
>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>> On Mon, Jan 9, 2012
at 2:21 PM, Sebastian
>>>>>>>>>>>>>>>>>> Schelter<ssc@apache.org>
>>>>>>>>>>>>>>>>>>  wrote:
>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>> I think we currently
implicitly assume that there is at least
>>>>>>>>>>>>>>>>>>> one
>>>>>>>>>>>>>>>>>>> element in the
Iterable passed to the combiner. The messaging
>>>>>>>>>>>>>>>>>>> code
>>>>>>>>>>>>>>>>>>> only
>>>>>>>>>>>>>>>>>>> invokes the combiner
only if at least one message for the
>>>>>>>>>>>>>>>>>>> target
>>>>>>>>>>>>>>>>>>> vertex
>>>>>>>>>>>>>>>>>>> has been sent.
>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>> However, we should
not rely on implicit implementation
>>>>>>>>>>>>>>>>>>> details but
>>>>>>>>>>>>>>>>>>> explicitly specify
the semantics of combiners.
>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>> --sebastian
>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>> On 09.01.2012
13:29, Claudio Martella wrote:
>>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>> Hello list,
>>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>> for GIRAPH-45
I'm touching the incoming messages and hit an
>>>>>>>>>>>>>>>>>>>> interesting
problem with the combiner semantics.
>>>>>>>>>>>>>>>>>>>> currently,
my code fails testBspCombiner for the following
>>>>>>>>>>>>>>>>>>>> reason:
>>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>> SimpleSumCombiner::compute()
returns a value even if there
>>>>>>>>>>>>>>>>>>>> are no
>>>>>>>>>>>>>>>>>>>> messages
in the iterator (in this case it returns 0) and for
>>>>>>>>>>>>>>>>>>>> this
>>>>>>>>>>>>>>>>>>>> reason the
vertices get activated at each superstep.
>>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>> At each superstep,
under-the-hood, I pass the combiner for
>>>>>>>>>>>>>>>>>>>> each
>>>>>>>>>>>>>>>>>>>> vertex
>>>>>>>>>>>>>>>>>>>> an Iterable,
which can be empty:
>>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>>     public
Iterable<M>        getMessages(I vertexId) {
>>>>>>>>>>>>>>>>>>>>      
Iterable<M>        messages =
>>>>>>>>>>>>>>>>>>>> inMessages.getMessages(vertexId);
>>>>>>>>>>>>>>>>>>>>      
if (combiner != null) {
>>>>>>>>>>>>>>>>>>>>      
        M combinedMsg;
>>>>>>>>>>>>>>>>>>>>      
        try {
>>>>>>>>>>>>>>>>>>>>      
                combinedMsg =
>>>>>>>>>>>>>>>>>>>> combiner.combine(vertexId,
>>>>>>>>>>>>>>>>>>>> messages);
>>>>>>>>>>>>>>>>>>>>      
        }  catch (IOException e) {
>>>>>>>>>>>>>>>>>>>>      
                throw new RuntimeException("could not
>>>>>>>>>>>>>>>>>>>> combine",
>>>>>>>>>>>>>>>>>>>> e);
>>>>>>>>>>>>>>>>>>>>      
        }
>>>>>>>>>>>>>>>>>>>>      
        if (combinedMsg != null) {
>>>>>>>>>>>>>>>>>>>>      
                List<M>        tmp = new
>>>>>>>>>>>>>>>>>>>> ArrayList<M>(1);
>>>>>>>>>>>>>>>>>>>>      
                tmp.add(combinedMsg);
>>>>>>>>>>>>>>>>>>>>      
                messages = tmp;
>>>>>>>>>>>>>>>>>>>>      
        } else {
>>>>>>>>>>>>>>>>>>>>      
                messages = new ArrayList<M>(0);
>>>>>>>>>>>>>>>>>>>>      
        }
>>>>>>>>>>>>>>>>>>>>      
}
>>>>>>>>>>>>>>>>>>>>      
return messages;
>>>>>>>>>>>>>>>>>>>>     }
>>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>> the Iterable
returned by this methods is passed to
>>>>>>>>>>>>>>>>>>>> basicVertex.putMessages()
right before the compute().
>>>>>>>>>>>>>>>>>>>> Now, the
question is: who's wrong? The combiner code that
>>>>>>>>>>>>>>>>>>>> returns
>>>>>>>>>>>>>>>>>>>> a
>>>>>>>>>>>>>>>>>>>> sum of 0
over no values, or the framework that calls the
>>>>>>>>>>>>>>>>>>>> combiner
>>>>>>>>>>>>>>>>>>>> with
>>>>>>>>>>>>>>>>>>>> 0 messages?
>>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>
>>>>>>>>>>>>> --
>>>>>>>>>>>>>    Claudio Martella
>>>>>>>>>>>>>    claudio.martella@gmail.com
>>>>>>>>>>>
>>>>>>>>>>>
>>>>>>>>>>
>>>>>>>>>>
>>>>>>>>>> --
>>>>>>>>>>    Claudio Martella
>>>>>>>>>>    claudio.martella@gmail.com
>>>>>>>>
>>>>>>>>
>>>>>>>>
>>>>>>>> --
>>>>>>>>    Claudio Martella
>>>>>>>>    claudio.martella@gmail.com
>>>>>>
>>>>>>
>>>>>>
>>>>>
>>>
>>>
>>>
>>> --
>>>    Claudio Martella
>>>    claudio.martella@gmail.com



-- 
   Claudio Martella
   claudio.martella@gmail.com

Mime
View raw message