Date: Mon, 29 Oct 2012 21:46:12 +0000 (UTC)
From: "Claudio Martella (JIRA)"
To: giraph-dev@incubator.apache.org
Reply-To: dev@giraph.apache.org
Message-ID: <1035005458.41100.1351547172406.JavaMail.jiratomcat@arcas>
In-Reply-To: <1815214644.30039.1351209792489.JavaMail.jiratomcat@arcas>
Subject: [jira] [Commented] (GIRAPH-388) Improve the way we keep outgoing messages

[ https://issues.apache.org/jira/browse/GIRAPH-388?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13486398#comment-13486398 ]

Claudio Martella commented on GIRAPH-388:
-----------------------------------------

Good work Maja. You got me thinking and I think your results make a lot of sense.

With neighborhoods of 100 vertices and 40 workers, you'd expect slightly over 2 neighbouring vertices in the same partition (100/39). This means that even if we didn't stream messages out with buffering, but kept them all in memory, we'd save about one message in every two. If you consider that we buffer a bit but flush messages as they are produced, the number of combined messages is basically zero.

This makes a lot of sense if you consider the original idea of the combiner in MapReduce. There, the cardinality of the key set of the original input is usually much higher than that of the intermediate set you feed to the reducer (otherwise you wouldn't be reducing, right?). THERE, the combiner makes a lot of sense.

Yes, we still have the same advantage of using a combiner as PageRank on MapReduce has, because there the cardinalities are the same as well (but the number of messages is higher; in fact the complexity is O(E), hence the combiner makes some sense). But the architecture of the shuffle and sort makes the (amortized) cost of applying the combiner cheaper there than it is for us.

I'm increasingly convinced that the role of the combiner is mostly to save memory rather than anything else, so it should mainly be used server-side.

> Improve the way we keep outgoing messages
> -----------------------------------------
>
>                 Key: GIRAPH-388
>                 URL: https://issues.apache.org/jira/browse/GIRAPH-388
>             Project: Giraph
>          Issue Type: Improvement
>            Reporter: Maja Kabiljo
>            Assignee: Maja Kabiljo
>         Attachments: GIRAPH-388.patch
>
>
> As per the discussion on GIRAPH-357, in a standard application the chances that we get to use the client-side combiner are very low. I experimented with the benefits we can get from not having the client-side combiner at all. It turns out that having a lot of maps in SendMessageCache, and then a collection inside each of them, really hurts performance.
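
The back-of-the-envelope estimate in the comment above can be reproduced with a small sketch. It is not taken from Giraph: the function names are hypothetical, and it assumes that each of a vertex's neighbours is placed on one of the workers uniformly at random (roughly what hash partitioning gives). It divides by all 40 workers, which gives 2.5, while the comment divides by 39 (about 2.56); both are "slightly over 2". The second function is an idealised upper bound on combining, for the case where a worker holds all its outgoing messages in memory and never flushes early.

```python
def expected_senders_per_worker(in_degree: int, workers: int) -> float:
    """Expected number of a vertex's neighbours hosted on one given worker.

    Assumption (not from the source): neighbours are assigned to workers
    independently and uniformly at random.
    """
    if in_degree < 0 or workers <= 0:
        raise ValueError("in_degree must be >= 0 and workers must be > 0")
    return in_degree / workers


def expected_messages_after_combining(in_degree: int, workers: int) -> float:
    """Expected messages per destination vertex with ideal client-side combining.

    Idealised case: every worker merges all of its messages for the
    destination into a single one before sending. The result is the expected
    number of workers that host at least one sender.
    """
    if in_degree < 0 or workers <= 0:
        raise ValueError("in_degree must be >= 0 and workers must be > 0")
    # Probability that a given worker hosts none of the senders.
    p_no_sender = (1.0 - 1.0 / workers) ** in_degree
    return workers * (1.0 - p_no_sender)


if __name__ == "__main__":
    d, w = 100, 40  # figures from the comment
    print("senders per worker:", expected_senders_per_worker(d, w))
    print("messages after ideal combining:", expected_messages_after_combining(d, w))
```

With so few senders per worker for any one destination, a buffer that is flushed as messages are produced will rarely hold two messages for the same vertex at the same time, which is consistent with the near-zero combining reported in the comment.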