Return-Path: X-Original-To: apmail-hama-dev-archive@www.apache.org Delivered-To: apmail-hama-dev-archive@www.apache.org Received: from mail.apache.org (hermes.apache.org [140.211.11.3]) by minotaur.apache.org (Postfix) with SMTP id CF661E085 for ; Fri, 22 Feb 2013 13:16:14 +0000 (UTC) Received: (qmail 65739 invoked by uid 500); 22 Feb 2013 13:16:14 -0000 Delivered-To: apmail-hama-dev-archive@hama.apache.org Received: (qmail 65494 invoked by uid 500); 22 Feb 2013 13:16:14 -0000 Mailing-List: contact dev-help@hama.apache.org; run by ezmlm Precedence: bulk List-Help: List-Unsubscribe: List-Post: List-Id: Reply-To: dev@hama.apache.org Delivered-To: mailing list dev@hama.apache.org Received: (qmail 65307 invoked by uid 99); 22 Feb 2013 13:16:13 -0000 Received: from arcas.apache.org (HELO arcas.apache.org) (140.211.11.28) by apache.org (qpsmtpd/0.29) with ESMTP; Fri, 22 Feb 2013 13:16:13 +0000 Date: Fri, 22 Feb 2013 13:16:12 +0000 (UTC) From: "Thomas Jungblut (JIRA)" To: dev@hama.apache.org Message-ID: In-Reply-To: References: Subject: [jira] [Commented] (HAMA-704) Optimization of memory usage during message processing MIME-Version: 1.0 Content-Type: text/plain; charset=utf-8 Content-Transfer-Encoding: 7bit X-JIRA-FingerPrint: 30527f35849b9dde25b450d4833f0394 [ https://issues.apache.org/jira/browse/HAMA-704?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13584246#comment-13584246 ] Thomas Jungblut commented on HAMA-704: -------------------------------------- bq.Create an array of say 8 or 16 spilling queues, basically partitioning the messages into vertice corresponding buckets. So spilling queue[0] would be responsible for messages for first numVertices/16 vertices, queues[1] for the next numVertices/16. That is cool, you just have to make sure that they are sorted and getCurrentMessage returns them in this order. Nothing much else is needed ;-) Your approach has a bit of a problem, as the currentMessage must be the lowest in all the received messages. So a heap structure wouldn't be wrong in this case although very difficult to maintain the heap property on disk. Your bucketing must ensure that the order is correct, otherwise sorting the buckets in itself won't yield a correct working algorithm. bq.For sorted spilling queue I would have to sort throughout the whole messages. I guess we can't get over that easily. But I told you in chat already that sorting on sender side and merging on receiver side is in this case better and easier to implement. bq.I do understand the messages don't have uniform distribution across vertices and we don't need a complete B-tree implementation. The good thing is that a b-tree is maintaining order over all keys, so you can iterate messages with minimal number of disk seeks. The message distribution is related to the number of active vertices- in case of pagerank every vertex is active for all iterations (that is why I benchmark it usually). We track how many vertices are active, so we can give a hint to the queue how much it should cache. I guess we shouldn't get that fancy for the first shot, improvement can always be done later on. > Optimization of memory usage during message processing > ------------------------------------------------------ > > Key: HAMA-704 > URL: https://issues.apache.org/jira/browse/HAMA-704 > Project: Hama > Issue Type: Improvement > Components: graph > Reporter: Edward J. Yoon > Assignee: Edward J. Yoon > Priority: Critical > Fix For: 0.6.1 > > Attachments: HAMA-704_1337.patch, HAMA-704_1338.patch, HAMA-704.patch-v1, hama-704_v05.patch, HAMA-704-v2.patch, localdisk.patch, mytest.patch, patch.txt, patch.txt, removeMsgMap.patch > > > map seems consume a lot of memory. We should figure out an efficient way to reduce memory. -- This message is automatically generated by JIRA. If you think it was sent incorrectly, please contact your JIRA administrators For more information on JIRA, see: http://www.atlassian.com/software/jira