Mailing-List: contact dev-help@hama.apache.org; run by ezmlm
Precedence: bulk
Reply-To: dev@hama.apache.org
Date: Fri, 22 Feb 2013 13:16:12 +0000 (UTC)
From: "Thomas Jungblut (JIRA)" <jira@apache.org>
To: dev@hama.apache.org
Message-ID: <JIRA.12625550.1356926049028.323198.1361538972766@arcas>
In-Reply-To: <JIRA.12625550.1356926049028@arcas>
References: <JIRA.12625550.1356926049028@arcas>
Subject: [jira] [Commented] (HAMA-704) Optimization of memory usage during
 message processing
MIME-Version: 1.0
Content-Type: text/plain; charset=utf-8
Content-Transfer-Encoding: 7bit


    [ https://issues.apache.org/jira/browse/HAMA-704?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13584246#comment-13584246 ] 

Thomas Jungblut commented on HAMA-704:
--------------------------------------

bq.Create an array of say 8 or 16 spilling queues, basically partitioning the messages into vertice corresponding buckets. So spilling queue[0] would be responsible for messages for first numVertices/16 vertices, queues[1] for the next numVertices/16. 

That is cool, you just have to make sure that they are sorted and getCurrentMessage returns them in this order. Nothing much else is needed ;-)
Your approach has a bit of a problem, as the currentMessage must be the lowest in all the received messages. So a heap structure wouldn't be wrong in this case although very difficult to maintain the heap property on disk. Your bucketing must ensure that the order is correct, otherwise sorting the buckets in itself won't yield a correct working algorithm.

bq.For sorted spilling queue I would have to sort throughout the whole messages.

I guess we can't get over that easily. But I told you in chat already that sorting on sender side and merging on receiver side is in this case better and easier to implement.

bq.I do understand the messages don't have uniform distribution across vertices and we don't need a complete B-tree implementation.

The good thing is that a b-tree is maintaining order over all keys, so you can iterate messages with minimal number of disk seeks.

The message distribution is related to the number of active vertices- in case of pagerank every vertex is active for all iterations (that is why I benchmark it usually).
We track how many vertices are active, so we can give a hint to the queue how much it should cache. 

I guess we shouldn't get that fancy for the first shot, improvement can always be done later on.
                
> Optimization of memory usage during message processing
> ------------------------------------------------------
>
>                 Key: HAMA-704
>                 URL: https://issues.apache.org/jira/browse/HAMA-704
>             Project: Hama
>          Issue Type: Improvement
>          Components: graph
>            Reporter: Edward J. Yoon
>            Assignee: Edward J. Yoon
>            Priority: Critical
>             Fix For: 0.6.1
>
>         Attachments: HAMA-704_1337.patch, HAMA-704_1338.patch, HAMA-704.patch-v1, hama-704_v05.patch, HAMA-704-v2.patch, localdisk.patch, mytest.patch, patch.txt, patch.txt, removeMsgMap.patch
>
>
> <vertex, message> map seems consume a lot of memory. We should figure out an efficient way to reduce memory.

--
This message is automatically generated by JIRA.
If you think it was sent incorrectly, please contact your JIRA administrators
For more information on JIRA, see: http://www.atlassian.com/software/jira