hama-dev mailing list archives

From "Thomas Jungblut (JIRA)" <j...@apache.org>
Subject [jira] [Commented] (HAMA-704) Optimization of memory usage during message processing
Date Thu, 21 Feb 2013 07:10:12 GMT

    [ https://issues.apache.org/jira/browse/HAMA-704?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13582976#comment-13582976

Thomas Jungblut commented on HAMA-704:

So I want to share the overall plan now.

Our requirements are:
 - Vertex and edge values (plus possible user-added state) change every superstep.
 - A vertex can be active or inactive; an inactive vertex is reactivated by an incoming message
until it calls voteToHalt().
 - Be as memory-savvy as possible; for future fault-tolerance needs, the changing parts must
be spillable to disk so the computation is restartable.

So how can we tackle this with an architecture?

*The graph part*

We divide the graph into two parts:
1. a static graph that contains only the IDs, sorted ascending by vertex ID, each vertex ID
followed by its out-edge IDs (VertexID1, EdgeID1, EdgeID2, VertexID2, EdgeID5, etc.).
2. a changing graph (soft/liquid graph) that contains the soft attributes which change very
often. Its format parallels the static graph so both can be read in lockstep
(VertexValue1, EdgeValue1, EdgeValue2, VertexValue2, EdgeValue5, etc.).

Both files live on disk. The static graph must be written while the vertices are received from
the partitioning and sorted. The soft part is written to disk (or direct memory, or whatever
sink the VerticesInfo defines) every time a vertex computes its new state during a superstep.
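To make the parallel two-file layout concrete, here is a minimal sketch in Java. The class and method names are illustrative assumptions, not actual Hama code; it only shows how a static record (vertex ID plus out-edge IDs) and its matching soft record (vertex value plus edge values) could be serialized in the same order so both streams stay aligned.

```java
import java.io.ByteArrayOutputStream;
import java.io.DataOutputStream;
import java.io.IOException;

// Hypothetical sketch of the two-part layout described above. Names like
// writeStatic/writeSoft are assumptions for illustration, not the Hama API.
public class GraphLayoutSketch {

    // Static part: vertex ID, out-edge count, then the out-edge IDs.
    static void writeStatic(DataOutputStream out, long vertexId, long[] edgeIds)
            throws IOException {
        out.writeLong(vertexId);
        out.writeInt(edgeIds.length);
        for (long e : edgeIds) {
            out.writeLong(e);
        }
    }

    // Soft part: vertex value, then one value per edge, written in exactly
    // the order used by writeStatic so both files can be read in lockstep.
    static void writeSoft(DataOutputStream out, double vertexValue, double[] edgeValues)
            throws IOException {
        out.writeDouble(vertexValue);
        for (double v : edgeValues) {
            out.writeDouble(v);
        }
    }

    // Builds a tiny static graph: vertex 1 -> {2, 3}, vertex 2 -> {5}.
    static byte[] exampleStatic() {
        try {
            ByteArrayOutputStream buf = new ByteArrayOutputStream();
            DataOutputStream out = new DataOutputStream(buf);
            writeStatic(out, 1L, new long[] {2L, 3L});
            writeStatic(out, 2L, new long[] {5L});
            out.flush();
            return buf.toByteArray();
        } catch (IOException e) {
            throw new AssertionError(e); // cannot happen with an in-memory stream
        }
    }
}
```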

Therefore we need some additional bookkeeping. For vertex activeness we can use a BitSet
with an index mapping (we can index-map everything because the graph is sorted), e.g. bit 1
set means vertex1 is active. To seek to the next active vertex without deserializing the
whole graph, we need another index mapping: a long[] that contains the byte offset where the
values of the vertex at a given index start. This is mainly for seeking purposes, so we can
jump to the next active vertex in the set without deserializing the part of the graph in
between. This has really, really low memory overhead (pretty much none) and it scales
extremely well, so we can make GraphChi look like a real loser.
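The bookkeeping above can be sketched as follows, assuming hypothetical names (this is not the actual Hama implementation): a BitSet for activeness plus a long[] of per-vertex byte offsets, combined to find where to seek for the next active vertex.

```java
import java.util.BitSet;

// Minimal sketch of the activeness bookkeeping described above. The names
// are assumptions for illustration: bit i of `active` marks vertex index i
// as active, and byteOffsets[i] is where vertex i's soft record starts, so
// we can seek there without deserializing the records in between.
public class ActiveIndexSketch {

    final BitSet active;
    final long[] byteOffsets;

    ActiveIndexSketch(int numVertices, long[] byteOffsets) {
        this.active = new BitSet(numVertices);
        this.byteOffsets = byteOffsets;
    }

    // Byte offset to seek to for the next active vertex at or after
    // fromIndex, or -1 if no vertex is active anymore.
    long seekOffsetOfNextActive(int fromIndex) {
        int next = active.nextSetBit(fromIndex);
        return next < 0 ? -1L : byteOffsets[next];
    }
}
```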

Suraj and I talked about it: he is going to make the messaging more scalable, while I focus
on the graph storage according to the plan above. Our interface between the two is a message
queue sorted ascending by vertex ID. While Suraj is working on it, I will use the in-memory
SortedMessagingQueue to simulate the behaviour.
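As a rough illustration of that interface contract (not the actual SortedMessagingQueue code, and the names here are assumed), a priority queue keyed on the destination vertex ID delivers messages in the same ascending order in which the sorted graph storage iterates its vertices:

```java
import java.util.PriorityQueue;

// Illustrative stand-in for a messaging queue sorted ascending by
// destination vertex ID. Class and field names are assumptions, not the
// actual Hama interface.
public class SortedQueueSketch {

    static final class Msg implements Comparable<Msg> {
        final long dstVertexId;
        final double value;

        Msg(long dstVertexId, double value) {
            this.dstVertexId = dstVertexId;
            this.value = value;
        }

        // Order by destination vertex ID so delivery matches the sorted
        // iteration order of the graph storage.
        public int compareTo(Msg other) {
            return Long.compare(this.dstVertexId, other.dstVertexId);
        }
    }

    private final PriorityQueue<Msg> queue = new PriorityQueue<>();

    void add(long dstVertexId, double value) {
        queue.add(new Msg(dstVertexId, value));
    }

    // Next message in ascending vertex-ID order, or null when drained.
    Msg poll() {
        return queue.poll();
    }
}
```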

> Optimization of memory usage during message processing
> ------------------------------------------------------
>                 Key: HAMA-704
>                 URL: https://issues.apache.org/jira/browse/HAMA-704
>             Project: Hama
>          Issue Type: Improvement
>          Components: graph
>            Reporter: Edward J. Yoon
>            Assignee: Edward J. Yoon
>            Priority: Critical
>             Fix For: 0.6.1
>         Attachments: HAMA-704.patch-v1, hama-704_v05.patch, HAMA-704-v2.patch, localdisk.patch,
> mytest.patch, patch.txt, patch.txt, removeMsgMap.patch
> <vertex, message> map seems to consume a lot of memory. We should figure out an efficient
> way to reduce memory.

This message is automatically generated by JIRA.
If you think it was sent incorrectly, please contact your JIRA administrators
For more information on JIRA, see: http://www.atlassian.com/software/jira
