hama-commits mailing list archives

Site index · List index
Message view « Date » · « Thread »
Top « Date » · « Thread »
From Apache Wiki <wikidi...@apache.org>
Subject [Hama Wiki] Update of "GraphPackage" by edwardyoon
Date Wed, 18 Apr 2012 02:24:21 GMT
Dear Wiki user,

You have subscribed to a wiki page or wiki category on "Hama Wiki" for change notification.

The "GraphPackage" page has been changed by edwardyoon:

+ Please update http://people.apache.org/~edwardyoon/site/hama_graph_tutorial.html
- = The Graph Package (Angrapa) =
- The graph package, called Angrapa, is an large-scale graph data management framework for
analytical processing. It is still in heavy development. Angrapa will employ massive parallelism
on Hadoop, and It aims to achieve the scalability for processing tera bytes or peta bytes
graph data. Angrapa will be used in a variety of scientific and industrial areas, such as
data mining, machine learning, information retrieval, bioinformatics, and social networks,
required to process large-scale graph data.
- = The Main Goal =
-  * Easy APIs familiar to graph features
-  * Storing techniques and the data communication method (i.e., BSP) without deterioration
of graph data locality
- = An Overview of the Angrapa =
- The architecture of angrapa is similar to that of !MapReduce except it is founded on the
[[BulkSynchronizationParallel| BSP]]. It consists of one master and many walkers. One master
corresponds to a jobtracker of !MapReduce, and many walkers correspond to task trackers. Like
!MapReduce, angrapa will be carried out on data sets on HDFS or HBase.
- For processing, programs written in angrapa will be automatically parallelized and executed
on walkers. The processing of angrapa consists of a sequence of parallel supersteps. Each
superstep is also subdivided into three ordered phases, consisting of local computation in
each walker; communications among the walkers, leading to transfers of intermediate data between
walkers; and a barrier synchronization, waiting for all of the communications to complete.
- In the local computation phase, each walker polls a vertex ''v1'' - initially, it is given
by a user or the program, or it can be obtained from intermediate results transmitted from
other walkers after the first superstep. - from a local queue, and it starts to traverse from
''v1'' on graph data that reside in a local storage. During this phase, walker enqueues additional
vertices to be processed into the local queue. However, if some inserted vertex ''v2'' do
not reside in the local storage, the vertex ''v2'' is removed from the local queue and inserted
to the global queue that keeps vertices to be transmitted into other walkers at the next communication
- In the communication phase, each walker communicates intermediate data about vertices that
reside in local but are necessary to be computed with other vertices that reside in another
- In the barrier synchronization phase, the master controls all of the walkers with zookeepers.
If there are no intermediate data to be processed, the program will finish, and it is the
end of one superstep. Otherwise, next superstep will start with intermediate data transmitted
among walkers.
- = Example 1 =
- We assume that given a large-scale graph data and a start vertex, we find all shortest paths
from the start vertex to every vertices on the given graph data. In such a case, the program
written in angrapa starts from the start vertex. During the local computation phase, each
time walker dequeues a vertex from the local queue, each walker enqueues its adjacent vertices
with paths from the start vertex to the current dequeued vertex. Then, an enqueued vertex
is checked if it resides in the local partition. If not, it is picked to the global queue.
(To be described in more detail)
- = Example 2 =
- For example, we assume that given a subgraph ''s'', we find a set of subgraphs isomorphic
to the given subgraph ''s''. In such a case, angrapa will be very desirable because each walker
can independently begin with some start vertex matched to the subgraph ''s''. The program
will needs only one superstep if the subgraph ''s'' is small enough to be within a partition.
(To be described in more detail)

View raw message