hadoop-common-commits mailing list archives

From Apache Wiki <wikidi...@apache.org>
Subject [Hadoop Wiki] Update of "Hamburg" by HyunsikChoi
Date Sat, 12 Sep 2009 15:43:06 GMT
Dear Wiki user,

You have subscribed to a wiki page or wiki category on "Hadoop Wiki" for change notification.

The following page has been changed by HyunsikChoi:
http://wiki.apache.org/hadoop/Hamburg

The comment on the change is:
Some parts are revised.

------------------------------------------------------------------------------
  [[TableOfContents(5)]]
  
  == Motivation ==
- The MapReduce (M/R) programming model is inappropriate to graph problems because of the
following reasons:
+ Large-scale graph processing is increasingly required in many areas, such as bioinformatics,
social networks, the semantic web, and web information retrieval. However, existing systems
cannot deal with the rapidly increasing volume of graph data. Since the advent of MapReduce (MR),
many people have expected MR to be a good solution for large-scale graph processing, and some
are trying to find algorithms and solutions for large-scale graph processing with M/R. However,
even though MR is a great programming model with linear scalability, we argue that large-scale
graph processing needs an alternative programming model to MR, for the following reasons:
  
-  * '''!MapReduce does not support traversing graph''' – A mapper reads input data sequentially,
and it can’t control its input data. In contrast, most of the graph problems are based on
walking vertices step by step. Walking vertices is to expand adjacent vertices from a given
vertex.  This operation is only available if current input data can be determined by the previous
operation. In MapReduce, however, the previous operation cannot affect the input data of the
next operation. Traversing a vertex is the most basic premitive operation in graph operations.
Consequently, graph processing with MapReduce is very limited. In order to come over this
limit, we have to avoid traverse of graph in order to solve graph problems ([http://ieeexplore.ieee.org/search/wrapper.jsp?arnumber=5076317
Graph Twiddling in a MapReduce World]) or have to use many M/R iterations each time walk vertices
([http://blog.udanax.org/2009/02/breadth-first-search-mapreduce.html Breadth-First Search
(BFS) & MapReduce]).
-  * '''!MapReduce limits to assigning one reducer''' - In a MapReduce problem on graph data,
assigning appropriate reducers according to their relations of partitioned graphs is very
hard. Assigning only one reducer is a straightforward way to solve their complex relations,
but it is apparent to cause deterioration of scalability.
+  * '''!MapReduce cannot support traversing a graph''' - A mapper/reducer only provides sequential
access to its input data, and we must use M/R iterations to change the access pattern, because
MR cannot control its next input data. In contrast, many graph problems are based on walking
vertices step by step (i.e., graph traversal). Walking vertices means expanding the adjacent
vertices of a given vertex, and this is only possible if the operation on the current input
data can determine the next input data. In MR, however, the current operation cannot control
the input data of the next operation. Consequently, graph processing with MapReduce is very
limited. To overcome this limit, we either have to recast graph problems so that no traversal
is needed ([http://ieeexplore.ieee.org/search/wrapper.jsp?arnumber=5076317 Graph
Twiddling in a MapReduce World]) or have to perform many MR iterations ([http://blog.udanax.org/2009/02/breadth-first-search-mapreduce.html
Breadth-First Search (BFS) & MapReduce]); see the driver sketch after this list. Moreover,
the initialization cost of each MR job is very expensive.
+  * '''!MapReduce limits us to assigning one reducer''' - When an MR program deals with a
graph problem, it is very difficult for the partitioner to assign intermediate data to appropriate
reducers according to the relations among the partitioned graph data, because local sufficiency
of the data is hard to satisfy. Local sufficiency means that a task needs no data from other
sites. To the best of my knowledge, the most straightforward way around this problem is to
use only one reducer, but that obviously causes a scalability problem.
-  * '''More complicated M/R program''' - To avoid graph traverse or the limit of one reduce,
the M/R programs would be inevitablely complicated with code to communicate data among data
nodes.
+  * '''More complicated M/R programs''' - To avoid graph traversal or the single-reducer limit,
M/R programs inevitably become complicated, and they have to communicate data among data
nodes during each MR computation.
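
+ As a concrete illustration of that iteration cost, below is a minimal sketch of BFS expressed
as a chain of MR jobs. It is only a sketch: the `vertexId<TAB>distance<TAB>adjacencyList` record
format, the class names, and the `BFS/UPDATED` counter are illustrative assumptions, not code
from the cited post. The structural point is that every single hop of the traversal pays the
startup cost of a whole MR job.

{{{
import java.io.IOException;
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.Reducer;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

// All class names and the record format are illustrative assumptions.
// Input lines look like "vertexId<TAB>distance<TAB>adj1,adj2,...";
// the source vertex starts at distance 0, all others at Integer.MAX_VALUE.
public class BfsDriver {

  public static class BfsMapper extends Mapper<LongWritable, Text, Text, Text> {
    @Override
    protected void map(LongWritable key, Text value, Context ctx)
        throws IOException, InterruptedException {
      String[] parts = value.toString().split("\t");
      String id = parts[0];
      int dist = Integer.parseInt(parts[1]);
      String adj = parts.length > 2 ? parts[2] : "";
      // Re-emit the vertex record so the reducer can rebuild it.
      ctx.write(new Text(id), new Text(dist + "\t" + adj));
      // Expand the frontier: propose distance+1 to every neighbor.
      if (dist != Integer.MAX_VALUE) {
        for (String n : adj.split(",")) {
          if (!n.isEmpty()) ctx.write(new Text(n), new Text(String.valueOf(dist + 1)));
        }
      }
    }
  }

  public static class BfsReducer extends Reducer<Text, Text, Text, Text> {
    @Override
    protected void reduce(Text key, Iterable<Text> values, Context ctx)
        throws IOException, InterruptedException {
      int old = Integer.MAX_VALUE, best = Integer.MAX_VALUE;
      String adj = "";
      for (Text v : values) {
        String[] p = v.toString().split("\t", 2);
        if (p.length == 2) { old = Integer.parseInt(p[0]); adj = p[1]; } // vertex record
        best = Math.min(best, Integer.parseInt(p[0]));                   // distance proposal
      }
      if (best < old) ctx.getCounter("BFS", "UPDATED").increment(1);
      ctx.write(key, new Text(Math.min(best, old) + "\t" + adj));
    }
  }

  public static void main(String[] args) throws Exception {
    Configuration conf = new Configuration();
    Path in = new Path(args[0]);
    for (int level = 0; ; level++) {          // one full MR job per frontier hop
      Path out = new Path(args[1] + "/level-" + level);
      Job job = new Job(conf, "bfs-level-" + level);
      job.setJarByClass(BfsDriver.class);
      job.setMapperClass(BfsMapper.class);
      job.setReducerClass(BfsReducer.class);
      job.setOutputKeyClass(Text.class);
      job.setOutputValueClass(Text.class);
      FileInputFormat.addInputPath(job, in);
      FileOutputFormat.setOutputPath(job, out);
      if (!job.waitForCompletion(true)) System.exit(1);
      // Stop once no vertex improved its distance in this pass.
      if (job.getCounters().findCounter("BFS", "UPDATED").getValue() == 0) break;
      in = out;                               // this level's output feeds the next level
    }
  }
}
}}}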
  
  Therefore, we need a new programming model for graph processing on Hadoop.
  
  == Goal ==
- 
   * Support graph traversal
   * Support a simple programming interface for dealing with graph data
   * Follow the scalability concept of the shared-nothing architecture
   * Fault-tolerant implementation
  
  == Hamburg ==
- Hambrug is an alternative to M/R programming model. It is based on bulk synchronization
parallel (BSP) model. Like M/R, Hambrug takes advantages from shared-nothing architecture
(SN), so I expect that it will also show scalablity without almost degradation of performance
as the number of participant nodes increases.
+ Hamburg is an alternative to the M/R programming model, based on the Bulk Synchronous
Parallel (BSP) model. Like MR, Hamburg will take advantage of a shared-nothing (SN) architecture,
so I expect that it will also scale with almost no degradation of performance as the number
of participating nodes increases. In addition, we will provide a set of easy APIs that are
natural for graph features and similar to MR.
+ 
  A BSP-based computation step in Hamburg consists of three sub-steps:
   * Computation on data that reside in local storage; this is similar to the map operation in M/R.
   * Each node communicates its necessary data to the others.
@@ -31, +31 @@

  
  [http://lh4.ggpht.com/_DBxyBGtfa3g/SmQUYTHWooI/AAAAAAAABmk/cFVlLCdLVHE/s800/figure1.PNG]
  
- Each worker will process the data fragments stored locally. And then, We can do bulk synchronization
using collected communication data. The 'Computation' and 'Bulk synchronization' can be performed
iteratively, Data for synchronization can be compressed to reduce network usage. The main
difference between Hamburg and M/R is that Hamburg does not make intermediate data aggregate
into reducer. Instead, each computation node communicates only necessary data into one another.
 It will be efficient if total communicated data is smaller then intermediate data to be aggregated
into reducers. Plainly, It aims to improve the performance of traverse operations in Graph
computing. 
+ When a job is submitted, each worker starts by processing the data partitions that reside
in its local storage. During local computation, each worker stores temporary data, which must
be transmitted to other workers, in a local queue. After all local computations finish, the
workers perform bulk synchronization, exchanging the queued data with one another. The 'Computation'
and 'Bulk synchronization' steps can be performed iteratively, and the data for synchronization
can be compressed to reduce network usage. The main difference between Hamburg and MR is that
Hamburg does not aggregate intermediate data into reducers. Instead, each computation node
communicates only the necessary data to the others at each bulk synchronization step. This
is efficient whenever the total communicated data is smaller than the intermediate data that
would have to be aggregated into reducers. Plainly, it aims to improve the performance of
traversal operations in graph computing. 
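
+ To make the three sub-steps concrete, below is a toy, single-JVM sketch of the superstep
loop, again using BFS as the running example. All names here are hypothetical; Hamburg's actual
API is not specified on this page. The point is that messages produced during local computation
are only queued, and the bulk synchronization step delivers them all at once, so each superstep
walks the graph exactly one hop.

{{{
import java.util.*;

// Hypothetical single-JVM sketch of a BSP superstep loop; not Hamburg's API.
public class BspSupersteps {
  public static void main(String[] args) {
    // A tiny example graph as adjacency lists: vertex -> neighbors.
    Map<Integer, List<Integer>> adj = new HashMap<Integer, List<Integer>>();
    adj.put(0, Arrays.asList(1, 2));
    adj.put(1, Arrays.asList(3));
    adj.put(2, Arrays.asList(3));
    adj.put(3, Collections.<Integer>emptyList());

    Map<Integer, Integer> dist = new HashMap<Integer, Integer>();   // vertex -> BFS level
    Map<Integer, List<Integer>> inbox = new HashMap<Integer, List<Integer>>();
    inbox.put(0, new ArrayList<Integer>(Arrays.asList(0)));         // seed the source vertex

    while (!inbox.isEmpty()) {
      Map<Integer, List<Integer>> outbox = new HashMap<Integer, List<Integer>>();
      // Sub-step 1: local computation -- each vertex processes the messages
      // delivered at the previous barrier.
      for (Map.Entry<Integer, List<Integer>> e : inbox.entrySet()) {
        int v = e.getKey();
        int proposed = Collections.min(e.getValue());
        Integer current = dist.get(v);
        if (current == null || current > proposed) {
          dist.put(v, proposed);
          // Sub-step 2: queue messages for neighbors instead of sending them now.
          for (int n : adj.get(v)) {
            if (!outbox.containsKey(n)) outbox.put(n, new ArrayList<Integer>());
            outbox.get(n).add(proposed + 1);
          }
        }
      }
      // Sub-step 3: bulk synchronization -- every queued message is delivered
      // at once, and no vertex starts the next superstep before this barrier.
      inbox = outbox;
    }
    System.out.println(dist);  // prints {0=0, 1=1, 2=1, 3=2}
  }
}
}}}

+ In a real deployment each worker would hold one partition of the graph, the queues would
be kept per destination worker, and the barrier would be a cluster-wide synchronization point;
compressing the queued data before the exchange is what keeps network usage low.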
  
  === Initial contributors ===
   * Edward J. (edwardyoon AT apache.org)
