Dear Wiki user,
You have subscribed to a wiki page or wiki category on "Hadoop Wiki" for change notification.
The following page has been changed by HyunsikChoi:
http://wiki.apache.org/hadoop/Hamburg
The comment on the change is:
Some parts are revised.

[[TableOfContents(5)]]
== Motivation ==
+ Large-scale graph processing is required in many areas, such as bioinformatics, social networks, the semantic web, and web information retrieval. However, existing systems cannot deal with the rapidly increasing volume of graph data. Since the advent of MapReduce (MR), many people have expected MR to be a good solution for large-scale graph processing, and some of them are trying to find algorithms and solutions for large-scale graph processing with MR. However, even though MR is a great programming model with linear scalability, we argue that large-scale graph processing needs an alternative programming model to MR, for the following reasons:
+ * '''!MapReduce cannot support traversing a graph''' – A mapper or reducer only provides sequential access to its input data, and MR iterations must be used to change the access pattern because MR cannot control its next input data. In contrast, many graph problems are based on walking vertices step by step (i.e., graph traversal). Walking vertices means expanding the adjacent vertices of a given vertex, which is only possible if the operation on the current input data can determine the next input data. In MR, however, the current operation cannot control the input data of the next operation. Consequently, graph processing with MapReduce is very limited. To overcome this limit, we either have to avoid graph traversal when solving graph problems ([http://ieeexplore.ieee.org/search/wrapper.jsp?arnumber=5076317 Graph Twiddling in a MapReduce World]) or have to perform many MR iterations ([http://blog.udanax.org/2009/02/breadthfirstsearchmapreduce.html Breadth-First Search (BFS) & MapReduce]), and the startup cost of each MR job is very expensive.
+ * '''!MapReduce is limited to a single reducer''' – When an MR program deals with a graph problem, having the partitioner assign intermediate data to appropriate reducers according to the relations of the partitioned graph data is very difficult, because local sufficiency of the data is hard to satisfy. Local sufficiency means that no data from other sites is needed to process a task. To the best of my knowledge, the most straightforward way around this problem is to use only one reducer, but that clearly causes a scalability problem.
+ * '''More complicated M/R programs''' – To avoid graph traversal or the single-reducer limit, MR programs inevitably become complicated and have to communicate data among data nodes during each MR computation.
Therefore, we need a new programming model for graph processing on Hadoop.
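To make the iteration cost concrete, here is a minimal, illustrative Python sketch (not Hadoop code; the function name and adjacency-list representation are assumptions for the example) of level-synchronous BFS. Each expansion of the frontier corresponds to one full MR job, so a graph of diameter d pays the job startup cost d times:

```python
# Illustrative sketch only: each while-loop round models one complete
# MapReduce job, so its startup cost is paid once per round.

def bfs_rounds(graph, source):
    """Return (distance map, number of rounds) for a BFS from source.

    graph: dict mapping a vertex to its list of adjacent vertices.
    """
    dist = {source: 0}
    frontier = [source]
    rounds = 0
    while frontier:
        rounds += 1  # in MR, this is a whole job launch
        next_frontier = []
        for v in frontier:
            for w in graph.get(v, []):
                if w not in dist:
                    dist[w] = dist[v] + 1
                    next_frontier.append(w)
        frontier = next_frontier
    return dist, rounds
```

On a small diamond-shaped graph this takes three rounds, i.e., three MR jobs, even though only four vertices are involved.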
== Goal ==

* Support graph traversal
* Support a simple programming interface for dealing with graph data
* Follow the scalability concept of shared-nothing architecture
* Fault-tolerant implementation
== Hamburg ==
+ Hamburg is an alternative to the M/R programming model. It is based on the bulk synchronous parallel (BSP) model. Like MR, Hamburg takes advantage of the shared-nothing (SN) architecture, so I expect that it will also show scalability with almost no degradation of performance as the number of participant nodes increases. In addition, we will provide a set of easy APIs that reflect graph features and are similar to MR.
+
A Hamburg computation step, based on BSP, consists of three sub-steps:
* Computation on data that resides in local storage; this is similar to the map operation in M/R.
* Each node communicates its necessary data to the others.
@@ 31, +31 @@
[http://lh4.ggpht.com/_DBxyBGtfa3g/SmQUYTHWooI/AAAAAAAABmk/cFVlLCdLVHE/s800/figure1.PNG]
+ When a job is submitted, each worker starts by processing the data partitions that reside in local storage. During local computation, each worker stores temporary data that needs to be transmitted to other workers in a local queue. After all local computations finish, the workers perform bulk synchronization using the collected communication data. 'Computation' and 'Bulk synchronization' can be performed iteratively, and the data for synchronization can be compressed to reduce network usage. The main difference between Hamburg and MR is that Hamburg does not aggregate intermediate data into reducers. Instead, each computation node communicates only the necessary data to the others at each bulk synchronization step. This is efficient when the total communicated data is smaller than the intermediate data that would be aggregated into reducers. Plainly, it aims to improve the performance of traversal operations in graph computing.
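The worker loop described above can be sketched as follows. This is a minimal, single-process Python illustration of one BSP superstep (local computation, a per-worker outgoing-message queue, then bulk synchronization); all names are hypothetical and this is not the actual Hamburg API:

```python
# Hypothetical single-process sketch of one BSP superstep, assuming each
# worker is represented by a compute function.

def superstep(workers, inboxes):
    """Run one BSP superstep.

    workers: dict mapping worker_id -> compute function; each compute
             function takes the worker's incoming messages and returns a
             list of (target_worker_id, message) pairs to send.
    inboxes: dict mapping worker_id -> messages delivered at the last barrier.
    Returns the inboxes for the next superstep.
    """
    # 1. Local computation: each worker sees only its own inbox and local
    #    data, and queues outgoing messages instead of touching remote state.
    outboxes = {wid: compute(inboxes.get(wid, []))
                for wid, compute in workers.items()}

    # 2. Bulk synchronization: after all workers finish, route every queued
    #    message to its target worker in one exchange (the barrier).
    next_inboxes = {}
    for outbox in outboxes.values():
        for target, msg in outbox:
            next_inboxes.setdefault(target, []).append(msg)
    return next_inboxes
```

Iterating `superstep` until no messages remain reproduces the 'Computation'/'Bulk synchronization' cycle in the figure; only the queued messages cross worker boundaries, which is where the hoped-for savings over reducer-side aggregation come from.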
=== Initial contributors ===
* Edward J. (edwardyoon AT apache.org)