hama-user mailing list archives

Site index · List index
Message view « Date » · « Thread »
Top « Date » · « Thread »
From Leonidas Fegaras <fega...@cse.uta.edu>
Subject Implementing Hadoop map-reduce on Hama
Date Thu, 11 Oct 2012 15:15:40 GMT
I have seen some emails in this mailing list asking questions, such as:
I have an X algorithm running on Hadoop map-reduce. Is it suitable for  
Hama?
I think it would be great if we had a good implementation of the
Hadoop map-reduce classes on Hama. Other distributed main-memory
systems have already done so. See:
M3R (http://vldb.org/pvldb/vol5/p1736_avrahamshinnar_vldb2012.pdf) and  
Spark.
It is actually easier than you think. I have done something similar
for my query system, MRQL. What we need is to reimplement
org.apache.hadoop.mapreduce.Job to execute one superstep for each
map-reduce job. Then a Hadoop map-reduce program that may contain
complex workflows and/or loops of map-reduce jobs would need minor
changes to run on Hama as a single BSPJob. Obviously, to implement
map-reduce in Hama, the mapper output can be shuffled to reducers
based on key by sending messages using hashing:
peer.getPeerName(key.hashValue() % peer.getNumPeers())
Then the reducer superstep groups the data by the key in memory and
applies the reducer method. To handle input/intermediate data, we can
use a mapping from path_name to (count,vector) at each node. The
path_name is the path name of some input or intermediate HDFS file,
vector contains the data partition from this file assigned to the  
node, and
count is the max number of times we can scan this vector (after count
times, the vector is garbage-collected). The special case where
count=1 can be implemented using a stream (a Java inner class that
implements a stream Iterator). Given that the map-reduce Job output
is rarely accessed more than once, the translation of most map-reduce
jobs to Hama will not require any data to be stored in memory other
than those used by the map-reduce jobs. One exception is the graph
data that need to persist in memory across all jobs (then count=maxint).
Based on my experience with MRQL, the implementation of these ideas
may need up to 1K lines of Java code. Let me know if you are interested.
Leonidas Fegaras


Mime
View raw message