From: Leonidas Fegaras
To: user@hama.apache.org
Subject: Re: Implementing Hadoop map-reduce on Hama
Date: Thu, 11 Oct 2012 11:37:24 -0500

Hi Apurv,
Allow me to disagree. For complex workflows and repetitive map-reduce
jobs, such as PageRank, a Hama implementation of map-reduce would be
far superior. I haven't looked at your code yet, but I hope it does not
use HDFS or in-memory storage for the I/O of each map-reduce job. It
must use streaming to pass results across map-reduce jobs. If you are
already doing this, there is no point in me stepping in. But I really
think this would be very useful in practice, not just for
demonstration.
Best regards,
Leonidas Fegaras

On Oct 11, 2012, at 11:04 AM, Apurv Verma wrote:

> Hey Leonidas,
> Glad that you are thinking about it. IMO it would be good to have
> such a functionality for demonstration purposes, but only for
> demonstration purposes. It's best to let Hadoop do what it's best at
> and Hama do what it's best at. ;) Suraj and I tried our hands at it
> in the past; please see [0] and [1]. I was also working on such a
> module on my GitHub account, but I couldn't find much time.
> Basically, it is mostly the way you expressed it in your last mail.
> Do you have a GitHub account? I have already done some work on
> GitHub; I could share it with you and divide the work, if you want.
>
> [0] http://code.google.com/p/anahad/source/browse/trunk/src/main/java/org/anahata/bsp/WordCount.java
>     My POC of the WordCount example on Hama. Super dirty and not
>     generic, but it works.
>
> [1] https://github.com/ssmenon/hama
>     Suraj's in-memory implementation.
>
> Let me know what you think.
>
> --
> Regards,
> Apurv Verma
>
>
> On Thu, Oct 11, 2012 at 8:45 PM, Leonidas Fegaras wrote:
>
>> I have seen some emails in this mailing list asking questions such
>> as: "I have an X algorithm running on Hadoop map-reduce. Is it
>> suitable for Hama?" I think it would be great if we had a good
>> implementation of the Hadoop map-reduce classes on Hama. Other
>> distributed main-memory systems have already done so; see M3R
>> (http://vldb.org/pvldb/vol5/p1736_avrahamshinnar_vldb2012.pdf) and
>> Spark.
>>
>> It is actually easier than you think. I have done something similar
>> for my query system, MRQL. What we need is to reimplement
>> org.apache.hadoop.mapreduce.Job to execute one superstep for each
>> map-reduce job. Then a Hadoop map-reduce program that contains
>> complex workflows and/or loops of map-reduce jobs would need only
>> minor changes to run on Hama as a single BSPJob. To implement
>> map-reduce on Hama, the mapper output can be shuffled to reducers by
>> key, by sending each pair as a message to the peer
>>
>>   peer.getPeerName(key.hashCode() % peer.getNumPeers())
>>
>> Then the reducer superstep groups the received data by key in memory
>> and applies the reduce method. To handle input and intermediate
>> data, we can use a mapping from path_name to (count, vector) at each
>> node: path_name is the path name of some input or intermediate HDFS
>> file, vector holds the data partition from this file assigned to the
>> node, and count is the maximum number of times we can scan this
>> vector (after count scans, the vector is garbage-collected). The
>> special case count=1 can be implemented using a stream (a Java inner
>> class that implements a stream Iterator). Given that a map-reduce
>> Job's output is rarely accessed more than once, the translation of
>> most map-reduce jobs to Hama will not require keeping any data in
>> memory beyond what the map-reduce jobs themselves use. One exception
>> is graph data, which needs to persist in memory across all jobs
>> (then count=maxint).
>>
>> Based on my experience with MRQL, the implementation of these ideas
>> may need up to 1K lines of Java code. Let me know if you are
>> interested.
>> Leonidas Fegaras
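
To make the shuffle-by-hashing step concrete, below is a minimal
sketch of word count written as a single Hama BSP job along the lines
Leonidas describes: a "map" superstep that sends each (key, value)
pair to the peer chosen by hashing the key, a sync barrier, and a
"reduce" superstep that groups incoming messages by key in memory.
This is an illustration under assumptions, not code from the thread:
it assumes the Hama 0.5-era BSP API (org.apache.hama.bsp.BSP and
BSPPeer), the class name WordCountBSP is made up, and messages are
encoded as tab-separated Text strings purely for simplicity.

    import java.io.IOException;
    import java.util.HashMap;
    import java.util.Map;

    import org.apache.hadoop.io.IntWritable;
    import org.apache.hadoop.io.LongWritable;
    import org.apache.hadoop.io.Text;
    import org.apache.hama.bsp.BSP;
    import org.apache.hama.bsp.BSPPeer;
    import org.apache.hama.bsp.sync.SyncException;

    // Sketch: one map-reduce job as two supersteps of a single BSP job.
    public class WordCountBSP extends
        BSP<LongWritable, Text, Text, IntWritable, Text> {

      @Override
      public void bsp(
          BSPPeer<LongWritable, Text, Text, IntWritable, Text> peer)
          throws IOException, SyncException, InterruptedException {
        // "Map" superstep: read the local split and shuffle (word, 1)
        // pairs to the peer that owns each key.
        LongWritable offset = new LongWritable();
        Text line = new Text();
        while (peer.readNext(offset, line)) {
          for (String word : line.toString().split("\\s+")) {
            if (word.isEmpty()) continue;
            // Mask the sign bit so the peer index is non-negative.
            String target = peer.getPeerName(
                (word.hashCode() & Integer.MAX_VALUE) % peer.getNumPeers());
            peer.send(target, new Text(word + "\t1"));
          }
        }
        peer.sync();  // barrier: all messages delivered, shuffle done

        // "Reduce" superstep: group received messages by key in memory,
        // then apply the reduce logic (here, summation).
        Map<String, Integer> counts = new HashMap<String, Integer>();
        Text msg;
        while ((msg = peer.getCurrentMessage()) != null) {
          String[] kv = msg.toString().split("\t", 2);
          Integer sum = counts.get(kv[0]);
          counts.put(kv[0],
              (sum == null ? 0 : sum) + Integer.parseInt(kv[1]));
        }
        for (Map.Entry<String, Integer> e : counts.entrySet()) {
          peer.write(new Text(e.getKey()), new IntWritable(e.getValue()));
        }
      }
    }

A multi-job workflow such as PageRank would repeat this map/sync/reduce
pattern as further supersteps of the same BSPJob, which is exactly what
avoids the per-job HDFS round trips Leonidas objects to.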
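The path_name-to-(count, vector) bookkeeping could look roughly like
the sketch below. Every name here (IntermediateCache, put, scan) is
hypothetical, and it materializes each partition as a List for
simplicity; in the count=1 case the real design would hand out a live
stream Iterator instead of ever building the list, so this only shows
the scan-budget accounting.

    import java.util.HashMap;
    import java.util.Iterator;
    import java.util.List;
    import java.util.Map;

    // Per-node cache of input/intermediate partitions, keyed by the
    // HDFS path name, each with a bounded number of allowed scans.
    public class IntermediateCache<T> {

      private static class Entry<T> {
        int count;       // remaining scans before the data is reclaimed
        List<T> vector;  // this node's partition of the file's data
        Entry(int count, List<T> vector) {
          this.count = count;
          this.vector = vector;
        }
      }

      private final Map<String, Entry<T>> cache =
          new HashMap<String, Entry<T>>();

      // Register a partition under its path name with a scan budget;
      // graph data that must survive all jobs would use
      // count = Integer.MAX_VALUE.
      public void put(String pathName, int count, List<T> vector) {
        cache.put(pathName, new Entry<T>(count, vector));
      }

      // Hand out one scan of the partition; on the last permitted scan,
      // drop our reference so the vector can be garbage-collected.
      public Iterator<T> scan(String pathName) {
        Entry<T> e = cache.get(pathName);
        Iterator<T> it = e.vector.iterator();
        if (--e.count <= 0) {
          cache.remove(pathName);
        }
        return it;
      }
    }

Since most map-reduce job outputs are consumed exactly once, most
entries would enter this cache with count=1 and be released after a
single scan, which matches the claim that the translation rarely pins
extra data in memory.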