From: Leonidas Fegaras
To: user@hama.apache.org
Subject: Re: Implementing Hadoop map-reduce on Hama
Date: Thu, 11 Oct 2012 11:37:24 -0500

Hi Apurv,
Allow me to disagree. For complex workflows and repetitive map-reduce
jobs, such as PageRank, a Hama implementation of map-reduce would be
far superior. I haven't looked at your code yet, but I hope it does not
use HDFS or in-memory storage for the I/O of each map-reduce job. It
must use streaming to pass results across map-reduce jobs. If you are
already doing this, there is no point in me stepping in. But I really
think this would be very useful in practice, not just for
demonstration.
Best regards,
Leonidas Fegaras

On Oct 11, 2012, at 11:04 AM, Apurv Verma wrote:

> Hey Leonidas,
> Glad that you are thinking about it. IMO it would be good to have
> such a functionality for demonstration purposes, but only for
> demonstration purposes. It's best to let Hadoop do what it's best at
> and Hama do what it's best at. ;) Suraj and I tried our hands at it
> in the past; please see [0] and [1]. I was also working on such a
> module on my GitHub account, but I couldn't find much time.
> Basically, it is mostly the way you expressed it in your last mail.
> Do you have a GitHub account? I have already done some work on
> GitHub; I could share it with you and divide the work, if you want.
>
> [0] http://code.google.com/p/anahad/source/browse/trunk/src/main/java/org/anahata/bsp/WordCount.java
>     My POC of the WordCount example on Hama. Super dirty and not
>     generic, but it works.
>
> [1] https://github.com/ssmenon/hama
>     Suraj's in-memory implementation.
>
> Let me know what you think.
>
> --
> Regards,
> Apurv Verma
>
>
> On Thu, Oct 11, 2012 at 8:45 PM, Leonidas Fegaras wrote:
>
>> I have seen some emails in this mailing list asking questions such
>> as: "I have an X algorithm running on Hadoop map-reduce. Is it
>> suitable for Hama?" I think it would be great if we had a good
>> implementation of the Hadoop map-reduce classes on Hama. Other
>> distributed main-memory systems have already done so; see M3R
>> (http://vldb.org/pvldb/vol5/p1736_avrahamshinnar_vldb2012.pdf) and
>> Spark.
>>
>> It is actually easier than you think. I have done something similar
>> for my query system, MRQL. What we need is to reimplement
>> org.apache.hadoop.mapreduce.Job to execute one superstep for each
>> map-reduce job. Then a Hadoop map-reduce program that contains
>> complex workflows and/or loops of map-reduce jobs would need only
>> minor changes to run on Hama as a single BSPJob. To implement
>> map-reduce on Hama, the mapper output can be shuffled to reducers by
>> key, by sending each pair as a message to the peer
>>
>>   peer.getPeerName(key.hashCode() % peer.getNumPeers())
>>
>> Then the reducer superstep groups the received data by key in memory
>> and applies the reduce method. To handle input and intermediate
>> data, we can use a mapping from path_name to (count, vector) at each
>> node: path_name is the path name of some input or intermediate HDFS
>> file, vector holds the data partition from this file assigned to the
>> node, and count is the maximum number of times we can scan this
>> vector (after count scans, the vector is garbage-collected). The
>> special case count=1 can be implemented using a stream (a Java inner
>> class that implements a stream Iterator). Given that a map-reduce
>> Job's output is rarely accessed more than once, the translation of
>> most map-reduce jobs to Hama will not require keeping any data in
>> memory beyond what the map-reduce jobs themselves use. One exception
>> is graph data, which needs to persist in memory across all jobs
>> (then count=maxint).
>>
>> Based on my experience with MRQL, the implementation of these ideas
>> may need up to 1K lines of Java code. Let me know if you are
>> interested.
>> Leonidas Fegaras
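
To make the shuffle-by-hashing step concrete, below is a minimal
sketch of word count written as a single Hama BSP job along the lines
Leonidas describes: a "map" superstep that sends each (key, value)
pair to the peer chosen by hashing the key, a sync barrier, and a
"reduce" superstep that groups incoming messages by key in memory.
This is an illustration under assumptions, not code from the thread:
it assumes the Hama 0.5-era BSP API (org.apache.hama.bsp.BSP and
BSPPeer), the class name WordCountBSP is made up, and messages are
encoded as tab-separated Text strings purely for simplicity.

    import java.io.IOException;
    import java.util.HashMap;
    import java.util.Map;

    import org.apache.hadoop.io.IntWritable;
    import org.apache.hadoop.io.LongWritable;
    import org.apache.hadoop.io.Text;
    import org.apache.hama.bsp.BSP;
    import org.apache.hama.bsp.BSPPeer;
    import org.apache.hama.bsp.sync.SyncException;

    // Sketch: one map-reduce job as two supersteps of a single BSP job.
    public class WordCountBSP extends
        BSP<LongWritable, Text, Text, IntWritable, Text> {

      @Override
      public void bsp(
          BSPPeer<LongWritable, Text, Text, IntWritable, Text> peer)
          throws IOException, SyncException, InterruptedException {
        // "Map" superstep: read the local split and shuffle (word, 1)
        // pairs to the peer that owns each key.
        LongWritable offset = new LongWritable();
        Text line = new Text();
        while (peer.readNext(offset, line)) {
          for (String word : line.toString().split("\\s+")) {
            if (word.isEmpty()) continue;
            // Mask the sign bit so the peer index is non-negative.
            String target = peer.getPeerName(
                (word.hashCode() & Integer.MAX_VALUE) % peer.getNumPeers());
            peer.send(target, new Text(word + "\t1"));
          }
        }
        peer.sync();  // barrier: all messages delivered, shuffle done

        // "Reduce" superstep: group received messages by key in memory,
        // then apply the reduce logic (here, summation).
        Map<String, Integer> counts = new HashMap<String, Integer>();
        Text msg;
        while ((msg = peer.getCurrentMessage()) != null) {
          String[] kv = msg.toString().split("\t", 2);
          Integer sum = counts.get(kv[0]);
          counts.put(kv[0],
              (sum == null ? 0 : sum) + Integer.parseInt(kv[1]));
        }
        for (Map.Entry<String, Integer> e : counts.entrySet()) {
          peer.write(new Text(e.getKey()), new IntWritable(e.getValue()));
        }
      }
    }

A multi-job workflow such as PageRank would repeat this map/sync/reduce
pattern as further supersteps of the same BSPJob, which is exactly what
avoids the per-job HDFS round trips Leonidas objects to.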
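The path_name-to-(count, vector) bookkeeping could look roughly like
the sketch below. Every name here (IntermediateCache, put, scan) is
hypothetical, and it materializes each partition as a List for
simplicity; in the count=1 case the real design would hand out a live
stream Iterator instead of ever building the list, so this only shows
the scan-budget accounting.

    import java.util.HashMap;
    import java.util.Iterator;
    import java.util.List;
    import java.util.Map;

    // Per-node cache of input/intermediate partitions, keyed by the
    // HDFS path name, each with a bounded number of allowed scans.
    public class IntermediateCache<T> {

      private static class Entry<T> {
        int count;       // remaining scans before the data is reclaimed
        List<T> vector;  // this node's partition of the file's data
        Entry(int count, List<T> vector) {
          this.count = count;
          this.vector = vector;
        }
      }

      private final Map<String, Entry<T>> cache =
          new HashMap<String, Entry<T>>();

      // Register a partition under its path name with a scan budget;
      // graph data that must survive all jobs would use
      // count = Integer.MAX_VALUE.
      public void put(String pathName, int count, List<T> vector) {
        cache.put(pathName, new Entry<T>(count, vector));
      }

      // Hand out one scan of the partition; on the last permitted scan,
      // drop our reference so the vector can be garbage-collected.
      public Iterator<T> scan(String pathName) {
        Entry<T> e = cache.get(pathName);
        Iterator<T> it = e.vector.iterator();
        if (--e.count <= 0) {
          cache.remove(pathName);
        }
        return it;
      }
    }

Since most map-reduce job outputs are consumed exactly once, most
entries would enter this cache with count=1 and be released after a
single scan, which matches the claim that the translation rarely pins
extra data in memory.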