Subject: Re: Chaining Multiple Reducers: Reduce -> Reduce -> Reduce
From: Jim Twensky
To: user@hadoop.apache.org
Date: Mon, 8 Oct 2012 15:53:06 -0500

Hi Mike,

I'm already doing that, but the output of the reduce goes straight back to
HDFS to be consumed by the next Identity Mapper. Combiners just reduce the
amount of data between map and reduce, whereas I'm looking for an
optimization between reduce and map.
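To make that concrete, here is roughly what the driver loop I'm describing
looks like. It is a stripped-down sketch, not my real job: the paths, the
stub reducer, and the fixed iteration count are placeholders. The only point
is that every step in the chain is a separate job whose identity map
re-reads the previous reduce output from HDFS before the next shuffle/sort.

import java.io.IOException;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.Reducer;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.input.SequenceFileInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;
import org.apache.hadoop.mapreduce.lib.output.SequenceFileOutputFormat;

public class ReduceChainDriver {

  // Placeholder for the real reduce logic; it just forwards its input.
  public static class StubReducer extends Reducer<Text, Text, Text, Text> {
    @Override
    protected void reduce(Text key, Iterable<Text> values, Context context)
        throws IOException, InterruptedException {
      for (Text value : values) {
        context.write(key, value);
      }
    }
  }

  public static void main(String[] args) throws Exception {
    Configuration conf = new Configuration();
    // Assumes the initial input is already a SequenceFile of (Text, Text).
    Path input = new Path("/tmp/graph/input");
    for (int i = 0; i < 5; i++) {          // really: loop until convergence
      Path output = new Path("/tmp/graph/iter-" + i);
      Job job = new Job(conf, "reduce-chain-" + i);
      job.setJarByClass(ReduceChainDriver.class);
      // Mapper.class is the identity mapper: it only re-reads the previous
      // reduce output from HDFS and feeds it to the next shuffle and sort.
      job.setMapperClass(Mapper.class);
      job.setReducerClass(StubReducer.class);
      job.setOutputKeyClass(Text.class);
      job.setOutputValueClass(Text.class);
      job.setInputFormatClass(SequenceFileInputFormat.class);
      job.setOutputFormatClass(SequenceFileOutputFormat.class);
      FileInputFormat.addInputPath(job, input);
      FileOutputFormat.setOutputPath(job, output);
      if (!job.waitForCompletion(true)) {
        System.exit(1);
      }
      input = output;  // the next identity map pass reads this from HDFS
    }
  }
}

That HDFS round trip between the reduce of one job and the identity map of
the next is exactly the part I would like to eliminate.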
Jim

On Mon, Oct 8, 2012 at 2:19 PM, Michael Segel wrote:
> Well, I was thinking ...
>
> Map -> Combiner -> Reducer -> Identity Mapper -> combiner -> reducer ->
> Identity Mapper -> combiner -> reducer...
>
> May make things easier.
>
> HTH
>
> -Mike
>
> On Oct 8, 2012, at 2:09 PM, Jim Twensky wrote:
>
>> Thank you for the comments. Some similar frameworks I looked at
>> include Haloop, Twister, Hama, Giraph and Cascading. I am also doing
>> large-scale graph processing, so I assumed one of them could serve the
>> purpose. Here is a summary of what I found out about them that is
>> relevant:
>>
>> 1) Haloop and Twister: They cache static data among a chain of
>> MapReduce jobs. The main contribution is to reduce the intermediate
>> data shipped from mappers to reducers. Still, the output of each
>> reduce goes to the file system.
>>
>> 2) Cascading: A higher-level API to create MapReduce workflows.
>> Anything you can do with Cascading can practically be done with more
>> programming effort using Hadoop only. Bypassing map and running a
>> chain of sort->reduce->sort->reduce jobs is not possible. Please
>> correct me if I'm wrong.
>>
>> 3) Giraph: Built on the BSP model and very similar to Pregel. I
>> couldn't find a detailed overview of its architecture, but my
>> understanding is that your data needs to fit in distributed memory,
>> which is also true for Pregel.
>>
>> 4) Hama: Also follows the BSP model. I don't know how the intermediate
>> data is serialized and passed to the next set of nodes, and whether it
>> is possible to do a performance optimization similar to what I am
>> asking for. If anyone who has used Hama can point to a few articles
>> about how the framework actually works and handles the messages passed
>> between vertices, I'd really appreciate that.
>>
>> Conclusion: None of the above tools can bypass the map step or do a
>> similar performance optimization. Of course, Giraph and Hama are built
>> on a different model - not really MapReduce - so it is not very
>> accurate to say that they don't have the required functionality.
>>
>> If I'm missing anything and/or if there are folks who have used Giraph
>> or Hama and think that they might serve the purpose, I'd be glad to
>> hear more.
>>
>> Jim
>>
>> On Mon, Oct 8, 2012 at 6:52 AM, Michael Segel wrote:
>>> I don't believe that Hama would suffice.
>>>
>>> In terms of M/R where you want to chain reducers...
>>> Can you chain combiners? (I don't think so, but you never know.)
>>>
>>> If not, you end up with a series of M/R jobs where the mappers are
>>> just identity mappers.
>>>
>>> Or you could use HBase, with a small caveat... you have to be careful
>>> not to use speculative execution, and, if a task fails, that the
>>> results won't be affected when it is run a second time. Meaning that
>>> it will just overwrite the data in a column with a second cell and
>>> that you don't care about the number of versions.
>>>
>>> Note: HBase doesn't have transactions, so you would have to think
>>> about how to tag cells so that if a task dies, upon restart, you can
>>> remove the affected cells. Along with some post-job synchronization...
>>>
>>> Again, HBase may work, but there may also be additional problems that
>>> could impact your results. It will have to be evaluated on a
>>> case-by-case basis.
>>>
>>> JMHO
>>>
>>> -Mike
>>>
>>> On Oct 8, 2012, at 6:35 AM, Edward J. Yoon wrote:
>>>
>>>>> call context.write() in my mapper class)? If not, are there any other
>>>>> MR platforms that can do this? I've been searching around and couldn't
>>>>
>>>> You can use Hama BSP[1] instead of Map/Reduce.
>>>>
>>>> No stable release yet, but I confirmed that a large graph with
>>>> billions of nodes and edges can be crunched in a few minutes[2].
>>>>
>>>> 1. http://hama.apache.org
>>>> 2. http://wiki.apache.org/hama/Benchmarks
>>>>
>>>> On Sat, Oct 6, 2012 at 1:31 AM, Jim Twensky wrote:
>>>>> Hi,
>>>>>
>>>>> I have a complex Hadoop job that iterates over large graph data
>>>>> multiple times until some convergence condition is met. I know that
>>>>> the map output goes to the local disk of each particular mapper
>>>>> first, and is then fetched by the reducers before the reduce tasks
>>>>> start.
>>>>> I can see that this is an overhead, and in theory we could ship the
>>>>> data directly from mappers to reducers, without serializing it to the
>>>>> local disk first. I understand that this step is necessary for fault
>>>>> tolerance and that it is an essential building block of MapReduce.
>>>>>
>>>>> In my application, the map process consists of identity mappers which
>>>>> read the input from HDFS and ship it to reducers. Essentially, what I
>>>>> am doing is applying chains of reduce jobs until the algorithm
>>>>> converges. My question is, can I bypass this serialization to local
>>>>> disk and ship the data from mappers to reducers immediately (as soon
>>>>> as I call context.write() in my mapper class)? If not, are there any
>>>>> other MR platforms that can do this? I've been searching around and
>>>>> couldn't see anything similar to what I need. The Hadoop Online
>>>>> Prototype has some similar functionality, but it hasn't been updated
>>>>> for a while.
>>>>>
>>>>> Note: I know about the ChainMapper and ChainReducer classes, but I
>>>>> don't want to chain multiple mappers in the same local node. I want to
>>>>> chain multiple reduce functions globally so the data flow looks like:
>>>>> Map -> Reduce -> Reduce -> Reduce, which means each reduce operation
>>>>> is followed by a shuffle and sort, essentially bypassing the map
>>>>> operation.
>>>>
>>>>
>>>> --
>>>> Best Regards, Edward J. Yoon
>>>> @eddieyoon