Date: Fri, 23 May 2014 05:44:31 +0200
From: Sylvain Gault
To: Marcos Ortiz
Cc: user@hadoop.apache.org, arun c murthy
Subject: Re: MapReduce scalability study

On Thu, May 22, 2014 at 04:47:28PM -0400, Marcos Ortiz wrote:
> On Thursday, May 22, 2014 10:17:42 PM Sylvain Gault wrote:
> > Hello,
> >
> > I'm new to this mailing list, so forgive me if I don't do everything
> > right.
> >
> > I didn't know whether I should ask on this mailing list or on
> > mapreduce-dev or on yarn-dev, so I'll just start here. ^^
> >
> > Short story: I'm looking for some paper(s) studying the scalability
> > of Hadoop MapReduce, and I have found this extremely difficult to find
> > on Google Scholar. Do you have something worth citing in a PhD thesis?
> >
> > Long story: I'm writing my PhD thesis about MapReduce, and when I talk
> > about Hadoop I'd like to say "how much it scales". I heard two years
> > ago some people say that "Yahoo! got it to scale up to 4000 nodes and
> > plans to try 6000 nodes", or something like that. I also heard that
> > YARN/MRv2 should scale better, but I don't plan to talk much about
> > YARN/MRv2. So I'd take anything I could cite as a reference in my
> > manuscript. :)
>
> Hello, Sylvain.
>
> One of the reasons the Hadoop dev team began to work on YARN was precisely
> to get a more scalable and resource-efficient Hadoop system, so if you
> actually want to talk about Hadoop scalability, you should talk about YARN
> and MR2.
>
> The paper is here:
> https://developer.yahoo.com/blogs/hadoop/next-generation-apache-hadoop-mapreduce-3061.html

This was a very interesting read. Maybe not very academic, but if that's all we have, I'll take it.
I also found these:
https://developer.yahoo.com/blogs/hadoop/scaling-hadoop-4000-nodes-yahoo-410.html
https://developer.yahoo.com/blogs/hadoop/hadoop-sorts-petabyte-16-25-hours-terabyte-62-422.html

Somehow I was expecting that someone had done a real scalability study comparing MRv1 and MRv2: measuring the total time of several benchmarks at 1000, 2000, ..., 6000 nodes and plotting some curves. :) But that's just how I would have done it (a rough sketch of what I mean is in the postscript below). :)

> You should talk with Arun C Murthy, Chief Architect at Hortonworks, about
> all these topics. He could help you much more than I could.

I'm convinced that would be very, very interesting. But I don't have much time to spend on understanding Hadoop, and I still have several chapters to write. :) I almost have everything I need to know about Hadoop. But when I'm done, I may also ask people here to proofread what I wrote about it. :)

Sylvain
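
P.S. For what it's worth, here is roughly how I would script such a study. This is only a sketch under assumptions of my own, not anything from this thread: it uses TeraGen/TeraSort from the stock Hadoop examples jar (the jar path below is a guess and varies by distribution), it picks an arbitrary 1 TB input size, and it assumes the cluster has already been resized to each target node count before the corresponding run; the resizing itself (decommissioning/recommissioning workers) is out of scope here.

#!/usr/bin/env python
# Hypothetical driver for a MapReduce scaling study: time TeraSort at
# several cluster sizes and record the results for later plotting.
# Assumptions: `hadoop` is on the PATH, the examples jar lives at the
# path below, and the cluster is resized externally between iterations.

import csv
import subprocess
import time

EXAMPLES_JAR = "/usr/lib/hadoop-mapreduce/hadoop-mapreduce-examples.jar"  # assumed path
ROWS = 10 * 1000 * 1000 * 1000  # 10^10 rows of 100 bytes each, i.e. ~1 TB

def run_terasort(tag):
    """Generate the input, run the sort, and return the sort's wall-clock time."""
    subprocess.check_call(["hadoop", "jar", EXAMPLES_JAR, "teragen",
                           str(ROWS), "/bench/%s/in" % tag])
    start = time.time()
    subprocess.check_call(["hadoop", "jar", EXAMPLES_JAR, "terasort",
                           "/bench/%s/in" % tag, "/bench/%s/out" % tag])
    return time.time() - start

def main():
    with open("scaling.csv", "w") as f:
        writer = csv.writer(f)
        writer.writerow(["nodes", "seconds"])
        for nodes in [1000, 2000, 3000, 4000, 5000, 6000]:
            # Not shown: resize the cluster to `nodes` workers before this run.
            elapsed = run_terasort("n%d" % nodes)
            writer.writerow([nodes, elapsed])
            print("%d nodes: %.0f s" % (nodes, elapsed))

if __name__ == "__main__":
    main()

One would then run this once per cluster configuration (MRv1 and MRv2) and plot seconds against nodes from the two scaling.csv files to get the comparison curves I had in mind.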