Date: Fri, 23 May 2014 05:44:31 +0200
From: Sylvain Gault
To: Marcos Ortiz
Cc: user@hadoop.apache.org, arun c murthy
Subject: Re: MapReduce scalability study

On Thu, May 22, 2014 at 04:47:28PM -0400, Marcos Ortiz wrote:
> On Thursday, May 22, 2014 10:17:42 PM Sylvain Gault wrote:
> > Hello,
> >
> > I'm new to this mailing list, so forgive me if I don't do everything
> > right.
> >
> > I didn't know whether I should ask on this mailing list or on
> > mapreduce-dev or on yarn-dev, so I'll just start here. ^^
> >
> > Short story: I'm looking for some paper(s) studying the scalability
> > of Hadoop MapReduce, and I have found this extremely difficult to find
> > on Google Scholar. Do you have something worth citing in a PhD thesis?
> >
> > Long story: I'm writing my PhD thesis about MapReduce, and when I talk
> > about Hadoop I'd like to say "how much it scales". I heard two years
> > ago some people say that "Yahoo! got it to scale up to 4000 nodes and
> > plans to try 6000 nodes", or something like that. I also heard that
> > YARN/MRv2 should scale better, but I don't plan to talk much about
> > YARN/MRv2. So I'd take anything I could cite as a reference in my
> > manuscript. :)
>
> Hello, Sylvain.
>
> One of the reasons the Hadoop dev team began to work on YARN was precisely
> to get a more scalable and resource-efficient Hadoop system, so if you
> actually want to talk about Hadoop scalability, you should talk about YARN
> and MR2.
>
> The paper is here:
> https://developer.yahoo.com/blogs/hadoop/next-generation-apache-hadoop-mapreduce-3061.html

This was a very interesting read. Maybe not very academic, but if that's all we have, I'll take it.
I also found these:
https://developer.yahoo.com/blogs/hadoop/scaling-hadoop-4000-nodes-yahoo-410.html
https://developer.yahoo.com/blogs/hadoop/hadoop-sorts-petabyte-16-25-hours-terabyte-62-422.html

Somehow I was expecting that someone had done a real scalability study comparing MRv1 and MRv2: measuring the total time of several benchmarks at 1000, 2000, ..., 6000 nodes and plotting some curves. :) But that's just how I would have done it (a rough sketch of what I mean is in the postscript below). :)

> You should talk with Arun C Murthy, Chief Architect at Hortonworks, about
> all these topics. He could help you much more than I could.

I'm convinced that would be very, very interesting. But I don't have much time to spend on understanding Hadoop, and I still have several chapters to write. :) I almost have everything I need to know about Hadoop. But when I'm done, I may also ask people here to proofread what I wrote about it. :)

Sylvain
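
P.S. For what it's worth, here is roughly how I would script such a study. This is only a sketch under assumptions of my own, not anything from this thread: it uses TeraGen/TeraSort from the stock Hadoop examples jar (the jar path below is a guess and varies by distribution), it picks an arbitrary 1 TB input size, and it assumes the cluster has already been resized to each target node count before the corresponding run; the resizing itself (decommissioning/recommissioning workers) is out of scope here.

#!/usr/bin/env python
# Hypothetical driver for a MapReduce scaling study: time TeraSort at
# several cluster sizes and record the results for later plotting.
# Assumptions: `hadoop` is on the PATH, the examples jar lives at the
# path below, and the cluster is resized externally between iterations.

import csv
import subprocess
import time

EXAMPLES_JAR = "/usr/lib/hadoop-mapreduce/hadoop-mapreduce-examples.jar"  # assumed path
ROWS = 10 * 1000 * 1000 * 1000  # 10^10 rows of 100 bytes each, i.e. ~1 TB

def run_terasort(tag):
    """Generate the input, run the sort, and return the sort's wall-clock time."""
    subprocess.check_call(["hadoop", "jar", EXAMPLES_JAR, "teragen",
                           str(ROWS), "/bench/%s/in" % tag])
    start = time.time()
    subprocess.check_call(["hadoop", "jar", EXAMPLES_JAR, "terasort",
                           "/bench/%s/in" % tag, "/bench/%s/out" % tag])
    return time.time() - start

def main():
    with open("scaling.csv", "w") as f:
        writer = csv.writer(f)
        writer.writerow(["nodes", "seconds"])
        for nodes in [1000, 2000, 3000, 4000, 5000, 6000]:
            # Not shown: resize the cluster to `nodes` workers before this run.
            elapsed = run_terasort("n%d" % nodes)
            writer.writerow([nodes, elapsed])
            print("%d nodes: %.0f s" % (nodes, elapsed))

if __name__ == "__main__":
    main()

One would then run this once per cluster configuration (MRv1 and MRv2) and plot seconds against nodes from the two scaling.csv files to get the comparison curves I had in mind.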