hadoop-common-user mailing list archives

From Sylvain Gault <sylvain.ga...@inria.fr>
Subject Re: MapReduce scalability study
Date Fri, 23 May 2014 03:44:31 GMT
On Thu, May 22, 2014 at 04:47:28PM -0400, Marcos Ortiz wrote:
> On Thursday, May 22, 2014 10:17:42 PM Sylvain Gault wrote:
> > Hello,
> >
> > I'm new to this mailing list, so forgive me if I don't do everything
> > right.
> >
> > I didn't know whether I should ask on this mailing list or on
> > mapreduce-dev or on yarn-dev. So I'll just start there. ^^
> >
> > Short story: I'm looking for some paper(s) studying the scalability
> > of Hadoop MapReduce, and they have been extremely difficult to find on
> > Google Scholar. Do you have something worth citing in a PhD thesis?
> >
> > Long story: I'm writing my PhD thesis about MapReduce and when I talk
> > about Hadoop I'd like to say "how much it scales". Two years ago I
> > heard some people say that "Yahoo! got it to scale up to 4000 nodes and
> > plans to try on 6000 nodes" or something like that. I also heard that
> > YARN/MRv2 should scale better, but I don't plan to talk much about
> > YARN/MRv2. So I'd take anything I could cite as a reference in my
> > manuscript. :)
> 
> Hello, Sylvain.
> 
> One of the reasons why the Hadoop dev team began to work on YARN was precisely
> to get a more scalable and resourceful Hadoop system, so if you actually
> want to talk about Hadoop scalability, you should talk about YARN and MR2.
> 
>  
> 
> The paper is here:
> 
> https://developer.yahoo.com/blogs/hadoop/next-generation-apache-hadoop-mapreduce-3061.html
> 

This was a very interesting read.
Maybe not very academic, but if that's all we've got, I'll take it.

I also found these:
https://developer.yahoo.com/blogs/hadoop/scaling-hadoop-4000-nodes-yahoo-410.html
https://developer.yahoo.com/blogs/hadoop/hadoop-sorts-petabyte-16-25-hours-terabyte-62-422.html

Somehow I was expecting that someone had done a real scalability study
comparing MRv2 and MRv1: measuring the total time of several benchmarks
on 1000, 2000, ... 6000 nodes, and plotting some curves. :)
But that's just how I would have done it. :)


> You should talk with Arun C Murthy, Chief Architect at Hortonworks about all
> these topics. He could help you much more than I could.

I'm convinced it would be very very interesting. But I do not have much
time to spend on understanding Hadoop and I still have several chapters
to write. :)

I have almost everything I need to know about Hadoop. But when I'm
done, I may also ask people here to proofread what I wrote about it. :)



Sylvain
