spark-dev mailing list archives

From Steve Nunez <snu...@hortonworks.com>
Subject Re: Breaking the previous large-scale sort record with Spark
Date Fri, 10 Oct 2014 16:17:46 GMT
Great stuff. Wonderful to see such progress in so short a time.

How about some links to code and instructions so that these benchmarks can
be reproduced?
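In the meantime, a quick back-of-the-envelope check of the headline ratios, using only the figures quoted in the announcement (72 minutes on 2100 nodes for Hadoop MapReduce vs. 23 minutes on 206 nodes for Spark):

```python
# Sanity-check the quoted benchmark ratios. All figures are from the
# announcement; the 100 TB data size cancels out of the ratios.
hadoop_minutes, hadoop_nodes = 72, 2100
spark_minutes, spark_nodes = 23, 206

speedup = hadoop_minutes / spark_minutes    # wall-clock speedup, ~3.1x
node_ratio = hadoop_nodes / spark_nodes     # fewer nodes used, ~10.2x
per_node_gain = speedup * node_ratio        # per-node throughput, ~32x

print(f"{speedup:.1f}x faster on {node_ratio:.1f}x fewer nodes "
      f"(~{per_node_gain:.0f}x more work per node)")
```

So the "3x faster on 10x fewer nodes" claim works out to roughly 30x more sorting throughput per node, which matches the blog post's framing.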

Regards,
- Steve

From:  Debasish Das <debasish.das83@gmail.com>
Date:  Friday, October 10, 2014 at 8:17
To:  Matei Zaharia <matei.zaharia@gmail.com>
Cc:  user <user@spark.apache.org>, dev <dev@spark.apache.org>
Subject:  Re: Breaking the previous large-scale sort record with Spark

> Awesome news, Matei!
> 
> Congratulations to the databricks team and all the community members...
> 
> On Fri, Oct 10, 2014 at 7:54 AM, Matei Zaharia <matei.zaharia@gmail.com>
> wrote:
>> Hi folks,
>> 
>> I interrupt your regularly scheduled user / dev list to bring you some pretty
>> cool news for the project, which is that we've been able to use Spark to
>> break MapReduce's 100 TB and 1 PB sort records, sorting data 3x faster on 10x
>> fewer nodes. There's a detailed writeup at
>> http://databricks.com/blog/2014/10/10/spark-breaks-previous-large-scale-sort-record.html.
>> Summary: while Hadoop MapReduce held last year's 100 TB world
>> record by sorting 100 TB in 72 minutes on 2100 nodes, we sorted it in 23
>> minutes on 206 nodes; and we also scaled up to sort 1 PB in 234 minutes.
>> 
>> I want to thank Reynold Xin for leading this effort over the past few weeks,
>> along with Parviz Deyhim, Xiangrui Meng, Aaron Davidson and Ali Ghodsi. In
>> addition, we'd really like to thank Amazon's EC2 team for providing the
>> machines to make this possible. Finally, this result would of course not be
>> possible without the many, many other contributions, testing and feature
>> requests from throughout the community.
>> 
>> For an engine to scale from these multi-hour petabyte batch jobs down to
>> 100-millisecond streaming and interactive queries is quite uncommon, and it's
>> thanks to all of you folks that we are able to make this happen.
>> 
>> Matei
>> ---------------------------------------------------------------------
>> To unsubscribe, e-mail: user-unsubscribe@spark.apache.org
>> For additional commands, e-mail: user-help@spark.apache.org
>> 
> 



