spark-dev mailing list archives

From Matei Zaharia <matei.zaha...@gmail.com>
Subject Re: Breaking the previous large-scale sort record with Spark
Date Thu, 06 Nov 2014 00:02:51 GMT
Congrats to everyone who helped make this happen. And if anyone has even more machines they'd
like us to run on next year, let us know :).

Matei

> On Nov 5, 2014, at 3:11 PM, Reynold Xin <rxin@databricks.com> wrote:
> 
> Hi all,
> 
> We are excited to announce that the benchmark entry has been reviewed by
> the Sort Benchmark committee and Spark has officially won the Daytona
> GraySort contest in sorting 100TB of data.
> 
> Our entry tied with a UCSD research team building high performance systems
> and we jointly set a new world record. This is an important milestone for
> the project, as it validates the amount of engineering work put into Spark
> by the community.
> 
> As Matei said, "For an engine to scale from these multi-hour petabyte batch
> jobs down to 100-millisecond streaming and interactive queries is quite
> uncommon, and it's thanks to all of you folks that we are able to make this
> happen."
> 
> Updated blog post:
> http://databricks.com/blog/2014/11/05/spark-officially-sets-a-new-record-in-large-scale-sorting.html
> 
> On Fri, Oct 10, 2014 at 7:54 AM, Matei Zaharia <matei.zaharia@gmail.com>
> wrote:
> 
>> Hi folks,
>> 
>> I interrupt your regularly scheduled user / dev list to bring you some
>> pretty cool news for the project, which is that we've been able to use
>> Spark to break MapReduce's 100 TB and 1 PB sort records, sorting data 3x
>> faster on 10x fewer nodes. There's a detailed writeup at
>> http://databricks.com/blog/2014/10/10/spark-breaks-previous-large-scale-sort-record.html.
>> Summary: while Hadoop MapReduce held last year's 100 TB world record by
>> sorting 100 TB in 72 minutes on 2100 nodes, we sorted it in 23 minutes on
>> 206 nodes; and we also scaled up to sort 1 PB in 234 minutes.
>> 
>> I want to thank Reynold Xin for leading this effort over the past few
>> weeks, along with Parviz Deyhim, Xiangrui Meng, Aaron Davidson and Ali
>> Ghodsi. In addition, we'd really like to thank Amazon's EC2 team for
>> providing the machines to make this possible. Finally, this result would of
>> course not be possible without the many, many other contributions, testing
>> and feature requests from throughout the community.
>> 
>> For an engine to scale from these multi-hour petabyte batch jobs down to
>> 100-millisecond streaming and interactive queries is quite uncommon, and
>> it's thanks to all of you folks that we are able to make this happen.
>> 
>> Matei


---------------------------------------------------------------------
To unsubscribe, e-mail: dev-unsubscribe@spark.apache.org
For additional commands, e-mail: dev-help@spark.apache.org

