spark-dev mailing list archives

From Nicholas Chammas <nicholas.cham...@gmail.com>
Subject Re: Surprising Spark SQL benchmark
Date Sat, 01 Nov 2014 17:50:47 GMT
Kay,

Is this effort related to the existing AMPLab Big Data benchmark that
covers Spark, Redshift, Tez, and Impala?

Nick


On Friday, October 31, 2014, Kay Ousterhout <keo@eecs.berkeley.edu> wrote:

> There's been an effort in the AMPLab at Berkeley to set up a shared
> codebase that makes it easy to run TPC-DS on SparkSQL, since it's something
> we do frequently in the lab to evaluate new research.  Based on this
> thread, it sounds like making this more widely-available is something that
> would be useful to folks for reproducing the results published by
> Databricks / Hortonworks / Cloudera / etc.; we'll share the code on the
> list as soon as we're done.
>
> -Kay
>
> On Fri, Oct 31, 2014 at 12:45 PM, Nicholas Chammas <nicholas.chammas@gmail.com> wrote:
>
>> I believe that benchmark has a pending certification on it. See
>> http://sortbenchmark.org under "Process".
>>
>> It's true they did not share enough details on the blog for readers to
>> reproduce the benchmark, but they will have to share enough with the
>> committee behind the benchmark in order to be certified. Given that this is
>> a benchmark not many people will be able to reproduce due to size and
>> complexity, I don't see it as a big negative that the details are not laid
>> out as long as there is independent certification from a third party.
>>
>> From what I've seen so far, the best big data benchmark anywhere is this:
>> https://amplab.cs.berkeley.edu/benchmark/
>>
>> It has all the details you'd expect, including hosted datasets, to allow
>> anyone to reproduce the full benchmark, covering a number of systems. I
>> look forward to the next update to that benchmark (a lot has changed since
>> Feb). And from what I can tell, it's produced by the same people behind
>> Spark (Patrick being among them).
>>
>> So I disagree that the Spark community "hasn't been any better" in this
>> regard.
>>
>> Nick
>>
>>
>> On Friday, October 31, 2014, Steve Nunez <snunez@hortonworks.com> wrote:
>>
>> > To be fair, we (Spark community) haven’t been any better, for example
>> > this benchmark:
>> >
>> >         https://databricks.com/blog/2014/10/10/spark-petabyte-sort.html
>> >
>> >
>> > For which no details or code have been released to allow others to
>> > reproduce it. I would encourage anyone doing a Spark benchmark in future
>> > to avoid the stigma of vendor reported benchmarks and publish enough
>> > information and code to let others repeat the exercise easily.
>> >
>> >         - Steve
>> >
>> >
>> >
>> > On 10/31/14, 11:30, "Nicholas Chammas" <nicholas.chammas@gmail.com> wrote:
>> >
>> > >Thanks for the response, Patrick.
>> > >
>> > >I guess the key takeaways are 1) the tuning/config details are everything
>> > >(they're not laid out here), 2) the benchmark should be reproducible (it's
>> > >not), and 3) reach out to the relevant devs before publishing (didn't
>> > >happen).
>> > >
>> > >Probably key takeaways for any kind of benchmark, really...
>> > >
>> > >Nick
>> > >
>> > >
>> > >On Friday, October 31, 2014, Patrick Wendell <pwendell@gmail.com> wrote:
>> > >
>> > >> Hey Nick,
>> > >>
>> > >> Unfortunately Citus Data didn't contact any of the Spark or Spark SQL
>> > >> developers when running this. It is really easy to make one system
>> > >> look better than others when you are running a benchmark yourself
>> > >> because tuning and sizing can lead to a 10X performance improvement.
>> > >> This benchmark doesn't share the mechanism in a reproducible way.
>> > >>
>> > >> There are a bunch of things that aren't clear here:
>> > >>
>> > >> 1. Spark SQL has optimized Parquet features; were these turned on?
>> > >> 2. It doesn't mention computing statistics in Spark SQL, but it does
>> > >> this for Impala and Parquet. Statistics allow Spark SQL to broadcast
>> > >> small tables which can make a 10X difference in TPC-H.
>> > >> 3. For data larger than memory, Spark SQL often performs better if you
>> > >> don't call "cache"; did they try this?
>> > >>
>> > >> Basically, a self-reported marketing benchmark like this that
>> > >> *shocker* concludes this vendor's solution is the best, is not
>> > >> particularly useful.
>> > >>
>> > >> If Citus Data wants to run a credible benchmark, I'd invite them to
>> > >> directly involve Spark SQL developers in the future. Until then, I
>> > >> wouldn't give much credence to this or any other similar vendor
>> > >> benchmark.
>> > >>
>> > >> - Patrick
>> > >>
>> > >> On Fri, Oct 31, 2014 at 10:38 AM, Nicholas Chammas
>> > >> <nicholas.chammas@gmail.com> wrote:
>> > >> > I know we don't want to be jumping at every benchmark someone posts out
>> > >> > there, but this one surprised me:
>> > >> >
>> > >> >
>> > >> > http://www.citusdata.com/blog/86-making-postgresql-scale-hadoop-style
>> > >> >
>> > >> > This benchmark has Spark SQL failing to complete several queries in the
>> > >> > TPC-H benchmark. I don't understand much about the details of performing
>> > >> > benchmarks, but this was surprising.
>> > >> >
>> > >> > Are these results expected?
>> > >> >
>> > >> > Related HN discussion here: https://news.ycombinator.com/item?id=8539678
>> > >> >
>> > >> > Nick
>> > >>
>> >
>> >
>>
>
>
