hive-user mailing list archives

From Cheolsoo Park <>
Subject Re: Hive on Spark VS Spark SQL
Date Fri, 22 May 2015 05:31:57 GMT
Hi Xuefu,

Thanks for the good comparison. I agree with most points, but #1 isn't true.

SparkSQL has its own parser (implemented with the Scala parser combinator
library), analyzer, and optimizer, although they're not as mature as Hive's.
What it depends on Hive for is the Metastore, CliDriver, the DDL parser, etc.


On Wed, May 20, 2015 at 10:45 AM, Xuefu Zhang <> wrote:

> I have been working on Hive on Spark, and know a little about SparkSQL.
> Here are a few factors to be considered:
> 1. SparkSQL is similar to Shark (discontinued) in that it clones Hive's
> front end (parser and semantic analyzer) and metastore, and injects in
> between a layer where Hive's operator tree is reinterpreted in Spark's
> constructs (transformations and actions). Thus, it's tied to a specific
> version of Hive, which is always behind official Hive releases.
> 2. Because of the reinterpretation, many features (window functions,
> lateral views, etc.) from Hive need to be reimplemented in the Spark world.
> If an implementation hasn't been done, you see a gap. That's why you would
> expect functional disparity, not to mention future Hive features.
> 3. SparkSQL is far from production ready.
> 4. On the other hand, Hive on Spark is native in Hive, embracing all Hive
> features and growing with Hive. Hive's operators are honored without
> re-interpretation. The integration is done at the execution layer, where
> Spark is nothing but an advanced MapReduce engine.
> 5. Hive is aiming at enterprise use cases, where concerns such as security
> matter more than merely whether a query works or runs fast. Hive on Spark
> certainly makes queries run faster, but still keeps the same
> enterprise-readiness.
> 6. SparkSQL is a good fit if you're a heavy Spark user who occasionally
> needs to run some SQL. Or you're a casual SQL user and like to try
> something new.
> 7. If you haven't touched either Spark or Hive, I'd suggest you start with
> Hive, especially for an enterprise.
> 8. If you're an existing Hive user and want to take advantage of Spark,
> consider Hive on Spark.
> 9. It's strongly discouraged to mix Hive and SparkSQL in your deployment.
> SparkSQL includes a version of Hive, which is very likely different from
> the version of Hive that you have (even if you don't use Hive on Spark).
> Library conflicts can put you in a nightmare.
> 10. I haven't benchmarked SparkSQL myself, but I have heard several reports
> that SparkSQL, when tried at scale, either runs fast or fails your
> queries.
> Hope this helps.
> Thanks,
> On Tue, May 19, 2015 at 10:38 PM, <> wrote:
>> Which is better, Hive on Spark or SparkSQL? What are the key
>> characteristics, advantages, and disadvantages of each?
