hive-user mailing list archives

From Xuefu Zhang <>
Subject Re: Hive on Spark VS Spark SQL
Date Wed, 20 May 2015 17:45:25 GMT
I have been working on Hive on Spark, and know a little about SparkSQL.
Here are a few factors to be considered:

1. SparkSQL is similar to Shark (discontinued) in that it clones Hive's
front end (parser and semantic analyzer) and metastore, and injects in
between a layer where Hive's operator tree is reinterpreted in Spark's
constructs (transformations and actions). Thus, it's tied to a specific
version of Hive, which is always behind official Hive releases.
2. Because of the reinterpretation, many features (window functions,
lateral views, etc.) from Hive need to be reimplemented in the Spark world.
If an implementation hasn't been done, you see a gap. That's why you would
expect functional disparity, not to mention future Hive features.
3. SparkSQL is far from production ready.
4. On the other hand, Hive on Spark is native in Hive, embracing all Hive
features and growing with Hive. Hive's operators are honored without
re-interpretation. The integration is done at the execution layer, where
Spark is nothing but an advanced MapReduce engine.
5. Hive is aiming at enterprise use cases, where concerns such as security
matter more than purely whether it works or whether it runs fast. Hive
on Spark certainly makes the query run faster, but still keeps the same
6. SparkSQL is a good fit if you're a heavy Spark user who occasionally
needs to run some SQL. Or you're a casual SQL user and like to try
something new.
7. If you haven't touched either Spark or Hive, I'd suggest you start with
Hive, especially for an enterprise.
8. If you're an existing Hive user and want to take advantage of Spark,
consider Hive on Spark.
9. Mixing Hive and SparkSQL in your deployment is strongly discouraged.
SparkSQL includes a version of Hive, which is very likely different from
the version of Hive that you have (even if you don't use Hive on Spark).
Library conflicts can put you in a nightmare.
10. I haven't benchmarked SparkSQL myself, but I have heard several reports
that SparkSQL, when tried at scale, either runs fast or fails your queries.
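
To make point 1 a bit more concrete: Spark's constructs are generally split
into lazy transformations (which only record what to do) and actions (which
trigger execution). Below is a toy sketch of that distinction in plain Python.
It is not the Spark API and not how SparkSQL is implemented; the class
LazyDataset and its methods are hypothetical names made up for illustration.

```python
# Toy model only: this is NOT the Spark API. LazyDataset and its methods
# (map, filter, collect, count) are hypothetical stand-ins for illustration.
class LazyDataset:
    def __init__(self, source, steps=()):
        self._source = source   # any re-iterable input, e.g. a list
        self._steps = steps     # recorded transformations, not yet executed

    # Transformations: record the step and return a new dataset; no work runs.
    def map(self, fn):
        return LazyDataset(self._source, self._steps + (("map", fn),))

    def filter(self, pred):
        return LazyDataset(self._source, self._steps + (("filter", pred),))

    def _run(self):
        # Chain the recorded steps over the source as lazy iterators.
        rows = iter(self._source)
        for kind, fn in self._steps:
            rows = map(fn, rows) if kind == "map" else filter(fn, rows)
        return rows

    # Actions: execute the recorded plan and return a concrete result.
    def collect(self):
        return list(self._run())

    def count(self):
        return sum(1 for _ in self._run())


calls = []

def double(x):
    calls.append(x)   # record when the function is actually invoked
    return x * 2

plan = LazyDataset([1, 2, 3, 4]).map(double).filter(lambda x: x > 4)
calls_before_action = len(calls)   # still 0: only a plan has been built
result = plan.collect()            # the action runs the whole plan
```

In this sketch, building the plan invokes nothing; only collect() or count()
does. An engine that reinterprets Hive's operator tree has to express each
Hive operator in terms like these, which is where the gaps in point 2 come
from.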

Hope this helps.


On Tue, May 19, 2015 at 10:38 PM, <> wrote:

> Hive on Spark and SparkSQL: which is better, and what are the key
> characteristics, advantages, and disadvantages of each?
> ------------------------------
