spark-dev mailing list archives

From Jörn Franke <>
Subject Re: spark sql versus interactive hive versus hive
Date Sat, 11 Feb 2017 08:22:15 GMT
I think this is a rather simplistic view. All of these tools do their computation in memory in the end.
For certain types of computation and usage patterns it makes sense to keep the data in memory.
For example, most machine learning approaches need to use the same data in several
iterative calculations. This is what Spark was designed for. Most aggregations/precalculations,
by contrast, use the data in memory only once. This is where Hive+Tez and, to a limited extent,
Spark can help. The third pattern, where users query the data interactively, i.e. many concurrent
users query the same or similar data very frequently, is addressed by Hive on Tez + LLAP,
Hive on Tez + Ignite, or Spark + Ignite (and there are other tools).
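
To make the difference between the first two patterns concrete, here is a toy model in plain Python. It is not Spark or Hive code: ScanCountingSource, iterative_estimate and single_pass_sum are hypothetical names invented for this sketch, and a "scan" simply stands in for reading a table from storage. It only counts how often the underlying data has to be read when an iterative job keeps the data in memory compared with when it does not.

```python
class ScanCountingSource:
    """Hypothetical stand-in for a table on storage; counts full scans."""

    def __init__(self, rows):
        self._rows = list(rows)
        self.scans = 0

    def read(self):
        # Every call models one full read of the data from storage.
        self.scans += 1
        return list(self._rows)


def iterative_estimate(source, iterations, cache):
    """Iterative job: every pass needs the complete data set again."""
    cached = source.read() if cache else None
    estimate = 0.0
    for _ in range(iterations):
        data = cached if cache else source.read()
        # Move the estimate halfway toward the data mean on each pass.
        estimate += 0.5 * (sum(data) / len(data) - estimate)
    return estimate


def single_pass_sum(source):
    """Aggregation job: the data is touched exactly once."""
    return sum(source.read())


if __name__ == "__main__":
    uncached = ScanCountingSource(range(1, 101))
    cached = ScanCountingSource(range(1, 101))
    iterative_estimate(uncached, iterations=10, cache=False)
    iterative_estimate(cached, iterations=10, cache=True)
    print("scans without cache:", uncached.scans)
    print("scans with cache:", cached.scans)

    once = ScanCountingSource(range(1, 101))
    single_pass_sum(once)
    print("scans for single-pass aggregation:", once.scans)
```

In this model the iterative job reads the source once per iteration unless the data is kept in memory, while the single-pass aggregation reads it once either way, so caching gains nothing there. In Spark the corresponding step would be an explicit cache() or persist() on the data set that is reused.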

So it is important to understand what your users want to do.

Then, there is a lot of benchmark data on the web to compare against. However, I always recommend
generating or using data yourself that matches the data your company is using. Also keep in mind
that time is needed to convert this data into an efficient format.

> On 10 Feb 2017, at 20:36, Saikat Kanjilal <> wrote:
> Folks,
> I'm embarking on a project to build a POC around spark sql. I was wondering if anyone
> has experience in comparing spark sql with hive or interactive hive, and data points around
> the types of queries suited for both. I am naively assuming that spark sql will beat hive
> in all queries given that computations are mostly done in memory, but want to hear some more
> data points around queries that may be problematic in spark-sql. Also, are there debugging
> tools people ordinarily use with spark-sql to troubleshoot perf-related issues?
> I look forward to hearing from the community.
> Regards
