hive-user mailing list archives

From sjayatheertha <>
Subject Re: Spark performance for small queries
Date Thu, 22 Jan 2015 18:00:44 GMT
I'm not answering your question, but could you give me more insight into where and how you
use Spark? I know that Spark has in-memory capabilities.

Also, I have a similar question about ways to optimize Hive queries and file storage. Which is
better, ORC or Parquet, and when should compression be used?

> On Jan 22, 2015, at 3:03 AM, "Saumitra Shahapure (Vizury)" <>
> Hello,
> We were comparing the performance of some of our production queries between Hive and
> Spark. We compared Hive (0.13) + Hadoop (1.2.1) against both Spark 0.9 and 1.1, and
> the performance gains with Spark have been good.
> We tried a very simple query, 
> select count(*) from T where col3=123 
> in both Spark SQL and Hive (with and found that Spark was 2x faster than Hive
> (Hive 120 sec vs. Spark 60 sec). Table T is stored in S3 and consists of a single
> 600 MB GZIP file.
> My question is, why is Spark faster than Hive here? In both cases, the file will
> be downloaded, uncompressed, and its lines counted by a single process. In the Hive case,
> the reducer will be an identity function since is true.
> Note that disk spills and network I/O are very low in Hive's case as well,
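
To make the single-process scan described above concrete, here is a minimal sketch of what either engine effectively has to do for this query: stream through one gzip file and count the rows where the third column matches. It is only an illustration, not how Hive or Spark implement it. The function name, the tab delimiter, and the assumption that col3 is the third field are all hypothetical, since the table layout is not given in the thread.

```python
import gzip
import io


def count_matching_rows(gz_bytes, col_index=2, value="123", delimiter="\t"):
    """Count rows whose field at col_index equals value.

    Assumptions (not from the original message): rows are newline-terminated,
    fields are tab-delimited, and col3 is the third field (index 2).
    """
    count = 0
    # A gzip stream has to be decompressed sequentially from the start,
    # so the whole file is read by one process, one line at a time.
    with gzip.open(io.BytesIO(gz_bytes), mode="rt", encoding="utf-8") as lines:
        for line in lines:
            fields = line.rstrip("\n").split(delimiter)
            if len(fields) > col_index and fields[col_index] == value:
                count += 1
    return count


# Tiny in-memory stand-in for the 600 MB file in S3.
sample = gzip.compress(b"a\tb\t123\nc\td\t456\ne\tf\t123\n")
print(count_matching_rows(sample))
```

Since the scan itself is the same work in both systems, a difference in wall-clock time would have to come from somewhere around it, for example job startup and scheduling.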
