hive-user mailing list archives

From Gopal V <gop...@apache.org>
Subject Re: Spark performance for small queries
Date Thu, 22 Jan 2015 23:31:44 GMT
On 1/22/15, 3:03 AM, Saumitra Shahapure (Vizury) wrote:
> We were comparing performance of some of our production hive queries
> between Hive and Spark. We compared Hive(0.13)+hadoop (1.2.1) against both
> Spark 0.9 and 1.1. We could see that the performance gains have been good
> in Spark.

Is there any particular reason you are using an ancient & slow 
Hadoop-1.x version instead of a modern Hadoop 2.x cluster running YARN?

> We tried a very simple query,
> select count(*) from T where col3=123
> in both sparkSQL and Hive (with hive.map.aggr=true) and found that Spark
> performance had been 2x better than Hive (120sec vs 60sec). Table T is
> stored in S3 and contains 600MB single GZIP file.

You may not realize it, but what you're doing is one of the worst 
cases for both platforms.

Using a single big gzip file is a massive anti-pattern: gzip is not 
splittable, so the whole 600MB file gets read by a single task no matter 
how many nodes you have.

I'm assuming what you want is fast SQL in Hive (since this is the hive 
list), along with the windowing functions like lead/lag that come with it.

You need a SQL-oriented columnar format like ORC, running on YARN with 
Tez; with that combination this query should take somewhere near 10-12 
seconds.

Oh, and that's a ball-park figure for a single node.
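
As a rough sketch of that setup in HiveQL (not a tuned recipe): the table 
name T_orc is hypothetical, and the SET line assumes Tez is already 
installed and configured on the cluster. It copies the gzip-backed table 
into ORC once, then runs the same count against the ORC copy.

```sql
-- One-time conversion: rewrite the gzip text table T as an ORC table.
-- T_orc is a hypothetical name for the ORC copy.
CREATE TABLE T_orc STORED AS ORC AS
SELECT * FROM T;

-- Assumes Tez is installed; otherwise Hive stays on the default MapReduce engine.
SET hive.execution.engine=tez;

-- Same query as before, now reading only the col3 column from ORC.
SELECT count(*) FROM T_orc WHERE col3 = 123;
```

The conversion itself still has to read the single gzip file in one task, 
so it is a cost you pay once rather than on every query.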

Cheers,
Gopal
