Message-ID: <54C18860.5060703@apache.org>
Date: Thu, 22 Jan 2015 15:31:44 -0800
From: Gopal V <gopalv@apache.org>
To: "Saumitra Shahapure (Vizury)" <saumitra.shahapure@vizury.com>,
 user@hive.apache.org
Subject: Re: Spark performance for small queries

On 1/22/15, 3:03 AM, Saumitra Shahapure (Vizury) wrote:
> We were comparing performance of some of our production hive queries
> between Hive and Spark. We compared Hive(0.13)+hadoop (1.2.1) against both
> Spark 0.9 and 1.1. We could see that the performance gains have been good
> in Spark.

Is there any particular reason you are using an ancient and slow
Hadoop-1.x version instead of a modern Hadoop 2.x cluster running YARN?

> We tried a very simple query,
> select count(*) from T where col3=123
> in both sparkSQL and Hive (with hive.map.aggr=true) and found that Spark
> performance had been 2x better than Hive (120sec vs 60sec). Table T is
> stored in S3 and contains 600MB single GZIP file.

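Not sure if you realize it, but what you're doing is one of the worst
cases for both platforms.

Using a single big gzip file is a massive anti-pattern. Gzip is not
splittable, so the whole 600MB file gets read by a single task no matter
how many nodes you have.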

I'm assuming what you want is fast SQL in Hive (since this is the hive 
list) along with all the other lead/lag functions there.

You need a SQL-oriented columnar format like ORC. Mix that with YARN and
add Tez, and this query should land somewhere near 10-12 seconds.

Oh, and that's a ball-park figure for a single node.
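
For what it's worth, here is a rough sketch of that setup in HiveQL.
T_orc is a hypothetical name for an ORC copy of your table, and this
assumes Tez is already installed on a Hadoop 2.x/YARN cluster (it will
not run on Hadoop 1.x). Treat it as a starting point, not a tuned config:

```sql
-- One-time conversion: rewrite the gzip-backed table as ORC.
-- T_orc is a hypothetical table name used only for illustration.
CREATE TABLE T_orc STORED AS ORC AS
SELECT * FROM T;

-- Run the session on Tez instead of MapReduce (needs Hive 0.13+ on YARN).
SET hive.execution.engine=tez;

-- Same query as before, now against the columnar copy.
SELECT count(*) FROM T_orc WHERE col3 = 123;
```

The conversion itself still has to read the single gzip file once, so
that step will be slow. The gain shows up on the queries you run
afterwards.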

Cheers,
Gopal