Message-ID: <54C18860.5060703@apache.org>
Date: Thu, 22 Jan 2015 15:31:44 -0800
From: Gopal V
To: "Saumitra Shahapure (Vizury)", user@hive.apache.org
Subject: Re: Spark performance for small queries

On 1/22/15, 3:03 AM, Saumitra Shahapure (Vizury) wrote:

> We were comparing performance of some of our production hive queries
> between Hive and Spark. We compared Hive(0.13)+hadoop (1.2.1) against both
> Spark 0.9 and 1.1. We could see that the performance gains have been good
> in Spark.

Is there any particular reason you are using an ancient & slow
Hadoop-1.x version instead of a modern YARN 2.0 cluster?

> We tried a very simple query,
> select count(*) from T where col3=123
> in both sparkSQL and Hive (with hive.map.aggr=true) and found that Spark
> performance had been 2x better than Hive (120sec vs 60sec). Table T is
> stored in S3 and contains 600MB single GZIP file.

Not sure if you understand that what you're doing is one of the worst
cases for both platforms.

Using a single big gzip file is a massive anti-pattern: gzip is not
splittable, so the whole 600MB file ends up being read by a single task.

I'm assuming what you want is fast SQL in Hive (since this is the hive
list) along with all the other lead/lag functions there.

You need a SQL-oriented columnar format like ORC. Mix that with YARN and
add Tez, and the query is going to run in somewhere near 10-12 seconds.

Oh, and that's a ball-park figure for a single node.

Cheers,
Gopal
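
For illustration, the ORC + Tez suggestion above could be sketched in HiveQL roughly as follows. This is a minimal sketch, not a setup taken from the thread: it assumes Hive 0.13 or later on a Hadoop 2.x (YARN) cluster with Tez installed, and T_orc is a hypothetical table name. T and col3 are the names from the quoted query.

```sql
-- Sketch only: assumes Hive 0.13+ on a Hadoop 2.x (YARN) cluster with Tez installed.
-- T_orc is a hypothetical table name; T and col3 come from the quoted query.

-- One-time conversion from the gzip-backed table to ORC.
-- This step still reads the single gzip file in one task, since gzip is not splittable.
CREATE TABLE T_orc STORED AS ORC AS
SELECT * FROM T;

-- Run the query on Tez instead of MapReduce.
SET hive.execution.engine=tez;

SELECT count(*) FROM T_orc WHERE col3 = 123;
```

The conversion cost is paid once; later queries on T_orc read only the columns they need. Actual timings will depend on the cluster, so the 10-12 second figure should be treated as the ball-park estimate it was given as.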