From: Gopal V
Date: Thu, 22 Jan 2015 18:28:54 -0800
To: chandra Reddy Bogala, user@hive.apache.org
Subject: Re: Spark performance for small queries

On 1/22/15, 4:36 PM, chandra Reddy Bogala wrote:
> My question is related to GZIP files. I am sure a single GZIP file is
> an anti-pattern. Are small gzip files (20 to 50 MB) also an
> anti-pattern? The reason I am asking is that my application collectors
> generate gzip files of that size, so I copy those to HDFS, add them as
> partitions to Hive tables, and run queries every 15 min. In hourly
> jobs, I convert to ORC with aggregations.

That is exactly the best practice for hive-13 and earlier: small
compressed files, converted to columnar storage as part of a periodic
compaction. Your approach works very well, and the reasons below are
valid.

But for 2015, 15 minutes is a lot of time - I assume you really want
something like 15 seconds. Plus, it has moving external parts (the
1-hour and 15-min crons etc.).

There's a more native implementation of that stage-insert-compact idea
in Hive-14, which has a different "streaming ingest" API that allows
you to do inserts into ORC at sub-minute intervals.

https://cwiki.apache.org/confluence/display/Hive/Streaming+Data+Ingest#StreamingDataIngest-StreamingRequirements

You can connect a stream ingestion system like Flume into that
directly, to get sub-minute data availability in ORC.
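To make that concrete, here is a minimal sketch of the kind of table
the streaming ingest API writes into. The table and column names are
invented for illustration; the requirements themselves (ORC storage,
bucketing, transactional=true, DbTxnManager) come from the wiki page
above.

  -- Sketch only: "page_views" and its columns are invented names.
  -- Streaming ingest needs a bucketed ORC table marked transactional:
  CREATE TABLE page_views (
    view_time STRING,
    user_id   BIGINT,
    url       STRING
  )
  PARTITIONED BY (dt STRING)
  CLUSTERED BY (user_id) INTO 8 BUCKETS
  STORED AS ORC
  TBLPROPERTIES ('transactional' = 'true');

  -- Plus the ACID settings in hive-site.xml:
  --   hive.txn.manager = org.apache.hadoop.hive.ql.lockmgr.DbTxnManager
  --   hive.compactor.initiator.on = true
  --   hive.compactor.worker.threads = 1

The client side (Flume's Hive sink, or the hive-hcatalog-streaming API
directly) then writes small transaction batches into those buckets, and
the compactor merges the delta files in the background.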
After those bits are in place, you get to literally pick up the best of
the whole Hadoop/YARN ecosystem and see how all of it works with Hive.

Once you go down that path, you can just move the raw data over Kafka,
pump it through a Storm topology, which accesses HBase via Trident and
persists data into a Hive streaming sink. That is roughly the state of
the art for Hive - 1-2 seconds from raw data to query. You should be
able to find the "hive hbase storm bolt" example in the Hortonworks
trucking demo.

Cheers,
Gopal

> Two reasons I continue to use gzip files:
> 1) I don't know of a way (or there is no way) to convert my CSV files
> to ORC on the client (collector) side - I need MR/Hive to do the
> conversion.
> 2) Because these are small gzips, each file is allocated to one
> mapper, so the data per mapper is almost the split size.
>
> Thanks,
> Chandra
>
> On Fri, Jan 23, 2015 at 5:01 AM, Gopal V wrote:
>
>> On 1/22/15, 3:03 AM, Saumitra Shahapure (Vizury) wrote:
>>
>>> We were comparing the performance of some of our production Hive
>>> queries between Hive and Spark. We compared Hive (0.13) + Hadoop
>>> (1.2.1) against both Spark 0.9 and 1.1. We could see that the
>>> performance gains have been good in Spark.
>>
>> Is there any particular reason you are using an ancient & slow
>> Hadoop-1.x version instead of a modern YARN 2.0 cluster?
>>
>>> We tried a very simple query,
>>>   select count(*) from T where col3=123
>>> in both SparkSQL and Hive (with hive.map.aggr=true) and found that
>>> Spark performance was 2x better than Hive (120 sec vs 60 sec).
>>> Table T is stored in S3 as a single 600 MB GZIP file.
>>
>> Not sure if you realize that what you're doing is one of the worst
>> cases for both platforms.
>>
>> Using a single big gzip file is a massive anti-pattern - gzip is not
>> splittable, so that 600 MB file pins the entire scan to one task.
>>
>> I'm assuming what you want is fast SQL in Hive (since this is the
>> hive list), along with all the other lead/lag functions there.
>>
>> You need a SQL-oriented columnar format like ORC; mix in YARN and
>> add Tez, and that is going to be somewhere near 10-12 seconds.
>>
>> Oh, and that's a ball-park figure for a single node.
>>
>> Cheers,
>> Gopal
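A hedged sketch of the stage-then-compact flow discussed in this
thread - the table names (t_staging, t_orc), columns and paths are all
invented for illustration:

  -- 15-min job: attach each new gzip drop as a partition of a
  -- text-format staging table:
  ALTER TABLE t_staging ADD IF NOT EXISTS
    PARTITION (dt='2015-01-22')
    LOCATION '/data/incoming/2015-01-22';

  -- Hourly job: rewrite the staged data into ORC, running on Tez:
  SET hive.execution.engine=tez;
  INSERT OVERWRITE TABLE t_orc PARTITION (dt='2015-01-22')
  SELECT col1, col2, col3
  FROM t_staging
  WHERE dt='2015-01-22';

  -- Queries then scan splittable, columnar ORC instead of gzip text:
  SELECT count(*) FROM t_orc WHERE col3 = 123;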