hive-user mailing list archives

From Gopal V <>
Subject Re: Spark performance for small queries
Date Fri, 23 Jan 2015 02:28:54 GMT
On 1/22/15, 4:36 PM, chandra Reddy Bogala wrote:

> My question is related to GZIP files. I am sure a single GZIP file is an
> anti-pattern. Are small gzip files (20 to 50 MB) also an anti-pattern? The
> reason I am asking is that my application collectors generate
> gzip files of that size. So I copy those to HDFS, add them as a partition to
> Hive tables, and run queries every 15 min. In hourly jobs, I convert to ORC
> with aggregations.

That is exactly the best practice for hive-13 and earlier. Small files, 
compressed and converted to columnar storage as part of a periodic job.

Your approach works very well and the reasons below are valid.
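
As a rough illustration of the "one small gzip file per mapper" reason quoted below: a gzip stream has to be read from its first byte, so a set of small files can each go to its own task, while one big file cannot be divided the same way. The sketch is plain Python, not Hive or MapReduce code. The helper names (write_gzip_csv, count_matches, count_all, demo), the file names, and the sample rows are all made up for illustration; only the filter col3=123 comes from the thread.

```python
import csv
import gzip
import os
import tempfile
from concurrent.futures import ThreadPoolExecutor


def write_gzip_csv(path, rows):
    """Write rows (lists of values) to a gzip-compressed CSV file."""
    with gzip.open(path, "wt", newline="") as f:
        csv.writer(f).writerows(rows)


def count_matches(path, col_index=2, value="123"):
    """One 'mapper': read a whole gzip file from its start and count matching rows."""
    with gzip.open(path, "rt", newline="") as f:
        return sum(1 for row in csv.reader(f) if row[col_index] == value)


def count_all(paths, workers=4):
    """Fan out one task per file, then add up the partial counts."""
    with ThreadPoolExecutor(max_workers=workers) as pool:
        return sum(pool.map(count_matches, paths))


def demo():
    """Build four small gzip files (hypothetical data) and count col3 == 123."""
    with tempfile.TemporaryDirectory() as d:
        paths = []
        for i in range(4):
            path = os.path.join(d, f"part-{i}.csv.gz")
            # Ten rows per file; the third column is 123 on rows 0 and 5 only.
            rows = [[i, n, 123 if n % 5 == 0 else 7] for n in range(10)]
            write_gzip_csv(path, rows)
            paths.append(path)
        return count_all(paths)
```

Calling demo() counts the matching rows across the four files, one task per file. With a single large gzip file there would be only one path to hand out, so the same fan-out would have nothing to spread across workers.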

And for 2015, 15 minutes is a lot of time - assume you want something 
like 15 seconds. Plus, it has moving external parts (1 hour and 15 min 
crons etc).

There's a more native implementation of that stage-insert-compact idea 
in Hive-14.

Hive-14 has a different "streaming ingest" which allows you to do 
inserts into ORC at sub-minute intervals.

You can connect a stream ingestion like Flume into that directly, to get 
the sub-minute data availability in ORC.

After those bits are in place, then you get to literally pick up the 
best of the whole Hadoop/YARN ecosystem and see how all of them work 
with Hive.

Once you go down that path, you can just move the raw data over Kafka, 
pump it through a Storm topology, which accesses HBase via 
Trident and persists the data into a Hive Streaming sink.

That is roughly the state of the art for Hive - 1-2 seconds from raw 
data to query.

You should be able to find the "hive hbase storm bolt" example in the 
hortonworks trucking demo.


> Two reasons I continue to use gzip files: 1) I don't know of a way (or there
> is none) to convert my CSV files to ORC on the client (collector) side; I have
> to use MR/Hive to convert. 2) Because these are small gzips, each file is
> allocated to one mapper, so the data per mapper is almost the split size.
> Thanks,
> Chandra
> On Fri, Jan 23, 2015 at 5:01 AM, Gopal V <> wrote:
>> On 1/22/15, 3:03 AM, Saumitra Shahapure (Vizury) wrote:
>>> We were comparing performance of some of our production hive queries
>>> between Hive and Spark. We compared Hive(0.13)+hadoop (1.2.1) against both
>>> Spark 0.9 and 1.1. We could see that the performance gains have been good
>>> in Spark.
>> Is there any particular reason you are using an ancient & slow Hadoop-1.x
>> version instead of a modern YARN 2.0 cluster?
>>> We tried a very simple query,
>>> select count(*) from T where col3=123
>>> in both sparkSQL and Hive (with and found that Spark
>>> performance had been 2x better than Hive (120sec vs 60sec). Table T is
>>> stored in S3 and contains 600MB single GZIP file.
>> Not sure if you understand that what you're doing is one of the worst
>> cases for both the platforms.
>> Using a big single gzip file is like a massive anti-pattern.
>> I'm assuming what you want is fast SQL in Hive (since this is the hive
>> list) along with all the other lead/lag functions there.
>> You need a SQL oriented columnar format like ORC, mix with YARN and add
>> Tez, that is going to be somewhere near 10-12 seconds.
>> Oh, and that's a ball-park figure for a single node.
>> Cheers,
>> Gopal
