From: Gopal V
Date: Thu, 22 Jan 2015 18:28:54 -0800
To: chandra Reddy Bogala, user@hive.apache.org
Subject: Re: Spark performance for small queries

On 1/22/15, 4:36 PM, chandra Reddy Bogala wrote:
> My question is related to GZIP files. I am sure a single GZIP file is
> an anti-pattern. Are small gzip files (20 to 50 MB) also an
> anti-pattern? The reason I am asking is that my application collectors
> generate gzip files of that size, so I copy those to HDFS, add them as
> partitions to Hive tables, and run queries every 15 min. In hourly
> jobs, I convert to ORC with aggregations.

That is exactly the best practice for hive-13 and earlier: small
compressed files, converted to columnar storage as part of a periodic
compaction. Your approach works very well, and the reasons below are
valid.

But for 2015, 15 minutes is a lot of time - I assume you really want
something like 15 seconds. Plus, it has moving external parts (the
1-hour and 15-min crons etc.).

There's a more native implementation of that stage-insert-compact idea
in Hive-14, which has a different "streaming ingest" API that allows
you to do inserts into ORC at sub-minute intervals.

https://cwiki.apache.org/confluence/display/Hive/Streaming+Data+Ingest#StreamingDataIngest-StreamingRequirements

You can connect a stream ingestion system like Flume into that
directly, to get sub-minute data availability in ORC.
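To make that concrete, here is a minimal sketch of the kind of table
the streaming ingest API writes into. The table and column names are
invented for illustration; the requirements themselves (ORC storage,
bucketing, transactional=true, DbTxnManager) come from the wiki page
above.

  -- Sketch only: "page_views" and its columns are invented names.
  -- Streaming ingest needs a bucketed ORC table marked transactional:
  CREATE TABLE page_views (
    view_time STRING,
    user_id   BIGINT,
    url       STRING
  )
  PARTITIONED BY (dt STRING)
  CLUSTERED BY (user_id) INTO 8 BUCKETS
  STORED AS ORC
  TBLPROPERTIES ('transactional' = 'true');

  -- Plus the ACID settings in hive-site.xml:
  --   hive.txn.manager = org.apache.hadoop.hive.ql.lockmgr.DbTxnManager
  --   hive.compactor.initiator.on = true
  --   hive.compactor.worker.threads = 1

The client side (Flume's Hive sink, or the hive-hcatalog-streaming API
directly) then writes small transaction batches into those buckets, and
the compactor merges the delta files in the background.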
After those bits are in place, you get to literally pick up the best of
the whole Hadoop/YARN ecosystem and see how all of it works with Hive.

Once you go down that path, you can just move the raw data over Kafka,
pump it through a Storm topology, which accesses HBase via Trident and
persists data into a Hive streaming sink. That is roughly the state of
the art for Hive - 1-2 seconds from raw data to query. You should be
able to find the "hive hbase storm bolt" example in the Hortonworks
trucking demo.

Cheers,
Gopal

> Two reasons I continue to use gzip files:
> 1) I don't know of a way (or there is no way) to convert my CSV files
> to ORC on the client (collector) side - I need MR/Hive to do the
> conversion.
> 2) Because these are small gzips, each file is allocated to one
> mapper, so the data per mapper is almost the split size.
>
> Thanks,
> Chandra
>
> On Fri, Jan 23, 2015 at 5:01 AM, Gopal V wrote:
>
>> On 1/22/15, 3:03 AM, Saumitra Shahapure (Vizury) wrote:
>>
>>> We were comparing the performance of some of our production Hive
>>> queries between Hive and Spark. We compared Hive (0.13) + Hadoop
>>> (1.2.1) against both Spark 0.9 and 1.1. We could see that the
>>> performance gains have been good in Spark.
>>
>> Is there any particular reason you are using an ancient & slow
>> Hadoop-1.x version instead of a modern YARN 2.0 cluster?
>>
>>> We tried a very simple query,
>>>   select count(*) from T where col3=123
>>> in both SparkSQL and Hive (with hive.map.aggr=true) and found that
>>> Spark performance was 2x better than Hive (120 sec vs 60 sec).
>>> Table T is stored in S3 as a single 600 MB GZIP file.
>>
>> Not sure if you realize that what you're doing is one of the worst
>> cases for both platforms.
>>
>> Using a single big gzip file is a massive anti-pattern - gzip is not
>> splittable, so that 600 MB file pins the entire scan to one task.
>>
>> I'm assuming what you want is fast SQL in Hive (since this is the
>> hive list), along with all the other lead/lag functions there.
>>
>> You need a SQL-oriented columnar format like ORC; mix in YARN and
>> add Tez, and that is going to be somewhere near 10-12 seconds.
>>
>> Oh, and that's a ball-park figure for a single node.
>>
>> Cheers,
>> Gopal
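A hedged sketch of the stage-then-compact flow discussed in this
thread - the table names (t_staging, t_orc), columns and paths are all
invented for illustration:

  -- 15-min job: attach each new gzip drop as a partition of a
  -- text-format staging table:
  ALTER TABLE t_staging ADD IF NOT EXISTS
    PARTITION (dt='2015-01-22')
    LOCATION '/data/incoming/2015-01-22';

  -- Hourly job: rewrite the staged data into ORC, running on Tez:
  SET hive.execution.engine=tez;
  INSERT OVERWRITE TABLE t_orc PARTITION (dt='2015-01-22')
  SELECT col1, col2, col3
  FROM t_staging
  WHERE dt='2015-01-22';

  -- Queries then scan splittable, columnar ORC instead of gzip text:
  SELECT count(*) FROM t_orc WHERE col3 = 123;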