hive-user mailing list archives

From Zheng Shao <>
Subject Re: hive performance
Date Wed, 15 Apr 2009 21:56:51 GMT
Hi Javateck,

In the compilation stage, Hive does get all partition names and tests whether
each one passes the WHERE condition.
The time spent on this step is linear in the number of partitions, although
each test should take very little time.

Otherwise, there is no difference (as you expected, Hive just submits the
files in the partitions that matter to Hadoop).
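
As a rough illustration of the compile-time step described above, the per-partition test can be sketched in Python. This is not Hive's actual code: the function name `prune_partitions`, the predicate, and the hourly partition names are assumptions made up for the example.

```python
from typing import Callable, List


def prune_partitions(partition_names: List[str],
                     predicate: Callable[[str], bool]) -> List[str]:
    """Return only the partitions whose name passes the WHERE-style test.

    Every partition name is tested once, so the cost of this step grows
    linearly with the number of partitions, even though each test is cheap.
    """
    return [name for name in partition_names if predicate(name)]


# Hypothetical hourly partitions for one day, named like 2009-04-15-09.
partitions = [f"2009-04-15-{hour:02d}" for hour in range(24)]

# Stand-in for a WHERE condition that selects a single hour.
selected = prune_partitions(partitions, lambda name: name == "2009-04-15-09")
print(selected)
```

Only the partitions in `selected` would then be handed to Hadoop; the ones that fail the test cost a comparison each at compile time and nothing afterwards.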


On Wed, Apr 15, 2009 at 2:01 PM, javateck javateck wrote:

> In my design, it works as follows:
> 1. Every hour I generate an hourly data set and load it into a Hive table,
> around 3GB per hour at peak time. I break it into 65MB chunks, and my
> partitions are named like 2009-04-15-09, so there are 24 partitions per
> day and 8760 partitions per year.
> 2. We need to keep up to 1 year of raw data, and I'll run around 25 queries
> on an hourly basis. Of course, a run could take more than one hour depending
> on the data size, but I hope it can catch up during off-peak time.
> 3. Currently I'm using JDBC to connect to the Hive standalone server. Even
> though it does not support multi-threading yet, that should be there soon
> (in a few weeks, according to the forum), so I need to run the queries
> sequentially for now and will change to multi-threading later on.
> I'm doing stress testing now, and it seems to run faster sometimes and
> slower at other times. Right now I have around 15 partitions, and it runs
> much slower than with just a few partitions. Of course I have made some code
> changes in between, but they should not affect this, since loading data into
> Hadoop and running Hive queries are separate. I need to look into why it's
> getting slower.
> thanks,
> On Wed, Apr 15, 2009 at 1:42 PM, Stephen Corona wrote:
>> I am using Hive.
>> How many partitions do you have? In my setup, I am using partitions as
>> well. Each partition has 24 files, about 500MB each (so ~12GB per partition)
>> Steve
>> ________________________________________
>> From: javateck javateck []
>> Sent: Wednesday, April 15, 2009 4:28 PM
>> To:
>> Subject: Re: hive performance
>> Thanks, Stephen. Are you using Hadoop directly or using Hive?
>> I did not make the question clear in my last email. I have Hive
>> partitions, each with around 100 files of 65MB each. When I
>> query, I query only a specific partition. Previously, when I had fewer
>> partitions, it ran faster, but as the number of partitions grows, it takes
>> longer. I don't think partitioning plays a role here, since for the
>> MapReduce job I guess Hive submits only the specific partition to
>> Hadoop. I still need to look into other possible areas, but I just want to
>> run this quickly past the forum to see if anyone else has a similar
>> situation and could shed some light on it.
>> On Wed, Apr 15, 2009 at 1:09 PM, Stephen Corona wrote:
>> Hi,
>> I'm not sure what kind of performance numbers you are looking for, but I
>> figured I would toss in a data point:
>> On a 10-node large EC2 cluster w/ EBS volumes, it takes me about 10
>> minutes to crunch through 300GB of CSV data (120 million records).
>> DFS Replication = 1
>> Block Size = 128MB
>> Max Mappers = 60
>> ________________________________________
>> From: javateck javateck [<>]
>> Sent: Wednesday, April 15, 2009 4:00 PM
>> To:<>
>> Subject: hive performance
>> Hi,
>>  I want to check: if the Hive data in a table grows huge (for example to
>> 200GB), does anybody see MapReduce performance degrade a lot? I have not
>> isolated the factors yet, but just want to check first.
>>  thanks,

