hive-user mailing list archives

Site index · List index
Message view « Date » · « Thread »
Top « Date » · « Thread »
From Peter Marron <>
Subject RE: Partition performance
Date Thu, 04 Jul 2013 09:25:34 GMT
Sorry, just caught up with the last couple of day’s email and I feel that this question
has already been answered fairly comprehensively. Apologies.


From: Peter Marron []
Sent: 04 July 2013 08:37
Subject: RE: Partition performance


Just to check that I understand this problem, my reading suggests that the overhead of
many partitions is currently unavoidable. Specifically this means that any query on a table
that has, let’s say, 10,000 partitions
will be significantly slower (than on un-partitioned table with the “same” data) even
the query explicitly specifies a single partition.
(I mean I _could_ actually do the experiments myself…)



From: Owen O'Malley []
Sent: 02 July 2013 15:52
Subject: Re: Partition performance

On Tue, Jul 2, 2013 at 2:34 AM, Peter Marron <<>>
Hi Owen,

I’m curious about this advice about partitioning. Is there some fundamental reason why Hive
is slow when the number of partitions is 10,000 rather than 1,000?

The precise numbers don't matter. I wanted to give people a ballpark range that they should
be looking at. Most tables at 1000 partitions won't cause big slow downs, but the cost scales
with the number of partitions. By the time you are at 10,000 the cost is noticeable. I have
one customer who has a table with 1.2 million partitions. That causes a lot of slow downs.

And the improvements
that you mention are they going to be in version 12? Is there a JIRA raised so that I can
track them?
(It’s not currently a problem for me but I can see that I am going to need to be able to
explain the situation.)

I think this is the one they will use:

-- Owen
View raw message