hive-user mailing list archives

From John Omernik <j...@omernik.com>
Subject Fwd: Files Per Partition Causing Slowness
Date Tue, 02 Dec 2014 20:01:19 GMT
---------- Forwarded message ----------
From: John Omernik <john@omernik.com>
Date: Tue, Dec 2, 2014 at 1:58 PM
Subject: Re: Files Per Partition Causing Slowness
To: user@hive.apache.org


Thank you Edward, I knew the number of partitions mattered, but I
didn't think 1000 would be too much. However, I didn't realize the
number of files per partition was also a factor prior to job submission.
I am looking at reducing some of those now too.

Out of curiosity, if I have a per-day partition for three years of
data, how would I set up bucketing to keep my partition count lower? I
am struggling to find a way to approach this problem.


Thanks!
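
To put numbers on the question above, here is a minimal sketch of how the partition count changes with partition granularity over a three-year span. The date range (2012-01-01 through 2014-12-31) and the helper name `count_partitions` are assumptions for illustration only; they are not taken from the thread, and the sketch does not touch Hive itself. Coarser partitions reduce metastore lookups but mean a single-day query reads more data, so this is a trade-off to weigh rather than a recommendation.

```python
from datetime import date, timedelta


def count_partitions(start, end, granularity):
    """Count distinct partition keys needed to cover start..end inclusive.

    Hypothetical helper: granularity is "day", "month", or "year".
    """
    keys = set()
    day = start
    while day <= end:
        if granularity == "day":
            keys.add(day.isoformat())
        elif granularity == "month":
            keys.add((day.year, day.month))
        elif granularity == "year":
            keys.add(day.year)
        else:
            raise ValueError("unknown granularity: " + granularity)
        day += timedelta(days=1)
    return len(keys)


# Assumed three-year window; the thread does not give the actual dates.
START = date(2012, 1, 1)
END = date(2014, 12, 31)

if __name__ == "__main__":
    for granularity in ("day", "month", "year"):
        print(granularity, count_partitions(START, END, granularity))
```

Under these assumed dates, only monthly or yearly partitioning stays under the 100-partition guideline mentioned in the quoted reply below.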

On Tue, Dec 2, 2014 at 12:28 PM, John Omernik <john@omernik.com> wrote:
>
> Thank you Edward, I knew the number of partitions mattered, and knew I was
> getting high; however, I didn't realize the number of files per partition
> was also a factor prior to job submission.
>
> Thanks!
>
> John
>
> On Tue, Dec 2, 2014 at 11:35 AM, Edward Capriolo <edlinuxguru@gmail.com> wrote:
>>
>> This is discussed in the Programming Hive book. The more files, the longer
>> it takes the job tracker to plan the job. The more tasks, the more things
>> the job tracker has to track. The more partitions, the more metastore
>> lookups are required. All of these things limit throughput. I do not like
>> tables with more than 100 partitions; above that I would switch to
>> bucketing or some other mechanism (application-level partitioning).
>>
>> On Tue, Dec 2, 2014 at 12:25 PM, John Omernik <john@omernik.com> wrote:
>>>
>>> I am running Hive 0.12 in production. I have a table that has 1100
>>> partitions (flat, no multi-level partitions), and of those partitions
>>> some have a small number of files (5-10) and others have quite a few
>>> files (up to 120). The total table size is not "huge", around 285 GB.
>>>
>>> While this is not terrible to my eyes, when I try to run a query on lots
>>> of partitions, say all 1100, the time from query start to the time the
>>> query is submitted to the jobtracker is horribly slow. For example, it
>>> can take up to 3.5 minutes just to get to the point where the job is
>>> seen in the job tracker.
>>> Is the number of files here what's hurting me? Is there some sort of
>>> per-file enumeration going on under the hood in Hive? I ran Hive with
>>> debug mode on and saw lots of file calls for each individual file. I
>>> guess I am curious whether others out there have similar tables: would a
>>> query like that take a horribly long time for you as well? Is this
>>> "normal" or am I seeing issues here?
>>>
>>>
>>
>
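
The figures in the original question (1100 partitions, roughly 5 to 120 files each, about 285 GB in total) allow a rough bound on how many files a full-table query has to enumerate before job submission. The sketch below is only that arithmetic; the function names are hypothetical, and it assumes every partition holds between 5 and 120 files, which the thread implies but does not state exactly.

```python
def file_count_bounds(partitions, min_files, max_files):
    """Return (lower, upper) bounds on the total number of files in the table."""
    return partitions * min_files, partitions * max_files


def avg_file_mb(table_gb, file_count):
    """Average file size in MB for a table of table_gb spread over file_count files."""
    return table_gb * 1024 / file_count


# Figures quoted in the thread.
PARTITIONS = 1100
MIN_FILES, MAX_FILES = 5, 120
TABLE_GB = 285

if __name__ == "__main__":
    lower, upper = file_count_bounds(PARTITIONS, MIN_FILES, MAX_FILES)
    print("files to enumerate: between", lower, "and", upper)
    print("average file size at the upper bound (MB):",
          round(avg_file_mb(TABLE_GB, upper), 2))
    print("average file size at the lower bound (MB):",
          round(avg_file_mb(TABLE_GB, lower), 2))
```

Even at the lower bound this is thousands of files for a 285 GB table, which is consistent with the per-file calls seen in the debug output.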
