hive-user mailing list archives

From Nitin Pawar <>
Subject Re: Handling hierarchical data in Hive
Date Tue, 25 Mar 2014 11:52:19 GMT
In general, when you have a large number of partitions, your Hive query
performance drops. This has been significantly addressed in current
releases, but I still see performance issues. Sadly, I currently do not
have a large enough dataset that needs a large number of partitions.

The last time I checked, this issue was caused
by ObjectStore.getPartitionsByNames. I am not sure whether the current
implementation is still the same.

When you have a large number of partitions, the actual time spent on query
planning increases.
One way was to set

You can also check the value of datanucleus.connectionPool.maxActive in
your Hive config and see whether you can increase the number of
connections to your metastore DB.

We normally used to merge historical data into a single partition
column and then, if required, do a join between the new data set and the old
data sets: a kind of rolling data table plus a historical data table.
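
A rough HiveQL sketch of that rolling/historical split might look like the
following. All table and column names here (analyze_recent, analyze_history,
payload, dt, month) are hypothetical and not from this thread, and the two
tables are combined with UNION ALL rather than a join, purely for
illustration. Treat it as a starting point under those assumptions, not a
tested setup.

```sql
-- Rolling table: fine-grained partitions, kept only for recent days.
-- (Hypothetical names and schema; adjust to your real columns.)
CREATE EXTERNAL TABLE IF NOT EXISTS analyze_recent (
  payload STRING
)
PARTITIONED BY (dt STRING, generated_by STRING)
LOCATION 's3://analyze/';

-- Static partition registration, one per S3 directory.
ALTER TABLE analyze_recent ADD IF NOT EXISTS
  PARTITION (dt = '20140320', generated_by = '111')
  LOCATION 's3://analyze/20140320/111/';

-- Historical table: one coarse partition per month;
-- dt and generated_by become ordinary columns.
CREATE TABLE IF NOT EXISTS analyze_history (
  payload      STRING,
  dt           STRING,
  generated_by STRING
)
PARTITIONED BY (month STRING);

-- Roll a finished month out of the fine-grained table.
INSERT OVERWRITE TABLE analyze_history PARTITION (month = '201402')
SELECT payload, dt, generated_by
FROM analyze_recent
WHERE dt >= '20140201' AND dt <= '20140228';

-- Then drop the old fine-grained partitions, one day at a time.
-- For an external table this removes metadata only, not the S3 files.
ALTER TABLE analyze_recent DROP IF EXISTS PARTITION (dt = '20140201');

-- A query spanning old and new data reads both tables.
SELECT generated_by, COUNT(*) AS row_count
FROM (
  SELECT generated_by FROM analyze_history WHERE month = '201402'
  UNION ALL
  SELECT generated_by FROM analyze_recent WHERE dt >= '20140301'
) combined
GROUP BY generated_by;
```

The trade-off is that queries on a specific day and generated_by in the
historical table have to scan a whole month partition, so this only helps
if point lookups mostly hit recent data.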

On Tue, Mar 25, 2014 at 4:55 PM, Saumitra Shahapure (Vizury) <> wrote:

> Hi Nitin,
> We are not facing the small files problem, since the data is in S3. Also, we
> do not want to merge files. Merging files or creating one large analyze table
> for, say, one day would slow down queries fired on a specific day and
> *generated_by*.
> Let me explain my problem in other words.
> Right now we are over-partitioning our table. Over-partitioning gives us the
> benefit that a query on 1-2 partitions is very fast. Its side effect is
> that if we try to query a large number of partitions, the query is too slow.
> Is there a way to get good performance in both scenarios?
> --
> Regards,
> Saumitra Shahapure
> On Tue, Mar 25, 2014 at 4:25 PM, Nitin Pawar <> wrote:
>> see if this is what you are looking for
>> On Tue, Mar 25, 2014 at 4:21 PM, Saumitra Shahapure (Vizury) <> wrote:
>>> Hello,
>>> We are using Hive to query S3 data. For one of our tables, named analyze,
>>> we generate data hierarchically. The first level of the hierarchy is the
>>> date and the second level is a field named *generated_by*. E.g., for
>>> 20 March we may have S3 directories such as
>>> s3://analyze/20140320/111/
>>> s3://analyze/20140320/222/
>>> s3://analyze/20140320/333/
>>> The size of the files in each folder is typically small.
>>> Until now we have been using static partitioning so that queries on
>>> a specific date and *generated_by* would be faster.
>>> The problem now is that the number of *generated_by* folders has increased
>>> to 1000s. Every day we end up adding 1000s of partitions to Hive, so
>>> queries on analyze over one month have slowed down.
>>> Is there any way to get rid of partitions and at the same time maintain
>>> good performance for queries which are fired on a specific day and
>>> *generated_by*?
>>> --
>>> Regards,
>>> Saumitra Shahapure
>> --
>> Nitin Pawar

Nitin Pawar
