hive-user mailing list archives

From Nitin Pawar
Subject Re: Handling hierarchical data in Hive
Date Tue, 25 Mar 2014 10:55:52 GMT
see if this is what you are looking for

On Tue, Mar 25, 2014 at 4:21 PM, Saumitra Shahapure (Vizury) wrote:

> Hello,
> We are using Hive to query S3 data. For one of our tables, named analyze,
> we generate data hierarchically. The first level of the hierarchy is the
> date and the second level is a field named *generated_by*. For example,
> for 20 March we may have the following S3 directories:
> s3://analyze/20140320/111/
> s3://analyze/20140320/222/
> s3://analyze/20140320/333/
> The files in each folder are typically small.
> Until now we have been using static partitioning so that queries on a
> specific date and *generated_by* would be faster.
> The problem now is that the number of *generated_by* folders has grown
> into the thousands, so every day we end up adding thousands of partitions
> to Hive. As a result, queries on analyze over one month have slowed down.
> Is there any way to get rid of the partitions and at the same time
> maintain good performance for queries that target a specific day and
> *generated_by*?
> --
> Regards,
> Saumitra Shahapure
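
For reference, the static-partition setup described in the quoted message might look roughly like the HiveQL below. This is only a sketch: the column `payload` and the partition key name `dt` are assumptions made for illustration. Only the table name, the *generated_by* field, and the S3 paths come from the message.

```sql
-- Sketch of the layout described above. `payload` and `dt` are hypothetical names.
-- The table name is backquoted because ANALYZE is also a Hive keyword.
CREATE EXTERNAL TABLE `analyze` (
  payload STRING                      -- placeholder for the real columns
)
PARTITIONED BY (dt STRING, generated_by STRING)
LOCATION 's3://analyze/';

-- One statically added partition per date/generated_by directory.
ALTER TABLE `analyze` ADD IF NOT EXISTS
  PARTITION (dt = '20140320', generated_by = '111')
  LOCATION 's3://analyze/20140320/111/';

-- A query on a specific day and generated_by only reads that one directory.
SELECT COUNT(*)
FROM `analyze`
WHERE dt = '20140320' AND generated_by = '111';
```

With thousands of *generated_by* values per day, this scheme registers one partition per directory, which is the growth in partition count the message describes.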

Nitin Pawar
