hive-user mailing list archives

From Nitin Pawar <nitinpawar...@gmail.com>
Subject Re: Handling hierarchical data in Hive
Date Tue, 25 Mar 2014 10:55:52 GMT
See if this is what you are looking for: https://github.com/sskaje/hive_merge




On Tue, Mar 25, 2014 at 4:21 PM, Saumitra Shahapure (Vizury) <
saumitra.shahapure@vizury.com> wrote:

> Hello,
>
> We are using Hive to query S3 data. For one of our tables, named analyze,
> we generate data hierarchically. The first level of the hierarchy is the
> date and the second level is a field named *generated_by*. E.g., for
> 20 March we may have S3 directories such as
> s3://analyze/20140320/111/
> s3://analyze/20140320/222/
> s3://analyze/20140320/333/
> The files in each folder are typically small.
>
> Until now we have been using static partitioning so that queries on
> a specific date and *generated_by* would be faster.
>
> The problem now is that the number of *generated_by* folders has grown
> into the thousands. Every day we end up adding thousands of partitions to
> Hive, so queries on analyze that span one month have slowed down.
>
> Is there any way to get rid of the partitions while still maintaining
> good performance for queries that target a specific day and
> *generated_by*?
> --
> Regards,
> Saumitra Shahapure
>
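
For context, the static partitioning described in the quoted message amounts to registering one Hive partition per S3 directory. Below is a minimal Python sketch of that mapping. It assumes the s3://<bucket>/<date>/<generated_by>/ layout from the question. The helper name add_partition_statement and the partition column name dt are hypothetical and do not come from the original table definition.

```python
from urllib.parse import urlparse


def add_partition_statement(s3_path, table="analyze"):
    """Build one ALTER TABLE ... ADD PARTITION statement for an S3 directory.

    Assumes the path looks like s3://<bucket>/<date>/<generated_by>/.
    The partition column name `dt` is a placeholder, not the real schema.
    """
    parsed = urlparse(s3_path)
    # Drop empty segments produced by leading and trailing slashes.
    parts = [p for p in parsed.path.split("/") if p]
    if len(parts) != 2:
        raise ValueError(f"expected <date>/<generated_by> in {s3_path!r}")
    date, generated_by = parts
    # The table name is backtick-quoted defensively, since ANALYZE is also
    # a Hive keyword.
    return (
        f"ALTER TABLE `{table}` ADD IF NOT EXISTS "
        f"PARTITION (dt='{date}', generated_by='{generated_by}') "
        f"LOCATION '{s3_path}';"
    )


if __name__ == "__main__":
    for path in (
        "s3://analyze/20140320/111/",
        "s3://analyze/20140320/222/",
        "s3://analyze/20140320/333/",
    ):
        print(add_partition_statement(path))
```

Each date and *generated_by* pair yields one statement, so thousands of *generated_by* folders per day translate directly into thousands of new partitions per day.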



-- 
Nitin Pawar
