hive-user mailing list archives

From "Saumitra Shahapure (Vizury)" <saumitra.shahap...@vizury.com>
Subject Handling hierarchical data in Hive
Date Tue, 25 Mar 2014 10:51:40 GMT
Hello,

We are using Hive to query S3 data. For one of our tables, named analyze, we
generate data hierarchically. The first level of the hierarchy is the date and
the second level is a field named *generated_by*. For example, for 20 March we
may have S3 directories such as
s3://analyze/20140320/111/
s3://analyze/20140320/222/
s3://analyze/20140320/333/
The files in each folder are typically small.

Until now we have been using static partitioning, so that queries on a
specific date and *generated_by* value are faster.
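
To make the setup concrete, here is a rough sketch of how the partitions for
one day might be registered. It only builds the HiveQL statements as strings.
The partition column name dt, the helper names, and the backtick quoting of
the table name are assumptions made for illustration; the table name and S3
layout are the ones described above.

```python
# Sketch: build one ALTER TABLE ... ADD PARTITION statement per
# (date, generated_by) folder. The column name `dt` is an assumption;
# the actual partition column is not stated in this message.

def add_partition_statement(table, date, generated_by, bucket="s3://analyze"):
    """Return the HiveQL that registers one S3 folder as a static partition."""
    location = f"{bucket}/{date}/{generated_by}/"
    return (
        f"ALTER TABLE `{table}` ADD IF NOT EXISTS "
        f"PARTITION (dt='{date}', generated_by='{generated_by}') "
        f"LOCATION '{location}';"
    )


def daily_statements(table, date, generators):
    """Return one statement per generated_by value for the given date."""
    return [add_partition_statement(table, date, g) for g in generators]


if __name__ == "__main__":
    # The three example folders for 20 March listed above.
    for stmt in daily_statements("analyze", "20140320", ["111", "222", "333"]):
        print(stmt)
```

Each *generated_by* folder becomes its own partition, so the number of
statements issued per day equals the number of folders created that day.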

The problem now is that the number of *generated_by* folders has grown into
the thousands. Every day we end up adding thousands of partitions to Hive, so
queries on analyze that span one month have slowed down.

Is there any way to get rid of the partitions and at the same time maintain
good performance for queries that target a specific day and
*generated_by*?
--
Regards,
Saumitra Shahapure
