hive-user mailing list archives

From Gopal Vijayaraghavan <gop...@apache.org>
Subject Re: Hive/TEZ/Parquet
Date Thu, 15 Dec 2016 21:03:21 GMT
> The partition is by year/month/day/hour/minute. I have two directories - over two years,
> and the total number of records is 50 million.

That's a million partitions with 50 rows in each of them?
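
A quick back-of-the-envelope check of that figure, as a sketch only. It assumes one partition for every minute across two 365-day years; the real table may have gaps, so the actual count could be lower.

```python
# Rough estimate only: assumes every minute in two 365-day years has its own partition.
MINUTES_PER_DAY = 24 * 60

partitions = 2 * 365 * MINUTES_PER_DAY   # year/month/day/hour/minute layout
total_rows = 50_000_000                  # "50 million" from the original question
rows_per_partition = total_rows / partitions

print(partitions)                        # 1051200
print(round(rows_per_partition, 1))      # 47.6
```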

> I am seeing it takes more than 1hr to complete. Any thoughts, on what could be the issue
> or approach that can be taken to improve the performance?

Looks like you have over-partitioned your data massively - the 1 hour might be partly query
planning with a million partitions and the rest might be file-count related overheads.

At least in the case of ORC, I recommend that each partition contain at least 1 GB of data,
and that if you really need to query down to finer levels, you use bloom filters (PARQUET-41
is not fixed yet, so YMMV) + sorted ordering.
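
One way to apply that size guideline is to pick the finest partition layout whose average partition still meets the minimum. The sketch below is illustrative only: `finest_granularity` and `PARTITIONS_PER_YEAR` are hypothetical names, the 1 Gb figure is read as roughly one gigabyte, the total data size is not given in the thread, and data is assumed to be spread evenly over time.

```python
GIB = 1024 ** 3  # assumption: the "1 Gb" guideline is read as about one gigabyte

# Approximate number of partitions per year for each layout, coarse to fine.
PARTITIONS_PER_YEAR = {
    "year": 1,
    "month": 12,
    "day": 365,
    "hour": 365 * 24,
    "minute": 365 * 24 * 60,
}

def finest_granularity(total_bytes, years, min_partition_bytes=GIB):
    """Return the finest layout whose average partition size is still at
    least min_partition_bytes, or None if even yearly partitions are smaller."""
    best = None
    for name, per_year in PARTITIONS_PER_YEAR.items():  # dict keeps coarse-to-fine order
        average_bytes = total_bytes / (per_year * years)
        if average_bytes >= min_partition_bytes:
            best = name
    return best

# Hypothetical sizes, two years of data in each case.
print(finest_granularity(10 * GIB, 2))    # year
print(finest_granularity(100 * GIB, 2))   # month
```

With sizes like these, anything finer than monthly partitions falls well under the guideline, which is why finer-grained lookups are better served by sorting and filters inside the files than by more partition levels.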

http://www.slideshare.net/t3rmin4t0r/data-organization-hive-meetup/4

Cheers,
Gopal


