hive-user mailing list archives

Site index · List index
Message view « Date » · « Thread »
Top « Date » · « Thread »
From Gopal Vijayaraghavan <gop...@apache.org>
Subject Re: Hive/TEZ/Parquet
Date Thu, 15 Dec 2016 22:59:55 GMT
 
> Actually, we don't have that many partitions - there are lot of gaps both in days and
time events as well.

Your partition description sounded a lot like one of the FAQs from Mithun's talks, which is
why I asked

http://www.slideshare.net/Hadoop_Summit/hive-at-yahoo-letters-from-the-trenches/24

> But, I would like to understand when you say " time spent might partly be query planning
with million partitions"? I presume, this is in producing the physical plan? -- does it spend
time in allocating group of partition directories to each map task

Yes, the physical planner is significant overhead, since all map-tasks get a list of all partitions
and match each read against that list (partition schema can evolve, this is closer to O(n^2))
& the split-generation is bottlenecked by the total # of files involved in the operation
(a slow O(n) operation is still slow).

The overhead in maintaining partitions is fairly high & the entire query planning will
try to do something like a du -sh on each partition, if all the basic file statistics are
missing etc.

Also, if you have >100k partitions, disabling the metadata optimizer (hive.optimize.metadataonly=false)
would be a good thing, since that codepath is single threaded, while a compute-heavy full-table
scan is much faster due to parallelism - it might burn more CPU, but it would come back in
less than an hour.

You might want to opt for daily partitions & also run the stats gathering ops with "analyze
table <table> compute statistics partialscan;" & "analyze table <table> compute
statistics for columns" to speed up further queries.

At least in my experience with ORC + valid stats, a query like "select count(*) from table"
should takes <500 milliseconds.

Cheers,
Gopal




Mime
View raw message