hive-dev mailing list archives

From "Sergey Shelukhin (JIRA)" <j...@apache.org>
Subject [jira] [Commented] (HIVE-6157) Fetching column stats slower than the 101 during rush hour
Date Fri, 17 Jan 2014 20:20:19 GMT

    [ https://issues.apache.org/jira/browse/HIVE-6157?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13875227#comment-13875227 ]

Sergey Shelukhin commented on HIVE-6157:
----------------------------------------

Ok, this took rather longer than expected... Initially I tried to make stat fetching part
of partition pruning, but that requires too many API changes all over the place; it can be
added later as an extra optimization if necessary.
The alternative is simple: all stat-fetching calls are batched. The new Thrift APIs use a
req/resp pattern; requests contain the db, table, column list, and partition list (for
partitions). The response contains whatever could be found (rather than the full list with
some nulls, like the old APIs that built lists using individual calls to the metastore). The
code then uses this.
On the metastore side there are both a JDO path and a SQL path for speed.
Also, cleaned up some stuff in StatOptimizer and StatsUtil that was generally suboptimal.
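
For illustration only, the difference between the per-column calls and the batched req/resp
pattern described above can be sketched against a toy in-memory metastore. This is not the
Hive code or the actual Thrift API: FakeMetastore, BatchStatsRequest, BatchStatsResponse and
all method names are hypothetical, and Python is used only to keep the sketch short.

```python
from dataclasses import dataclass, field
from typing import Dict, List, Optional, Tuple


@dataclass(frozen=True)
class ColumnStats:
    column: str
    num_nulls: int


@dataclass
class BatchStatsRequest:  # hypothetical request type
    db: str
    table: str
    columns: List[str]
    partitions: List[str]


@dataclass
class BatchStatsResponse:  # hypothetical response type
    # partition name -> only the stats that were actually found
    stats: Dict[str, List[ColumnStats]] = field(default_factory=dict)


class FakeMetastore:
    """Toy stand-in for a metastore; counts round trips."""

    def __init__(self, stored: Dict[Tuple[str, str, str, str], ColumnStats]):
        self._stored = stored
        self.calls = 0

    def get_column_stats(self, db: str, table: str, partition: str,
                         column: str) -> Optional[ColumnStats]:
        # Old style: one round trip per (partition, column); None when missing.
        self.calls += 1
        return self._stored.get((db, table, partition, column))

    def get_stats_batch(self, req: BatchStatsRequest) -> BatchStatsResponse:
        # New style: one round trip; missing entries are simply left out.
        self.calls += 1
        resp = BatchStatsResponse()
        for part in req.partitions:
            found = [self._stored[(req.db, req.table, part, col)]
                     for col in req.columns
                     if (req.db, req.table, part, col) in self._stored]
            if found:
                resp.stats[part] = found
        return resp


stored = {
    ("default", "t", "p=1", "a"): ColumnStats("a", 0),
    ("default", "t", "p=1", "b"): ColumnStats("b", 3),
    ("default", "t", "p=2", "a"): ColumnStats("a", 1),
}
parts, cols = ["p=1", "p=2", "p=3"], ["a", "b"]

ms = FakeMetastore(stored)
per_column = [ms.get_column_stats("default", "t", p, c) for p in parts for c in cols]
per_column_calls = ms.calls  # one call per partition per column

ms.calls = 0
batched = ms.get_stats_batch(BatchStatsRequest("default", "t", cols, parts))
batched_calls = ms.calls  # a single call
```

In this sketch the per-column path makes len(parts) * len(cols) round trips and returns a
list padded with None, while the batched path makes one round trip and omits anything it
could not find, so the caller has to handle absent partitions or columns itself.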

> Fetching column stats slower than the 101 during rush hour
> ----------------------------------------------------------
>
>                 Key: HIVE-6157
>                 URL: https://issues.apache.org/jira/browse/HIVE-6157
>             Project: Hive
>          Issue Type: Bug
>    Affects Versions: 0.13.0
>            Reporter: Gunther Hagleitner
>            Assignee: Sergey Shelukhin
>
> "hive.stats.fetch.column.stats" controls whether the column stats for a table are fetched
> during explain (in Tez: during query planning). On my setup (1 table 4000 partitions, 24 columns)
> the time spent in semantic analyze goes from ~1 second to ~66 seconds when turning the flag
> on. 65 seconds spent fetching column stats...
> The reason is probably that the APIs force you to make separate metastore calls for each
> column in each partition. That's probably the first thing that has to change. The question
> is if in addition to that we need to cache this in the client or store the stats as a single
> blob in the database to further cut down on the time. However, the way it stands right now
> column stats seem unusable.
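
As a rough back-of-the-envelope check of the figures quoted above, and assuming (as the
description suggests, but does not confirm) exactly one metastore call per column per
partition, the implied call count and per-call cost work out as follows. The variable names
are illustrative only.

```python
partitions = 4000
columns = 24
fetch_seconds = 65.0  # extra time reported with the flag turned on

# Assumption: one metastore call per (partition, column) pair.
calls = partitions * columns
ms_per_call = fetch_seconds / calls * 1000.0

print(f"{calls} calls, about {ms_per_call:.2f} ms each")
# prints: 96000 calls, about 0.68 ms each
```

Under that assumption each call is already cheap; the cost comes from the sheer number of
round trips, which is what batching addresses.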



--
This message was sent by Atlassian JIRA
(v6.1.5#6160)
