hive-dev mailing list archives

From "Ning Zhang (JIRA)" <j...@apache.org>
Subject [jira] [Commented] (HIVE-2050) batch processing partition pruning process
Date Wed, 23 Mar 2011 05:12:06 GMT

    [ https://issues.apache.org/jira/browse/HIVE-2050?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13010001#comment-13010001 ]

Ning Zhang commented on HIVE-2050:
----------------------------------

Note that this patch implements a simple API that takes a list of partition names rather
than a range of partition names. My performance testing indicates that the bottleneck is not in
the JDO query itself. The JDO query that gets the list of all MPartitions takes about
5 secs for a list of 20k partitions. However, converting these 20k MPartitions to Partitions
took about 3 mins, and committing the transaction took another 3 mins.

Note that converting MPartitions to Partitions and committing transactions are common operations.
Even if we use JDO pushdown (HIVE-2048) or range queries, these costs are still there.
We need to optimize these costs away in the next step.
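
A minimal sketch of the list-based flow described above, not the code in the patch. The class
name, method names, and the fetcher callback are hypothetical stand-ins: the predicate plays the
role of Hive's expression evaluation, and the fetcher plays the role of the metastore call that
takes a list of partition names and returns partition objects.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.function.Function;
import java.util.function.Predicate;

public class PartitionPrunerSketch {

    /** Keeps only the partition names accepted by the predicate
     *  (a stand-in for evaluating the partition predicate in Hive). */
    static List<String> pruneNames(List<String> allNames, Predicate<String> keep) {
        List<String> selected = new ArrayList<>();
        for (String name : allNames) {
            if (keep.test(name)) {
                selected.add(name);
            }
        }
        return selected;
    }

    /** Hands the whole list of selected names to the (hypothetical) fetcher in one call,
     *  skipping the call entirely when nothing was selected. */
    static <P> List<P> fetchByNames(List<String> names,
                                    Function<List<String>, List<P>> fetcher) {
        if (names.isEmpty()) {
            return new ArrayList<>();
        }
        return fetcher.apply(names);
    }
}
```

The conversion and commit costs measured above would sit inside the fetcher, so this shape alone
does not remove them.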

> batch processing partition pruning process
> ------------------------------------------
>
>                 Key: HIVE-2050
>                 URL: https://issues.apache.org/jira/browse/HIVE-2050
>             Project: Hive
>          Issue Type: Sub-task
>            Reporter: Ning Zhang
>            Assignee: Ning Zhang
>         Attachments: HIVE-2050.patch
>
>
> For partition predicates that cannot be pushed down to JDO filtering (HIVE-2049), we
> should fall back to the old approach of listing all partition names first and using Hive's expression
> evaluation engine to select the correct partitions. The partition pruner should then hand
> Hive a list of partition names, and Hive should return a list of Partition objects (this should be added
> to the Hive API).
> A possible optimization is for the partition pruner to give Hive a set of ranges
> of partition names (say [ts=01, ts=11], [ts=20, ts=24]), with the JDO query formulated
> as range queries. Range queries are possible because the first step lists all partition names
> in sorted order. It is easy to come up with the ranges, and the JDO range
> query results are guaranteed to be equivalent to those of the query with a list of partition names.
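
The range idea in the description can be sketched as follows. This is an illustration under
stated assumptions, not code from the patch: the class and method names are hypothetical, the
full name list is assumed to be already sorted, and ranges are inclusive on both ends. A range is
closed as soon as an unselected name is seen, so every name between a range's endpoints is a
selected one, which is why the ranges cover exactly the same partitions as the name list.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Set;

public class PartitionRanges {

    /** Inclusive range [low, high] of partition names. */
    static final class Range {
        final String low;
        final String high;

        Range(String low, String high) {
            this.low = low;
            this.high = high;
        }

        @Override
        public String toString() {
            return "[" + low + ", " + high + "]";
        }
    }

    /** Collapses the selected names into maximal runs that are contiguous
     *  in the sorted list of all partition names. */
    static List<Range> toRanges(List<String> sortedAll, Set<String> selected) {
        List<Range> ranges = new ArrayList<>();
        String low = null;   // start of the run currently being built, or null
        String high = null;  // last selected name seen in the current run
        for (String name : sortedAll) {
            if (selected.contains(name)) {
                if (low == null) {
                    low = name;
                }
                high = name;
            } else if (low != null) {
                // An unselected name ends the current run.
                ranges.add(new Range(low, high));
                low = null;
            }
        }
        if (low != null) {
            ranges.add(new Range(low, high));
        }
        return ranges;
    }
}
```

With all names ts=01, ts=05, ts=11, ts=15, ts=20, ts=24 and every name except ts=15 selected,
this yields the two ranges from the description: [ts=01, ts=11] and [ts=20, ts=24].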

--
This message is automatically generated by JIRA.
For more information on JIRA, see: http://www.atlassian.com/software/jira
